Error correction of Illumina reads using Celera Assembler

From mn/bio/cees-bioinf
Revision as of 15:47, 13 June 2013 by Olekto@uio.no (talk | contribs)

It is usually worthwhile to pre-process your data before assembly. An ideal dataset would be error-free and contain only genomic sequence, with no artificial duplicates or chimeric sequences. Removing errors, non-genomic sequence, artificial duplicates and chimeras before assembly usually gives a better assembly than leaving them in, and the assembly process itself often runs faster.
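
To make one of these clean-up steps concrete, here is a minimal Python sketch of removing exact duplicate reads from FASTQ data. It is only an illustration and not how Celera Assembler or any particular tool does it. The function names parse_fastq and drop_exact_duplicates are made up for this example. It assumes well-formed four-line FASTQ records and single (unpaired) reads, and it does not treat reverse complements or near-identical reads as duplicates.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); hypothetical type alias for this sketch
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times groups the lines into records;
    # an incomplete record at the end of the input is silently dropped.
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, qual


def drop_exact_duplicates(records: Iterable[FastqRecord]) -> Iterator[FastqRecord]:
    """Keep the first record for each distinct sequence, drop later exact copies."""
    seen = set()
    for header, seq, qual in records:
        if seq in seen:
            continue
        seen.add(seq)
        yield header, seq, qual
```

With a real file you would pass an open file handle to parse_fastq. Keeping every distinct sequence in memory is fine for a small test, but a full Illumina library would need a more memory-efficient approach.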

Some assemblers, like ALLPATHS-LG, are mostly self-contained and do not need any pre-processing of the data. Celera Assembler is also mostly self-contained, but it is more flexible about which modules are run and when, so you can often pre-process the data however you like before running an assembly on it.

For this how-to, I chose part of the budgerigar dataset that was used in Assemblathon 2 [1]. There is a lot of data for this bird, but I chose only a limited set for this how-to: the Illumina data sequenced by BGI, with insert sizes of 220 bp, 500 bp and 5000 bp.


WORK IN PROGRESS.