Error correction of Illumina reads using Celera Assembler

From mn/bio/cees-bioinf
Revision as of 13:45, 14 June 2013 by Olekto@uio.no (talk | contribs)

Jump to: navigation, search

It is usually useful to pre-process the data you have before assembly. An ideal dataset would be error free, contain only genomic sequences, and no artificial duplications or chimeric sequences. Removing these before assembly usually give a better assembly than not removing them, and the assembly process itself often runs quicker.

Some assemblers, like ALLPATHS-LG, is mostly self-contained, and does not need any pre-processing of the data. Celera Assembler is also mostly self-contained, but it is more flexible regarding which modules to run when, and you can often pre-process the data however you want it before running an assembly on it.

For this how-to, I chose part of the dataset for the budgerigar that was used in the Assemblathon 2 [1]. There is a lot of data for this bird, but I choose just a limited set for this how-to, the BGI sequenced Illumina data with 220 bp insert, 500 bp insert and 5000 bp insert sizes.

Since the data with 220 bp insert are overlapping, I also run FLASH [2] to merge the overlapping reads. The merged reads will then be error corrected.

The workflow is like this: 1. Get data 2a. Make a database of trusted kmers 2b. Merge overlapping reads 3. Error correct/trim/remove adapters of all the reads

The paired end reads will be ready for assembly, but I would remove duplicates and wrongly oriented reads (paired end contaminants) from the mate pair library.

I've set up a bitbucket repository with a makefile that contain all the commands you need to get through this how-to. Just do a

 git clone https://bitbucket.org/olekto/parrot_assembly.git

in a suitable location. I assume that both Celera Assembler [wgs-assembler.sourceforge.net] and FLASH [3] are in your path.

WORK IN PROGRESS.