Error correction of Illumina reads using Celera Assembler

From mn/bio/cees-bioinf
Revision as of 13:59, 14 June 2013 by Olekto@uio.no (talk | contribs)

Jump to: navigation, search

It is usually useful to pre-process the data you have before assembly. An ideal dataset would be error free, contain only genomic sequences, and no artificial duplications or chimeric sequences. Removing these before assembly usually give a better assembly than not removing them, and the assembly process itself often runs quicker.

Some assemblers, like ALLPATHS-LG, is mostly self-contained, and does not need any pre-processing of the data. Celera Assembler is also mostly self-contained, but it is more flexible regarding which modules to run when, and you can often pre-process the data however you want it before running an assembly on it.

For this how-to, I chose part of the dataset for the budgerigar that was used in the Assemblathon 2 [1]. There is a lot of data for this bird, but I choose just a limited set for this how-to, the BGI sequenced Illumina data with 220 bp insert, 500 bp insert and 5000 bp insert sizes.

Since the data with 220 bp insert are overlapping, I also run FLASH [2] to merge the overlapping reads. The merged reads will then be error corrected.

The workflow is like this:

1. Get data

2a. Make a database of trusted kmers

2b. Merge overlapping reads

3. Error correct/trim/remove adapters of all the reads

The paired end reads will be ready for assembly, but I would remove duplicates and wrongly oriented reads (paired end contaminants) from the mate pair library.

I have set up a bitbucket repository with a makefile that contain all the commands you need to get through this how-to. Just do a

 git clone https://bitbucket.org/olekto/parrot_assembly.git

in a suitable location. I assume that both Celera Assembler [3] and FLASH [4] are in your path. I have used FLASH 1.2.3 and Celera Assembler from SVN June 11th, but later versions will probably give the same result.

If you just want to reproduce the whole thing, just do a

 make -f parrot_error_correct.make

We will start with downloading the data. Since there are 8 gzip'ed file that needs to be downloaded and gunzip'ed, I would just do a

 make -f parrot_error_correct.make get_data

which will run only the "get_data" entry in the make file, and will download and extract the files needed for the rest of the how-to.

meryl [5] is a program that is part of the Celera Assembler package and that is used to count kmers. It might not work on large fastq files (because of many reads), so we just put all the sequences from the paired end fastq files into a single fasta file with a lot less entries.

 cat parrot_220_1.fq | perl merge_seq.pl > trusted_kmers_pe.fa
 cat parrot_220_2.fq | perl merge_seq.pl >> trusted_kmers_pe.fa
 cat parrot_500_1.fq | perl merge_seq.pl >> trusted_kmers_pe.fa
 cat parrot_500_2.fq | perl merge_seq.pl >> trusted_kmers_pe.fa


WORK IN PROGRESS.