Error correction of Illumina reads using Celera Assembler

From mn/bio/cees-bioinf
Jump to: navigation, search

It is usually useful to pre-process the data you have before assembly. An ideal dataset would be error free, contain only genomic sequences, and no artificial duplications or chimeric sequences. Removing these before assembly usually give a better assembly than not removing them, and the assembly process itself often runs quicker.

Some assemblers, like ALLPATHS-LG, is mostly self-contained, and does not need any pre-processing of the data. Celera Assembler is also mostly self-contained, but it is more flexible regarding which modules to run when, and you can often pre-process the data however you want it before running an assembly on it.

For this how-to, I chose part of the dataset for the budgerigar that was used in the Assemblathon 2 [1]. There is a lot of data for this bird, but I choose just a limited set for this how-to, the BGI sequenced Illumina data with 220 bp insert, 500 bp insert and 5000 bp insert sizes.

Since the data with 220 bp insert are overlapping, I also run FLASH [2] to merge the overlapping reads. The merged reads will then be error corrected.

The workflow is like this:

1. Get data

2a. Make a database of trusted kmers

2b. Merge overlapping reads

3. Error correct/trim/remove adapters of all the reads

The paired end reads will be ready for assembly, but I would remove duplicates and wrongly oriented reads (paired end contaminants) from the mate pair library.

I have set up a bitbucket repository with a makefile that contain all the commands you need to get through this how-to. Just do a

 git clone https://bitbucket.org/olekto/parrot_assembly.git

in a suitable location. I assume that both Celera Assembler [3] and FLASH [4] are in your path. I have used FLASH 1.2.3 and Celera Assembler from SVN June 11th 2013, with a small difference that in merTrim, sequences that has been found without any valid kmers are not empty any longer, but has a single N. Later versions will probably give the same result.

If you just want to reproduce the whole thing, just do a

 make -f parrot_error_correct.make

We will start with downloading the data. Since there are 8 gzip'ed file that needs to be downloaded and gunzip'ed, I would just do a

 make -f parrot_error_correct.make get_data

which will run only the "get_data" entry in the make file, and will download and extract the files needed for the rest of the how-to.

meryl [5] is a program that is part of the Celera Assembler package and that is used to count kmers. It might not work on large fastq files (because of many reads), so we just put all the sequences from the paired end fastq files into a single fasta file with a lot less entries.

 cat raw_data/parrot_220_1.fq | perl merge_seq.pl > trusted_kmers_pe.fa
 cat raw_data/parrot_220_2.fq | perl merge_seq.pl >> trusted_kmers_pe.fa
 cat raw_data/parrot_500_1.fq | perl merge_seq.pl >> trusted_kmers_pe.fa
 cat raw/data/parrot_500_2.fq | perl merge_seq.pl >> trusted_kmers_pe.fa

The merge_seq.pl script is in the bitbucket repository. It basically just put the sequences for the reads from the fastq file after each other with a single 'N' between them. The 'N' is not a valid nucleotide, and stop meryl from counting kmers across reads. When there are more than 1 000 000 nucleotides after each other, merge_seq.pl will start a new entry.

After putting all the sequences for the reads into a single file, we need to count them with meryl.

 meryl -B -m 22 -memory 250000 -threads 32 -C -s trusted_kmers_pe.fa -o trusted_kmers

The -B option tell meryl to build a table, -m sets the kmer size, -memory sets the memory limit in MB, -threads the number of threads (surprise!), -C make meryl count both strands (I assume it reverse complements the kmer and count that too), -s just point to the input fasta/fastq file and -o sets the output prefix.

We also need to merge the overlapping reads with FLASH:

 flash -r 150 -f 220 -s 22 parrot_220_1.fq parrot_220_2.fq -o parrot_220 > merged_parrot_220.out 2>&1

FLASH is set up to default work with 100 bp long reads from 180 bp large fragments, so we just have to tell it that these are 150 bp long reads from 220 bp large fragments with a standard deviation of 22. Then we specify the input files, and set the output prefix.

One reason of running FLASH before error correction is that, or any other tool that merge overlapping reads, is that with FLASH you only look at two reads at a time and the program know that these should be overlap with about 80 bp in this case (20 bp from a library with 180 bp insert size and 100 bp reads), with a standard deviation of about 22 bp. It is easier for FLASH to find the overlap even with some mismatches, and it can then use whichever of the bases that overlap that has the best quality to put into the finished read. The result is a 220 bp read, with maybe a small quality drop in the middle. If we had run error correction first, the last bases of each read might have been trimmed off. merTrim does trim off the last bases if they have quality 2 for example, while FLASH would have tried to utilize them. Basically, FLASH help us getting the best quality reads out when running it before merTrim.

After running FLASH, we get down to the error correction. These are all the commands that needs to be run:

 merTrim -F parrot_220.extendedFrags.fastq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_220.extendedFrags_mT.fq 1> parrot_220_ext_mT.out 2> parrot_220_ext_mT.err
 merTrim -F parrot_220.notCombined_1.fastq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_220.notCombined_1_mT.fq 1> parrot_220_comb_1_mT.out 2> parrot_220_comb_1_mT.err
 merTrim -F parrot_220.notCombined_2.fastq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_220.notCombined_2_mT.fq 1> parrot_220_comb_2_mT.out 2> parrot_220_comb_2_mT.err
 merTrim -F raw_data/parrot_500_1.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_500_1_mT.fq 1> parrot_500_1_mT.out 2> parrot_500_1_mT.err
 merTrim -F raw_data/parrot_500_2.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_500_2_mT.fq 1> parrot_500_2_mT.out 2> parrot_500_2_mT.err
 merTrim -F raw_data/parrot_5k_1_1.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_5k_1_1_mT.fq 1> parrot_5k_1_1_mT.out 2> parrot_5k_1_1_mT.err
 merTrim -F raw_data/parrot_5k_1_2.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_5k_1_2_mT.fq 1> parrot_5k_1_2_mT.out 2> parrot_5k_1_2_mT.err
 merTrim -F raw_data/parrot_5k_2_1.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_5k_2_1_mT.fq 1> parrot_5k_2_1_mT.out 2> parrot_5k_2_1_mT.err
 merTrim -F raw_data/parrot_5k_2_2.fq -m 22 -mc trusted_kmers -mCillumina -t 32 -o trimmed_and_corrected_reads/parrot_5k_2_2_mT.fq 1> parrot_5k_2_2_mT.out 2> parrot_5k_2_2_mT.err	

merTrim will correct all these reads, and remove the adaptors. -F specify the input file, -m sets the kmer size (22), -mc points to a meryl database with trusted kmers, -mCillumina tell merTrim to remove Illumina adapters (as long as they are longer than the kmer size), -t sets the number of threads, -o specify where the output should be put. And I redirect the standard output and error from each of these runs to their own files.

After this error correction, the paired end should be ready for use in assembly, while we want to remove paired end contamination in the mate pair library before we use that in assembly (and remove duplications in the library). There is several ways to do this, Celera Assembler has a denovo classification pipeline that removes the paired end contamination, and a separate pipeline for removal of duplications, but we will make an assembly of just the paired end reads and then map the mate pairs back to this assembly to remove the unwanted reads.

WORK IN PROGRESS.