PBcR: Read correction

The main purpose of the PBcR program is the correction of PacBio long reads. The two main modes of error-correction are self-correction (where short PacBio reads are used to correct the longer reads) and error-correction using complementary data (i.e. high-quality Illumina reads).

PBcR has no explicit switch or input parameter that selects the error-correction mode. Instead, it infers the intended mode from the read files supplied: if Illumina data is present (in the form of a *.frg file), complementary read correction is used. Without a fragment file, self-correction is enabled.
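
The rule above can be illustrated with a small shell function. This is only a sketch of the described behaviour, not PBcR's own code; the function name pbcr_mode and the file names are hypothetical.

```shell
# Hypothetical helper: reports which correction mode would be expected
# for a given list of input files, following the rule described above.
pbcr_mode() {
    for f in "$@"; do
        case "$f" in
            # Any fragment (*.frg) file implies complementary correction.
            *.frg) echo "complementary"; return 0 ;;
        esac
    done
    # No fragment file found: self-correction.
    echo "self"
}

pbcr_mode pacbio.fastq illumina.frg   # prints "complementary"
pbcr_mode pacbio.fastq                # prints "self"
```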

Self-correction of PacBio reads

Self-correction of long PacBio reads with shorter reads is possible with 20X+ coverage of C2 sequencing data (or newer).

The categorization of reads as 'long' or 'short' is controlled by the -maxCoverage input parameter: only the longest sequences adding up to this coverage will be corrected. This requires the -genomeSize parameter to be specified; the default value of -maxCoverage is 40X.
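
To make the arithmetic concrete: the number of bases selected for correction is roughly the genome size multiplied by the maximum coverage. The sketch below picks the longest reads until that budget is reached. The function name select_longest is hypothetical, and whether PBcR includes the read that crosses the threshold is an assumption made here for simplicity.

```shell
# Hypothetical helper: reads one read length per line on stdin and prints
# the longest reads whose combined length reaches
# genome_size * max_coverage (max_coverage defaults to 40).
select_longest() {
    genome_size=$1
    max_coverage=${2:-40}
    # Sort longest first, then keep reads while the budget is not yet filled.
    sort -rn | awk -v budget="$((genome_size * max_coverage))" '
        total < budget { print; total += $1 }
    '
}

# Toy example: a 1000 bp genome at 2X gives a budget of 2000 bases,
# so the two longest reads (1200 and 900) are selected.
printf '%s\n' 300 1200 500 900 100 | select_longest 1000 2
```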

BLASR is automatically enabled when using self-correction. This means that the PacBio SMRT Analysis package must be loaded before running PBcR:

module load smrtanalysis/2.3.0

Alternatively, the Bowtie2 program can be used. Again, this means loading the required module before executing PBcR:

module load bowtie2

Additionally, PBcR now supports a new algorithm, MHAP, that promises a 600-fold speed-up of the read-correction process. MHAP is only available for the self-correction of PacBio reads. It requires Java 1.8, which has now been installed on the Abel system (including Freebee). It also requires a large amount of memory; when correcting reads from larger genomes it may be necessary to run PBcR on one of Abel's high-memory nodes. If these requirements are met, MHAP will be used for read correction. Otherwise, the BLASR algorithm will be employed.
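
Whether the Java requirement is met can be checked before submitting a job. The following is a minimal sketch; the helper java_supports_mhap is hypothetical, and treating versions newer than 1.8 as acceptable is an assumption rather than something PBcR guarantees.

```shell
# Hypothetical helper: prints "yes" if the given Java version string
# (e.g. "1.8.0_45") is 1.8 or newer, and "no" otherwise.
java_supports_mhap() {
    version=$1
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}         # second version component
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 8 ]; }; then
        echo yes
    else
        echo no
    fi
}

# The version string can typically be taken from the running Java, e.g.:
#   java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
java_supports_mhap 1.8.0_45   # prints "yes"
java_supports_mhap 1.7.0_80   # prints "no"
```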

Complementary (Illumina) read correction

Error-correction of PacBio long reads with Illumina data requires 30X to 50X coverage of Illumina reads. If using Illumina reads of 100 bp or shorter, the -shortReads option should be specified. Again, BLASR or Bowtie2 may optionally be used as mapping algorithms. MHAP, however, is not available for complementary read correction.
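
Before starting a run it can be useful to check the Illumina coverage and whether -shortReads applies. The sketch below uses the usual estimate (number of reads times read length, divided by genome size); the function names and the example numbers are hypothetical.

```shell
# Hypothetical helper: estimated coverage, rounded down to a whole number.
# Arguments: number of reads, read length in bp, genome size in bp.
illumina_coverage() {
    echo $(( $1 * $2 / $3 ))
}

# Hypothetical helper: prints "-shortReads" for reads of 100 bp or shorter,
# and nothing otherwise.
short_reads_flag() {
    if [ "$1" -le 100 ]; then
        echo "-shortReads"
    fi
}

# Example: 2,000,000 reads of 100 bp on a 5 Mbp genome.
illumina_coverage 2000000 100 5000000   # prints 40
short_reads_flag 100                    # prints -shortReads
```

In this example the coverage of 40X falls inside the 30X to 50X range, and the read length of 100 bp means that -shortReads should be given.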

Note that further development of complementary read correction algorithms is not planned in PBcR (the same is the case for PacBio's SMRT Analysis package). In the long term, self-correction seems to be the preferred technology. Algorithms such as MHAP will help reduce the large CPU-hour overhead inherent in read correction, but are unlikely to be implemented for complementary read correction.

The consensus stage

After the short reads have been mapped to the long reads, a consensus stage finds the corrected sequence. Two different algorithms can be employed for finding the consensus: the default falcon_sense algorithm or the external pbdagcon algorithm. In order to use pbdagcon, the SMRT Analysis package must be loaded as demonstrated above. pbdagcon is a more sensitive consensus finder, and can be beneficial when the coverage is below 60X. It can be switched on as part of the sensitive mode: