PBcR: Introduction

The PacBio corrected Reads (PBcR) pipeline is an open-source program primarily intended for the error correction of long PacBio reads. Due to the sequencing technology used, such reads can reach great lengths (>60 Kb), but are of comparatively low quality. The error rate is typically around 15%, with errors found predominantly in homopolymer runs (i.e. indel-type errors). In addition to PacBio reads, PBcR may also be used for other long-read technologies with similar error characteristics (such as the Oxford Nanopore MinION sequencer).

In addition to error correction, PBcR can optionally assemble the provided reads. For both read correction and assembly, it is built on the Celera Assembler (CA), which is reflected in both the input file formats and the output folder structure. PBcR is a flexible pipeline and can be configured to include other tools (if available) such as BLASR and Bowtie2.

For genome assembly using PacBio reads, read correction is a major part of the assembly process. Given the high error rate of PacBio reads, it is vital for the final quality of the assembly, yet it is also a major bottleneck. In an assembly of a bacterial genome using PacBio long reads together with Illumina short reads, more than 90% of the CPU-hours used can be spent on read correction alone. The current version of PBcR contains a new algorithm (MHAP) that promises to reduce the computational overhead of read correction. At the time of writing, this algorithm is only applicable to the self-correction of long reads. It is also not (yet) included in the SMRT Analysis package, PacBio's own software solution.
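
MHAP stands for MinHash Alignment Process. To give a rough idea of why a MinHash-based approach can cut the cost of finding overlapping reads, the following is a minimal Python sketch of the general MinHash technique. It is not the PBcR/MHAP implementation: the function names, the k-mer size, the number of hash functions and the use of BLAKE2b as hash are all assumptions chosen for illustration. Each read is reduced to a short sketch of minimum hash values, and the fraction of matching sketch positions estimates how many k-mers two reads share, without aligning them.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hash a k-mer to a 64-bit integer; seed selects one of many hash functions."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Build a MinHash sketch: the minimum hash value per hash function.

    k and num_hashes are illustrative values, not PBcR defaults.
    The sequence must be at least k bases long.
    """
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two reads whose sketches agree in many positions are likely to overlap and can be passed on to a more expensive alignment step, while pairs with few or no matching positions can be skipped. Comparing two fixed-size sketches is much cheaper than aligning two long reads, which is where the potential saving in CPU-hours comes from.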