PBcR: Executing the pipeline

From mn/ibv/bioinfwiki
Jump to: navigation, search

Introduction

When executing, the PBcR pipeline typically performs both read correction and subsequently assembles the corrected reads. It is worth noticing that the read-correction step consumes quite a bit more CPU-hours than does assembly with Celera Assembler (CA). In fact, if doing complementary read correction using Illumina reads, more than 99% of computing time is spend on the read correction step. If planning to assemble large genomes and/or utilizing a great sequencing depth, it is well worth considering to use only PacBio reads, as self-correction (especially using MHAP) is much faster than using Illumina reads. If desirable, the PBcR pipeline can stop after the read correction step by specifying:

assemble=0

The structure of the PBcR output folder reflects how far into the assembly CA has progressed. This means that CA is able to recover from a crash, i.e. the assembler is able to continue from the step it had reached when the crash occured. Obviously, this does not include error that are caused by internal problems such as incorrect data. This also means that is is important to start each distinct run of PBcR in a new empty folder (or alternatively delete all previous output files before running the pipeline). Otherwise, the program will not re-computer possibly changed input data, but rather continue from where it had stoped at the last run.

Also, PBcR will write its output relative to the current folder (i.e. to the folder from where it was called). This means you should create your output folder, step into it, and then execute PBcR. Since PBcR is capable of using the BLASR mapping algorithm (see here), the SMRT Analysis package containing this algorithm may be loaded prior to executing PBcR. (I have seen that PBcR reports an error if not loading BLASR, even though it should be capable of running its own mapper instead of BLASR if the latter is not available...)

Getting the PBcR help file

The PBcR help file is available by calling PBcR without arguments:

/path/to/PBcR/bin/PBcR

Executing PBcR using self-correction of PacBio reads

PBcR will attempt to use the new MHAP algorithm if performing self-correction of PacBio reads. This will result in a great reduction in execution time. The MHAP algorithm depends on Java version 1.8. Previously, there have been problems in loading this Java version on Abel - these issues now appear to be solved (i.e. Java version 1.8 should now be installed on the Abel frontend, the cluster itself and also on freebee). It is not possible to combine running the MHAP algorithm with loading the SMRT Analysis package (for instance in order to use the BLASR algorithm for assembly). This because SMRT Analysis implements its own Java environment using version 1.7; it would override the Java version 1.8 installed on Abel. Additionally, the self-correction pipeline needs access to the the Statistics/Descriptive.pm perl module. This is accessable if loading the appropriate module (perlmodules) before running PBcR. Thus the following will perform self-correction and assembly of PacBio reads on Abel:

mkdir outfolder

cd outfolder

module load perlmodules

PBcR/wgs-8.3rc1/Linux-amd64/bin/PBcR -length 500 -partitions 200 -l libraryname -s /path/to/specfile.spec -fastq /path/to/pacbio.filtered_subreads.fastq genomeSize=50000 > PBcR_logfile.txt 2>&1

The above will execute PBcR, while writing its output to a log file (PBcR_logfile.txt). The "2>&1" command at the end directs both the standard output and error output to the log file.

The MHAP read correction algorithm needs a lot of memory. 32GB of memory in general should be sufficient to index 20.000 sequences (see here).

Executing PBcR using Illumina read correction

This example assumes that an appropriate specification (*.spec) file has been created, that the PacBio reads are formatted as a fastq file  and that Illumina reads are present, wrapped up in a fragment (*.frg) file (for details see here):

mkdir outputDir cd outputDir

module load smrtanalysis/2.3.0

/path/to/PBcR/bin/PBcR -length 500 -partitions 200 -l lambdaIll -s /path/to/specfile.spec -fastq /path/to/pacbio.filtered_subreads.fastq genomeSize=50000 /path/to/illumina.frg > PBcR_logfile.txt 2>&1

The above will execute PBcR, while writing its output to a log file (PBcR_logfile.txt). The "2>&1" command at the end directs both the standard output and error output to the log file.