Difference between revisions of "PBcR: Executing the pipeline"

From mn/ibv/bioinfwiki
Jump to: navigation, search
Line 29: Line 29:
 
The above will execute PBcR, while writing its output to a log file (<span style="font-family:courier new,courier,monospace;">PBcR_logfile.txt</span>). The "2>&1" command at the end directs both the standard output and error output to the log file.
 
The above will execute PBcR, while writing its output to a log file (<span style="font-family:courier new,courier,monospace;">PBcR_logfile.txt</span>). The "2>&1" command at the end directs both the standard output and error output to the log file.
  
The MHAP read correction algorithm needs a lot of memory. Using the lambda virus test data set (see here), this rather small data set demanded the usage of almost 70 GB of memory for the Java process. Thus, it is likely that larger data sets must be run on Abel's high-memory nodes.
+
The MHAP read correction algorithm needs a lot of memory. Using the lambda virus test data set (see [http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Assembling_a_Lambda_Phage here]), this rather small data set demanded the usage of almost 70 GB of memory for the Java process. Thus, it is likely that larger data sets must be run on Abel's high-memory nodes.
  
 
= Executing PBcR using Illumina read correction =
 
= Executing PBcR using Illumina read correction =
  
This example assumes that an appropriate specification (<span style="font-family:courier new,courier,monospace;">*.spec</span>) file has been created, that the PacBio reads are formatted as a fastq file &nbsp;and that Illumina reads are present, wrapped up in a fragment (<span style="font-family:courier new,courier,monospace;">*.frg</span>) file (for details see [[PBcR:_Input_file_formats|here]]):
+
This example assumes that an appropriate specification (<span style="font-family:courier new,courier,monospace;">*.spec</span>) file has been created, that the PacBio reads are formatted as a fastq file &nbsp;and that Illumina reads are present, wrapped up in a fragment (<span style="font-family:courier new,courier,monospace;">*.frg</span>) file (for details see [[PBcR: Input file formats|here]]):
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
mkdir outputDir cd outputDir
 
mkdir outputDir cd outputDir

Revision as of 13:19, 4 May 2015

Introduction

When executing, the PBcR pipeline typically performs both read correction and subsequently assembles the corrected reads. It is worth noticing that the read-correction step consumes quite a bit more CPU-hours than does assembly with Celera Assembler (CA). In fact, if doing complementary read correction using Illumina reads, more than 99% of computing time is spend on the read correction step. If planning to assemble large genomes and/or utilizing a great sequencing depth, it is well worth considering to use only PacBio reads, as self-correction (especially using MHAP) is much faster than using Illumina reads. If desirable, the PBcR pipeline can stop after the read correction step by specifying:

assemble=0

The structure of the PBcR output folder reflects how far into the assembly CA has progressed. This means that CA is able to recover from a crash, i.e. the assembler is able to continue from the step it had reached when the crash occured. Obviously, this does not include error that are caused by internal problems such as incorrect data. This also means that is is important to start each distinct run of PBcR in a new empty folder (or alternatively delete all previous output files before running the pipeline). Otherwise, the program will not re-computer possibly changed input data, but rather continue from where it had stoped at the last run.

Also, PBcR will write its output relative to the current folder (i.e. to the folder from where it was called). This means you should create your output folder, step into it, and then execute PBcR. Since PBcR is capable of using the BLASR mapping algorithm (see here), the SMRT Analysis package containing this algorithm may be loaded prior to executing PBcR. (I have seen that PBcR reports an error if not loading BLASR, even though it should be capable of running its own mapper instead of BLASR if the latter is not available...)

Getting the PBcR help file

The PBcR help file is available by calling PBcR without arguments:

/path/to/PBcR/bin/PBcR

Executing PBcR using self-correction of PacBio reads

PBcR will attempt to use the new MHAP algorithm if performing self-correction of PacBio reads. This will result in a great reduction in execution time. The MHAP algorithm depends on Java version 1.8. Previously, there have been problems in loading this Java version on Abel - these issues now appear to be solved (i.e. Java version 1.8 should now be installed on the Abel frontend, the cluster itself and also on freebee). It is not possible to combine running the MHAP algorithm with loading the SMRT Analysis package (for instance in order to use the BLASR algorithm for assembly). This because SMRT Analysis implements its own Java environment using version 1.7; it would override the Java version 1.8 installed on Abel. Additionally, the self-correction pipeline needs access to the the Statistics/Descriptive.pm perl module. This is accessable if loading the appropriate module (perlmodules) before running PBcR. Thus the following will perform self-correction and assembly of PacBio reads on Abel:

mkdir outfolder

cd outfolder

module load perlmodules

PBcR/wgs-8.3rc1/Linux-amd64/bin/PBcR -length 500 -partitions 200 -l libraryname -s /path/to/specfile.spec -fastq /path/to/pacbio.filtered_subreads.fastq genomeSize=50000 > PBcR_logfile.txt 2>&1

The above will execute PBcR, while writing its output to a log file (PBcR_logfile.txt). The "2>&1" command at the end directs both the standard output and error output to the log file.

The MHAP read correction algorithm needs a lot of memory. Using the lambda virus test data set (see here), this rather small data set demanded the usage of almost 70 GB of memory for the Java process. Thus, it is likely that larger data sets must be run on Abel's high-memory nodes.

Executing PBcR using Illumina read correction

This example assumes that an appropriate specification (*.spec) file has been created, that the PacBio reads are formatted as a fastq file  and that Illumina reads are present, wrapped up in a fragment (*.frg) file (for details see here):

mkdir outputDir cd outputDir

module load smrtanalysis/2.3.0

/path/to/PBcR/bin/PBcR -length 500 -partitions 200 -l lambdaIll -s /path/to/specfile.spec -fastq /path/to/pacbio.filtered_subreads.fastq genomeSize=50000 /path/to/illumina.frg > PBcR_logfile.txt 2>&1

The above will execute PBcR, while writing its output to a log file (PBcR_logfile.txt). The "2>&1" command at the end directs both the standard output and error output to the log file.