SMRT Analysis: Introduction


PacBio sequencing differs from most other sequencing technologies in that it provides fewer, much longer, but error-prone reads. This means that many of the bioinformatic tools developed for other sequencing technologies cannot be used with PacBio reads. Therefore, PacBio has developed its own pipeline that is capable of performing many sequencing-related tasks, ranging from assembly to variant calling.

The PacBio pipeline consists of two main components. The web-based SMRT Portal provides a graphical user interface for selecting PacBio sequencing data, setting up the pipeline protocol, and evaluating the output of the pipeline. The actual computational work is done by the SMRT Analysis package. This software bundle contains the smrtpipe program, which calls the individual programs (such as the Celera assembler) that are part of the pipeline.

The SMRT Portal communicates with the smrtpipe program using XML files. These files list the files containing the read data (input or input.xml files), and specify which programs to include in the pipeline execution (protocol or params.xml files). These files (typically, one of each) are generated by the SMRT Portal and subsequently used as arguments when starting the smrtpipe program.

The SMRT Analysis package is open-source and can be downloaded from the PacBio website free of charge. The SMRT Portal, however, is available only to PacBio customers. This means that many users end up writing their own input.xml and params.xml files. At the UoO, members of the CEES group have access to the SMRT Portal, but even they must manually run smrtpipe with the input and protocol files produced by the SMRT Portal. Therefore, this PacBio walkthrough will focus on the command-line usage of the smrtpipe program.

Running the smrtpipe program

The smrtpipe program (smrtpipe.py) is part of the SMRT Analysis package. Several versions of this package are pre-installed on the UoO Abel high-performance computing (HPC) system, available for use after loading the desired module. The program itself only requires the filenames of the params.xml and input.xml files to run:

    smrtpipe.py --params=params.xml xml:input.xml

However, on Abel, smrtpipe should NOT be used as displayed above. There are several reasons for this:

  • smrtpipe does not resolve relative paths correctly. ALWAYS use absolute paths (starting at the root with “/...”)!
  • smrtpipe creates many temporary files and folders. We need to specify a location for these, so as not to clutter the default location. Some protocols may actually crash if the tmp folder has not been properly defined.
  • we can run smrtpipe in parallel mode, speeding up execution time considerably.

A valid way of running smrtpipe on Abel may be:

    module load smrtanalysis/2.3.0
    smrtpipe.py -D TMP=/path/to/outputFolder/ -D SHARED_DIR=/path/to/outputFolder/ -D NPROC=8 --output=/path/to/outputFolder/ --params=/path/to/params.xml xml:/path/to/input.xml &> /path/to/smrtpipe.err

Here, we use the “-D” option to override the default settings for the tmp folder, the shared folder and the number of processors used (8). We also specify our outputFolder (in this example, we are using the same folder for output and temporary files). Finally, all output of the program (both standard output and standard error) is redirected to an error file. Note that ALL files and folders are specified using absolute paths. Often, examples of usage are given as follows:

    smrtpipe.py -D TMP=./ -D SHARED_DIR=./ -D NPROC=8 --params=/path/to/params.xml xml:/path/to/input.xml &> smrtpipe.err

This is not advisable, as using relative paths for the tmp and shared folders WILL fail for at least some protocols!
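One way to guard against accidentally passing relative paths is to let a small wrapper turn every path into an absolute one before the command line is assembled. The following is only a sketch: the function names abs_path and build_smrtpipe_cmd are hypothetical helpers, not part of the SMRT Analysis package, and the options are simply those used in the Abel example above.

```shell
#!/bin/bash
# Sketch only: abs_path and build_smrtpipe_cmd are hypothetical helper names.

# Print the absolute form of a path (the path does not have to exist yet).
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;        # already absolute, keep as-is
        *)  printf '%s\n' "$PWD/$1" ;;   # relative: prefix the current directory
    esac
}

# Print a smrtpipe command line built from: output folder, params file,
# input file and (optionally) the number of processors (default: 8).
# The same folder is used for output, TMP and SHARED_DIR, as in the example above.
build_smrtpipe_cmd() {
    local out params input nproc
    out=$(abs_path "$1")
    params=$(abs_path "$2")
    input=$(abs_path "$3")
    nproc=${4:-8}
    printf 'smrtpipe.py -D TMP=%s -D SHARED_DIR=%s -D NPROC=%s --output=%s --params=%s xml:%s\n' \
        "$out" "$out" "$nproc" "$out" "$params" "$input"
}
```

Calling, for example, build_smrtpipe_cmd outputFolder params.xml input.xml from the project directory only prints the resulting command. This makes it possible to inspect the paths before actually running the command with its output redirected to an error file.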

Smrtpipe workflow, results and error logs

After running the smrtpipe program, the outputFolder will contain four main folders:

  • results
  • data
  • workflow
  • log

The “results” folder contains files describing the results of the smrtpipe run, such as reports (often in the form of HTML pages) and images (PNG format). Most of the actual data resulting from a smrtpipe run are found in the “data” folder. The “workflow” folder contains subfolders corresponding to the various modules included in the protocol file. Each of those subfolders contains shell scripts that are written by the smrtpipe program as part of the pipeline execution. These shell scripts typically invoke Python scripts stored in the “/smrtanalysis-2.3.0/analysis/bin/” folder.

The “log” folder contains the “master.log” text file, which is the main log file of the smrtpipe program. It is rather detailed, and can provide valuable information in case of errors. This folder also contains subfolders corresponding to the various steps performed by the pipeline. These subfolders contain log files written by the shell scripts mentioned above. Typically, they first write the whole shell script to the log file (to verify that it was written correctly), followed by the actual log statements. Sometimes, these log files may also hold important information.
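When a run fails, a quick first pass over the log folder can be made with grep. The sketch below assumes that problems show up as lines containing “error” or “exception”. This is a guess about the log wording, not something the SMRT Analysis package guarantees, and scan_logs is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: scan_logs is a hypothetical helper, and the search terms
# "error" and "exception" are an assumption about what the logs contain.

# Print "file:line:text" for every matching line in master.log and in the
# per-step log files found in the subfolders of <outputFolder>/log.
scan_logs() {
    local logdir="$1/log"
    # -r: recurse into the step subfolders   -n: show line numbers
    # -i: ignore case                        -E: allow the "a|b" alternation
    grep -rniE 'error|exception' "$logdir"
}
```

For example, scan_logs /path/to/outputFolder lists the matching lines together with the file they come from. Because the per-step logs start with a copy of the shell script itself, some matches may be script lines rather than actual log statements.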

Finally, the output folder contains an “index.html” file. This file serves as an entry point to the reports and results created by the pipeline. To use it, copy the entire output directory over to your local machine, and open this file using a web browser.