SMRT Analysis: Assembly

From mn/ibv/bioinfwiki
Revision as of 14:39, 23 March 2015 by Ralfne@uio.no (talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Introduction

The SMRT Analysis protocols include the Hierarchical Genome-Assembly Process (HGAP) protocol. This protocol covers the entire genome assembly process including the main steps of read correction, assembly and polishing of the resulting sequences. At the time of writing, the latest version of HGAP is the “HGAP.3” version. It speeds up the assembly process relative to the “HGAP.2” version, especially the read correction step. Even though it was introduced in SMRT Analysis 2.2, it is still flagged as a “beta” version in the 2.3.0 version user guide. This may be a typo; I have not found any information indicating that the “HGAP.3” version is unfinished, nor have I encountered any problems running this protocol.

The “HGAP.3” protocol

The “HGAP.3” protocol assumes all reads to be PacBio reads, correcting the longer reads using shorter PacBio reads. At present (in SMRT Analyis 2.2.0) the maximal genome size is 130 MB (it has been increased from 10 MB in previous versions). The main steps include read correction using the “PB/dagcon” algorithm, assembly of long reads using the Celera assembler, and polishing of the resulting sequences using the “Quiver” program.

Obtaining the “HGAP.3” protocol file

The “HGAP.3” protocol is included in the SMRT Analysis example folder:

/cluster/software/VERSIONS/smrtanalysis-2.3.0/doc/examples/smrtpipe_assembly_hgap3/params.xml

This version of the HGAP protocol is optimized for genomes larger than bacterial genomes; if performing high-coverage bacterial sequencing some of the values in this example may have to be changed. See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis for details.

Setting the genome size

Most defaults used in the example protocol can be accepted as they are. However, it is critical to adjust the default genome size value to a realistic estimate. This value is found in the “P_AssembleUnitig” module:

<module name="P_AssembleUnitig">

   <param name="genomeSize">

      <value>47000</value>

   </param>

This value is used to categorize the input reads as “short” and “long” reads, the short reads being used to correct the long reads. Based in the specified genome size value, enough of the longest reads are categorized as “long” so as to average a 30X coverage of the whole genome. This is the default behavior of the “HGAP.3” protocol; the resulting minimum seed read length that separates “short” from “long” reads can also be set manually. If selecting this option, the default minimum seed read length is 6kb by default (unless changed by the user).

Using the HGAP protocol on Abel

The assembly process will require many hours, perhaps days to complete. Thus, it must be run as a separate job on the Abel HPC. This means including the smrtpipe.py command in a job file, specifying adequate memory and cpu-hour constraints.

Example of a “HGAP.3” job file

The following example will run the “HGAP.3” protocol located at /path/to/paramsFile.xml, using an input file located at /path/to/paramsFile.xml. The output is written to the /path/to/outputFolder/ folder. On entire node is requested (i.e. demanding 16 CPUs per node); the memory requirement is given as 3500 MB per CPU. These settings will enable the assembly of genomes with sizes at least up to 40 MB. The --account value (i.e. "myAccountName”) must be changed to a valid Abel account (for most members of the UoO, the “uio” account will suffice). The module purge command is included so as to make sure no unwanted modules are loaded when running the smrtpipe pipeline. 

#!/bin/bash

#SBATCH --job-name=my_HGAP_assembly

#SBATCH --account=myAccountName

#SBATCH --time=48:00:00

#SBATCH --mem-per-cpu=3500M

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=16


source /cluster/bin/jobsetup


module purge


module load smrtanalysis/2.3.0


smrtpipe.py -D TMP=/path/to/outputFolder/ -D SHARED_DIR=/path/to/outputFolder/ -D NPROC=

16 --output=/path/to/outputFolder/ --params=/path/to/paramsFile.xml xml: /path/to/inputFile.xml 

After creating this job file, it must be submitted into the Abel queuing system:

module purge

sbatch /path/to/jobFile

The module purge command has been included here so as to make sure the smrtanalysis module is not loaded when submitting the job script (if it is loaded, for enigmatic reasons it will provoke a parsing error).

Assembly examples

The above job script has been tested with varying input files and genomes. The memory requirements have been found to be close to the acceptable minimum, but will enable the assembly of genomes of at least 40 MB in size.

Some of those results include:


Assembly of the E. coli genome using only one bax.h5 file (1/3 of the available read data; 2 GB out of 6 GB in file size); genome size has been set to 5.000.000 basepairs:

Time used ~40 minutes (i.e. ca 10 CPU-hours since 16 CPUs have been used)
Polished Contigs 149
Max Contig Length 273288
N50 Contig Length 66615
Sum of Contig Lengths 4463041


Assembly of the E. coli genome using three bax.h5 files (all of the available read data; 6 GB in file size); genome size has been set to 5.000.000 basepairs:

Time used ~1 hour (i.e. ca 16 CPU-hours since 16 CPUs have been used)
Polished Contigs 1
Max Contig Length 4657311
N50 Contig Length 4657311
Sum of Contig Lengths 4657311

 

Assembly of the N. crassa genome using six bax.h5 files (all of the available read data; 29 GB in file size); genome size has been set to 40.000.000 basepairs:

Time used ~5 hour (i.e. ca 80 CPU-hours since 16 CPUs have been used)
Polished Contigs 2726
Max Contig Length 70783
N50 Contig Length 3751
Sum of Contig Lengths 7552784