SMRT Analysis: Assembly
The SMRT Analysis protocols include the Hierarchical Genome-Assembly Process (HGAP) protocol. This protocol covers the entire genome assembly process including the main steps of read correction, assembly and polishing of the resulting sequences. At the time of writing, the latest version of HGAP is the “HGAP.3” version. It speeds up the assembly process relative to the “HGAP.2” version, especially the read correction step. Even though it was introduced in SMRT Analysis 2.2, it is still flagged as a “beta” version in the 2.3.0 version user guide. This may be a typo; I have not found any information indicating that the “HGAP.3” version is unfinished, nor have I encountered any problems running this protocol.
The “HGAP.3” protocol
The “HGAP.3” protocol assumes all reads to be PacBio reads, correcting the longer reads using shorter PacBio reads. At present (in SMRT Analyis 2.2.0) the maximal genome size is 130 MB (it has been increased from 10 MB in previous versions). The main steps include read correction using the “PB/dagcon” algorithm, assembly of long reads using the Celera assembler, and polishing of the resulting sequences using the “Quiver” program.
Obtaining the “HGAP.3” protocol file
The “HGAP.3” protocol is included in the SMRT Analysis example folder:
This version of the HGAP protocol is optimized for genomes larger than bacterial genomes; if performing high-coverage bacterial sequencing some of the values in this example may have to be changed. See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis for details.
Setting the genome size
Most defaults used in the example protocol can be accepted as they are. However, it is critical to adjust the default genome size value to a realistic estimate. This value is found in the “P_AssembleUnitig” module:
This value is used to categorize the input reads as “short” and “long” reads, the short reads being used to correct the long reads. Based in the specified genome size value, enough of the longest reads are categorized as “long” so as to average a 30X coverage of the whole genome. This is the default behavior of the “HGAP.3” protocol; the resulting minimum seed read length that separates “short” from “long” reads can also be set manually. If selecting this option, the default minimum seed read length is 6kb by default (unless changed by the user).
Using the HGAP protocol on Abel
The assembly process will require many hours, perhaps days to complete. Thus, it must be run as a separate job on the Abel HPC. This means including the smrtpipe.py command in a job file, specifying adequate memory and cpu-hour constraints.
Example of a “HGAP.3” job file
The following example will run the “HGAP.3” protocol located at /path/to/paramsFile.xml, using an input file located at /path/to/paramsFile.xml. The output is written to the /path/to/outputFolder/ folder. On entire node is requested (i.e. demanding 16 CPUs per node); the memory requirement is given as 3500 MB per CPU. These settings will enable the assembly of genomes with sizes at least up to 40 MB. The --account value (i.e. "myAccountName”) must be changed to a valid Abel account (for most members of the UoO, the “uio” account will suffice). The module purge command is included so as to make sure no unwanted modules are loaded when running the smrtpipe pipeline.
After creating this job file, it must be submitted into the Abel queuing system:
The module purge command has been included here so as to make sure the smrtanalysis module is not loaded when submitting the job script (if it is loaded, for enigmatic reasons it will provoke a parsing error).
The above job script has been tested with varying input files and genomes. The memory requirements have been found to be close to the acceptable minimum, but will enable the assembly of genomes of at least 40 MB in size.
Some of those results include:
Assembly of the E. coli genome using only one bax.h5 file (1/3 of the available read data; 2 GB out of 6 GB in file size); genome size has been set to 5.000.000 basepairs:
|Time used||~40 minutes (i.e. ca 10 CPU-hours since 16 CPUs have been used)|
|Max Contig Length||273288|
|N50 Contig Length||66615|
|Sum of Contig Lengths||4463041|
Assembly of the E. coli genome using three bax.h5 files (all of the available read data; 6 GB in file size); genome size has been set to 5.000.000 basepairs:
|Time used||~1 hour (i.e. ca 16 CPU-hours since 16 CPUs have been used)|
|Max Contig Length||4657311|
|N50 Contig Length||4657311|
|Sum of Contig Lengths||4657311|
Assembly of the N. crassa genome using six bax.h5 files (all of the available read data; 29 GB in file size); genome size has been set to 40.000.000 basepairs:
|Time used||~5 hour (i.e. ca 80 CPU-hours since 16 CPUs have been used)|
|Max Contig Length||70783|
|N50 Contig Length||3751|
|Sum of Contig Lengths||7552784|