Difference between revisions of "SMRT Analysis: Assembly"

From mn/ibv/bioinfwiki
Jump to: navigation, search
 
(4 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
The SMRT Analysis protocols include the Hierarchical Genome-Assembly Process (HGAP) protocol. This protocol covers the entire genome assembly process including the main steps of read correction, assembly and polishing of the resulting sequences. At the time of writing, the latest version of HGAP is the “HGAP.3” version. It speeds up the assembly process relative to the “HGAP.2” version, especially the read correction step. Even though it was introduced in SMRT Analysis 2.2, it is still flagged as a “beta” version in the 2.3.0 version user guide. This may be a typo; I have not found any information indicating that the “HGAP.3” version is unfinished, nor have I encountered any problems running this protocol.
 
The SMRT Analysis protocols include the Hierarchical Genome-Assembly Process (HGAP) protocol. This protocol covers the entire genome assembly process including the main steps of read correction, assembly and polishing of the resulting sequences. At the time of writing, the latest version of HGAP is the “HGAP.3” version. It speeds up the assembly process relative to the “HGAP.2” version, especially the read correction step. Even though it was introduced in SMRT Analysis 2.2, it is still flagged as a “beta” version in the 2.3.0 version user guide. This may be a typo; I have not found any information indicating that the “HGAP.3” version is unfinished, nor have I encountered any problems running this protocol.
  
 
+
= The “HGAP.3” protocol =
 
 
= “HGAP.3” protocol =
 
  
 
The “HGAP.3” protocol assumes all reads to be PacBio reads, correcting the longer reads using shorter PacBio reads. At present (in SMRT Analyis 2.2.0) the maximal genome size is 130 MB (it has been increased from 10 MB in previous versions). The main steps include read correction using the “PB/dagcon” algorithm, assembly of long reads using the Celera assembler, and polishing of the resulting sequences using the “Quiver” program.
 
The “HGAP.3” protocol assumes all reads to be PacBio reads, correcting the longer reads using shorter PacBio reads. At present (in SMRT Analyis 2.2.0) the maximal genome size is 130 MB (it has been increased from 10 MB in previous versions). The main steps include read correction using the “PB/dagcon” algorithm, assembly of long reads using the Celera assembler, and polishing of the resulting sequences using the “Quiver” program.
Line 16: Line 14:
  
 
This version of the HGAP protocol is optimized for genomes larger than bacterial genomes; if performing high-coverage bacterial sequencing some of the values in this example may have to be changed. See [https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis] for details.
 
This version of the HGAP protocol is optimized for genomes larger than bacterial genomes; if performing high-coverage bacterial sequencing some of the values in this example may have to be changed. See [https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis] for details.
 
 
  
 
== Setting the genome size ==
 
== Setting the genome size ==
Line 27: Line 23:
 
 
  
<param name="genomeSize">
+
   <param name="genomeSize">
  
        <value>47000</value>
+
      <value>47000</value>
  
    </param>
+
   </param>
 
</div>
 
</div>
 
This value is used to categorize the input reads as “short” and “long” reads, the short reads being used to correct the long reads. Based in the specified genome size value, enough of the longest reads are categorized as “long” so as to average a 30X coverage of the whole genome. This is the default behavior of the “HGAP.3” protocol; the resulting minimum seed read length that separates “short” from “long” reads can also be set manually. If selecting this option, the default minimum seed read length is 6kb by default (unless changed by the user).
 
This value is used to categorize the input reads as “short” and “long” reads, the short reads being used to correct the long reads. Based in the specified genome size value, enough of the longest reads are categorized as “long” so as to average a 30X coverage of the whole genome. This is the default behavior of the “HGAP.3” protocol; the resulting minimum seed read length that separates “short” from “long” reads can also be set manually. If selecting this option, the default minimum seed read length is 6kb by default (unless changed by the user).
 
&nbsp;
 
  
 
= Using the HGAP protocol on Abel =
 
= Using the HGAP protocol on Abel =
Line 43: Line 37:
 
== Example of a “HGAP.3” job file ==
 
== Example of a “HGAP.3” job file ==
  
The following example will run the “HGAP.3” protocol located at /path/to/paramsFile.xml”, using an input file located at /path/to/paramsFile.xml”. The output is written to the /path/to/outputFolder/folder. On entire node is requested (i.e. demanding 16 CPUs per node); the memory requirement is given as 3500 MB per CPU. These settings will enable the assembly of genomes with sizes at least up to 40 MB. The --account” value (“myAccountName”) must be changed to a valid Abel account (for most members of the UoO, the “uio” account will suffice). The “module purge” command is included so as to make sure no unwanted modules are loaded when running the smrtpipe pipeline.&nbsp;
+
The following example will run the “HGAP.3” protocol located at <span style="font-family:courier new,courier,monospace;">/path/to/paramsFile.xml</span>, using an input file located at <span style="font-family:courier new,courier,monospace;">/path/to/paramsFile.xml</span>. The output is written to the <span style="font-family:courier new,courier,monospace;">/path/to/outputFolder/</span>&nbsp;folder. On entire node is requested (i.e. demanding 16 CPUs per node); the memory requirement is given as 3500 MB per CPU. These settings will enable the assembly of genomes with sizes at least up to 40 MB. The <span style="font-family:courier new,courier,monospace;">--account</span> value (i.e. "<span style="font-family:courier new,courier,monospace;">myAccountName</span>”) must be changed to a valid Abel account (for most members of the UoO, the “<span style="font-family:courier new,courier,monospace;">uio</span>” account will suffice). The <span style="font-family:courier new,courier,monospace;">module purge</span> command is included so as to make sure no unwanted modules are loaded when running the smrtpipe pipeline.&nbsp;
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
<nowiki>#</nowiki>!/bin/bash
 
<nowiki>#</nowiki>!/bin/bash
Line 58: Line 52:
  
 
<nowiki>#</nowiki>SBATCH --ntasks-per-node=16
 
<nowiki>#</nowiki>SBATCH --ntasks-per-node=16
 +
 +
  
 
source /cluster/bin/jobsetup
 
source /cluster/bin/jobsetup
 +
 +
  
 
module purge
 
module purge
 +
 +
  
 
module load smrtanalysis/2.3.0
 
module load smrtanalysis/2.3.0
 +
 +
  
 
smrtpipe.py -D TMP=/path/to/outputFolder/ -D SHARED_DIR=/path/to/outputFolder/ -D NPROC=
 
smrtpipe.py -D TMP=/path/to/outputFolder/ -D SHARED_DIR=/path/to/outputFolder/ -D NPROC=
  
16 --output=/path/to/outputFolder/ --params=/path/to/paramsFile.xml xml: /path/to/inputFile.xml
+
16 --output=/path/to/outputFolder/ --params=/path/to/paramsFile.xml xml: /path/to/inputFile.xml<span style="font-family: Arial, Verdana, sans-serif; background-color: rgb(255, 255, 255);">&nbsp;</span>
 
</div>
 
</div>
&nbsp;
 
 
 
After creating this job file, it must be submitted into the Abel queuing system:
 
After creating this job file, it must be submitted into the Abel queuing system:
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
Line 77: Line 77:
 
sbatch /path/to/jobFile
 
sbatch /path/to/jobFile
 
</div>
 
</div>
The “module purge” command has been included here so as to make sure the smrtanalysis module is not loaded when submitting the job script (if it is loaded, for enigmatic reasons it will provoke a parsing error).
+
The <span style="font-family:courier new,courier,monospace;">module purge</span> command has been included here so as to make sure the smrtanalysis module is not loaded when submitting the job script (if it is loaded, for enigmatic reasons it will provoke a parsing error).
 
 
&nbsp;
 
  
 
== Assembly examples ==
 
== Assembly examples ==
Line 87: Line 85:
 
Some of those results include:
 
Some of those results include:
  
<u>Assembly of the ''E. coli'' genome using only one bax.h5 file (1/3 of the available read data; 2 GB out of 6 GB in file size); genome size has been set to 5.000.000 basepairs:</u>
 
 
Time used: ~40 minutes (i.e. ca 10 CPU-hours since 16 CPUs have been used)
 
 
Polished Contigs:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 149
 
 
Max Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 273288
 
 
N50 Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 66615
 
 
Sum of Contig Lengths:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4463041
 
  
&nbsp;
 
  
<u>Assembly of the ''E. coli'' genome using three bax.h5 files (all of the available read data; 6 GB in file size); genome size has been set to 5.000.000 basepairs:</u>
+
'''Assembly of the ''E. coli'' genome using only one bax.h5 file (1/3 of the available read data; 2 GB out of 6 GB in file size); genome size has been set to 5.000.000 basepairs:'''
  
Time used: ~1 hour (i.e. ca 16 CPU-hours since 16 CPUs have been used)
+
{| border="0" cellspacing="1" cellpadding="1" style="width: 100%;"
 +
|-
 +
| Time used
 +
| ~40 minutes (i.e. ca 10 CPU-hours since 16 CPUs have been used)
 +
|-
 +
| Polished Contigs
 +
| 149
 +
|-
 +
| Max Contig Length
 +
| 273288
 +
|-
 +
| N50 Contig Length
 +
| 66615
 +
|-
 +
| Sum of Contig Lengths
 +
| 4463041
 +
|}
  
Polished Contigs:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1
 
  
Max Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4657311
 
  
N50 Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4657311
+
'''Assembly of the ''E. coli'' genome using three bax.h5 files (all of the available read data; 6 GB in file size); genome size has been set to 5.000.000 basepairs:'''
  
Sum of Contig Lengths:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4657311
+
{| border="0" cellspacing="1" cellpadding="1" style="width: 100%;"
 +
|-
 +
| Time used
 +
| ~1 hour (i.e. ca 16 CPU-hours since 16 CPUs have been used)
 +
|-
 +
| Polished Contigs
 +
| 1
 +
|-
 +
| Max Contig Length
 +
| 4657311
 +
|-
 +
| N50 Contig Length
 +
| 4657311
 +
|-
 +
| Sum of Contig Lengths
 +
| 4657311
 +
|}
  
 
&nbsp;
 
&nbsp;
  
<u>Assembly of the ''N. crassa'' genome using six bax.h5 files (all of the available read data; 29 GB in file size); genome size has been set to 40.000.000 basepairs:</u>
+
'''Assembly of the ''N. crassa'' genome using six bax.h5 files (all of the available read data; 29 GB in file size); genome size has been set to 40.000.000 basepairs:'''
 
 
Time used: ~5 hour (i.e. ca 80 CPU-hours since 16 CPUs have been used)
 
 
 
Polished Contigs:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2726
 
 
 
Max Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 70783
 
 
 
N50 Contig Length:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3751
 
  
Sum of Contig Lengths:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 7552784
+
{| border="0" cellspacing="1" cellpadding="1" style="width: 100%;"
 +
|-
 +
| Time used
 +
| ~5 hour (i.e. ca 80 CPU-hours since 16 CPUs have been used)
 +
|-
 +
| Polished Contigs
 +
| 2726
 +
|-
 +
| Max Contig Length
 +
| 70783
 +
|-
 +
| N50 Contig Length
 +
| 3751
 +
|-
 +
| Sum of Contig Lengths
 +
| 7552784
 +
|}

Latest revision as of 15:39, 23 March 2015

Introduction

The SMRT Analysis protocols include the Hierarchical Genome-Assembly Process (HGAP) protocol. This protocol covers the entire genome assembly process including the main steps of read correction, assembly and polishing of the resulting sequences. At the time of writing, the latest version of HGAP is the “HGAP.3” version. It speeds up the assembly process relative to the “HGAP.2” version, especially the read correction step. Even though it was introduced in SMRT Analysis 2.2, it is still flagged as a “beta” version in the 2.3.0 version user guide. This may be a typo; I have not found any information indicating that the “HGAP.3” version is unfinished, nor have I encountered any problems running this protocol.

The “HGAP.3” protocol

The “HGAP.3” protocol assumes all reads to be PacBio reads, correcting the longer reads using shorter PacBio reads. At present (in SMRT Analyis 2.2.0) the maximal genome size is 130 MB (it has been increased from 10 MB in previous versions). The main steps include read correction using the “PB/dagcon” algorithm, assembly of long reads using the Celera assembler, and polishing of the resulting sequences using the “Quiver” program.

Obtaining the “HGAP.3” protocol file

The “HGAP.3” protocol is included in the SMRT Analysis example folder:

/cluster/software/VERSIONS/smrtanalysis-2.3.0/doc/examples/smrtpipe_assembly_hgap3/params.xml

This version of the HGAP protocol is optimized for genomes larger than bacterial genomes; if performing high-coverage bacterial sequencing some of the values in this example may have to be changed. See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-in-SMRT-Analysis for details.

Setting the genome size

Most defaults used in the example protocol can be accepted as they are. However, it is critical to adjust the default genome size value to a realistic estimate. This value is found in the “P_AssembleUnitig” module:

<module name="P_AssembleUnitig">

   <param name="genomeSize">

      <value>47000</value>

   </param>

This value is used to categorize the input reads as “short” and “long” reads, the short reads being used to correct the long reads. Based in the specified genome size value, enough of the longest reads are categorized as “long” so as to average a 30X coverage of the whole genome. This is the default behavior of the “HGAP.3” protocol; the resulting minimum seed read length that separates “short” from “long” reads can also be set manually. If selecting this option, the default minimum seed read length is 6kb by default (unless changed by the user).

Using the HGAP protocol on Abel

The assembly process will require many hours, perhaps days to complete. Thus, it must be run as a separate job on the Abel HPC. This means including the smrtpipe.py command in a job file, specifying adequate memory and cpu-hour constraints.

Example of a “HGAP.3” job file

The following example will run the “HGAP.3” protocol located at /path/to/paramsFile.xml, using an input file located at /path/to/paramsFile.xml. The output is written to the /path/to/outputFolder/ folder. On entire node is requested (i.e. demanding 16 CPUs per node); the memory requirement is given as 3500 MB per CPU. These settings will enable the assembly of genomes with sizes at least up to 40 MB. The --account value (i.e. "myAccountName”) must be changed to a valid Abel account (for most members of the UoO, the “uio” account will suffice). The module purge command is included so as to make sure no unwanted modules are loaded when running the smrtpipe pipeline. 

#!/bin/bash

#SBATCH --job-name=my_HGAP_assembly

#SBATCH --account=myAccountName

#SBATCH --time=48:00:00

#SBATCH --mem-per-cpu=3500M

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=16


source /cluster/bin/jobsetup


module purge


module load smrtanalysis/2.3.0


smrtpipe.py -D TMP=/path/to/outputFolder/ -D SHARED_DIR=/path/to/outputFolder/ -D NPROC=

16 --output=/path/to/outputFolder/ --params=/path/to/paramsFile.xml xml: /path/to/inputFile.xml 

After creating this job file, it must be submitted into the Abel queuing system:

module purge

sbatch /path/to/jobFile

The module purge command has been included here so as to make sure the smrtanalysis module is not loaded when submitting the job script (if it is loaded, for enigmatic reasons it will provoke a parsing error).

Assembly examples

The above job script has been tested with varying input files and genomes. The memory requirements have been found to be close to the acceptable minimum, but will enable the assembly of genomes of at least 40 MB in size.

Some of those results include:


Assembly of the E. coli genome using only one bax.h5 file (1/3 of the available read data; 2 GB out of 6 GB in file size); genome size has been set to 5.000.000 basepairs:

Time used ~40 minutes (i.e. ca 10 CPU-hours since 16 CPUs have been used)
Polished Contigs 149
Max Contig Length 273288
N50 Contig Length 66615
Sum of Contig Lengths 4463041


Assembly of the E. coli genome using three bax.h5 files (all of the available read data; 6 GB in file size); genome size has been set to 5.000.000 basepairs:

Time used ~1 hour (i.e. ca 16 CPU-hours since 16 CPUs have been used)
Polished Contigs 1
Max Contig Length 4657311
N50 Contig Length 4657311
Sum of Contig Lengths 4657311

 

Assembly of the N. crassa genome using six bax.h5 files (all of the available read data; 29 GB in file size); genome size has been set to 40.000.000 basepairs:

Time used ~5 hour (i.e. ca 80 CPU-hours since 16 CPUs have been used)
Polished Contigs 2726
Max Contig Length 70783
N50 Contig Length 3751
Sum of Contig Lengths 7552784