Difference between revisions of "SMRT Analysis: Read correction"

From mn/ibv/bioinfwiki
Jump to: navigation, search
(Created page with "Smrtpipe – Using read correction Introduction Due to the nature of real-time sequencing, PacBio reads are long, but of comparable poor quality. Since the PacBio library pr...")
 
Line 1: Line 1:
Smrtpipe – Using read correction
+
= Introduction =
 
 
Introduction
 
  
 
Due to the nature of real-time sequencing, PacBio reads are long, but of comparable poor quality. Since the PacBio library preparation results in circular DNA molecules, some of the shorter molecules are processed by the polymerase several times. This results in multiple reads of the same molecule, and this information then is used to produce high-quality circular consensus (CCS) reads. However, in genome assembly long reads, even of poor quality, are the extremely important. Therefore, multiple methods of error-correcting long (i.e. non-CCS) reads have been implemented. Common to these methods is the alignment of shorter reads to the long reads, thus calculating a consensus sequence for the long reads.
 
Due to the nature of real-time sequencing, PacBio reads are long, but of comparable poor quality. Since the PacBio library preparation results in circular DNA molecules, some of the shorter molecules are processed by the polymerase several times. This results in multiple reads of the same molecule, and this information then is used to produce high-quality circular consensus (CCS) reads. However, in genome assembly long reads, even of poor quality, are the extremely important. Therefore, multiple methods of error-correcting long (i.e. non-CCS) reads have been implemented. Common to these methods is the alignment of shorter reads to the long reads, thus calculating a consensus sequence for the long reads.
  
Using self-correction of PacBio reads
+
= Using self-correction of PacBio reads =
  
 
PacBios smrtanalysis package seems to be moving towards supporting only self-correction of PacBio reads. This means using the PacBio shorter reads to error-correct longer PacBio reads. The RS_HGAP_Assembly.3 protocol is using self-correction before assembly; the P_PreAssemblerDagcon module is doing the actual read-correction.
 
PacBios smrtanalysis package seems to be moving towards supporting only self-correction of PacBio reads. This means using the PacBio shorter reads to error-correct longer PacBio reads. The RS_HGAP_Assembly.3 protocol is using self-correction before assembly; the P_PreAssemblerDagcon module is doing the actual read-correction.
Line 13: Line 11:
 
The current version of smrtanalysis supports genome sizes of up to 130 Mb. Also, the default settings favour the assembly of larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes.
 
The current version of smrtanalysis supports genome sizes of up to 130 Mb. Also, the default settings favour the assembly of larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes.
  
For details, see xxx
+
= Using Illumina reads for read correction =
 
 
Using Illumina reads for read correction
 
  
 
Correcting PacBio long reads using Illumina short reads is another option for read-correction. This was first made possible by the pbToCA algorithm. Later on, error-correction using Illumina reads also became possible inside the smrtpipe pipeline. However, the latest versions of the smrtpipe fail when running this protocol (it is missing the “smrtanalysis-2.3.0/analysis/bin/align2layouts.py” script). I have been told smrtpipe (version 2.3.0) no longer supports read-correction using Illumina reads, even though the failing protocol was taken from the smrtpipe user guide version 2.3.0. Be this as it may, using smrtpipe to do Illumina error-correction only works in smrtanalysis verion 2.1.1 or earlier. Alternatively, other options also exist outside the smrtanalysis package (see later sections for this).
 
Correcting PacBio long reads using Illumina short reads is another option for read-correction. This was first made possible by the pbToCA algorithm. Later on, error-correction using Illumina reads also became possible inside the smrtpipe pipeline. However, the latest versions of the smrtpipe fail when running this protocol (it is missing the “smrtanalysis-2.3.0/analysis/bin/align2layouts.py” script). I have been told smrtpipe (version 2.3.0) no longer supports read-correction using Illumina reads, even though the failing protocol was taken from the smrtpipe user guide version 2.3.0. Be this as it may, using smrtpipe to do Illumina error-correction only works in smrtanalysis verion 2.1.1 or earlier. Alternatively, other options also exist outside the smrtanalysis package (see later sections for this).
  
 
In the following example, the input.xml file needs to point to both PacBio reads (here in the form of one bax.h5 file), and Illumina reads (formatted as a fastq file):
 
In the following example, the input.xml file needs to point to both PacBio reads (here in the form of one bax.h5 file), and Illumina reads (formatted as a fastq file):
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
&lt;?xml version="1.0"?&gt;
 
&lt;?xml version="1.0"?&gt;
  
Line 38: Line 34:
  
 
&lt;/pacbioAnalysisInputs&gt;
 
&lt;/pacbioAnalysisInputs&gt;
 
+
</div>
&nbsp;
+
&nbsp;The following settings.xml file will perform the read-correction (note the “useFastqAsShortReads” parameter):
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
The following settings.xml file will perform the read-correction (note the “useFastqAsShortReads” parameter):
+
&lt;?xml version="1.0"&nbsp;?&gt;
 
 
&lt;?xml version="1.0" ?&gt;
 
  
 
&lt;smrtpipeSettings&gt;
 
&lt;smrtpipeSettings&gt;
Line 112: Line 106:
  
 
&lt;/smrtpipeSettings&gt;
 
&lt;/smrtpipeSettings&gt;
 
+
</div>
&nbsp;
+
&nbsp;Run this example with (remember NOT to load the newest smrtanalysis package!):
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
Run this example with (remember NOT to load the newest smrtanalysis package!):
 
 
 
 
module load smrtanalysis/2.1.1
 
module load smrtanalysis/2.1.1
  
 
smrtpipe.py -D TMP=/path/to/outputFolder -D SHARED_DIR=/path/to/outputFolder -D NPROC=8 --output=/path/to/outputFolder --params=/path/to/settings.xml xml:/path/to/input.xml
 
smrtpipe.py -D TMP=/path/to/outputFolder -D SHARED_DIR=/path/to/outputFolder -D NPROC=8 --output=/path/to/outputFolder --params=/path/to/settings.xml xml:/path/to/input.xml
 
+
</div>
 
The output...
 
The output...
  
 
/outputFolder/data/corrected.fasta&nbsp; and fastq
 
/outputFolder/data/corrected.fasta&nbsp; and fastq
 
&nbsp;
 
 
&nbsp;
 
 
&nbsp;
 
 
&nbsp;
 
 
&nbsp;
 
 
-PacBio only:
 
 
P_PreAssemblerDagcon Module does the reads correction in v2.3.0 (RS_HGAP_Assembly.3)
 
 
&nbsp;
 
 
1. Genome Size
 
 
To accurately determine the Minimum Seed Read Length and the coverage of trimmed preassembled reads going into the assembly step, it is important to adjust the target genome size as accurately as possible.
 
 
&nbsp;
 
 
2. Automatic Minimum Seed Read Length calculation
 
 
The Minimum Seed Read Length that results in at least 30X target genome coverage by the longest subreads is being calculated automatically (the default option). To use the user-selected Minimum Seed Read Length, the default option has to be deselected. If less than 30X coverage is being used for the HGAP process, the algorithm will use the user-selected Minimum Seed Length (6kb default), so lowering the default setting to 500bp is required to allow all-vs-all PreAssembly at lower than 30X coverage.
 
 
&nbsp;
 
 
The HGAP 2 & 3 workflows have been configured for larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes.
 

Revision as of 19:42, 9 March 2015

Introduction

Due to the nature of real-time sequencing, PacBio reads are long, but of comparable poor quality. Since the PacBio library preparation results in circular DNA molecules, some of the shorter molecules are processed by the polymerase several times. This results in multiple reads of the same molecule, and this information then is used to produce high-quality circular consensus (CCS) reads. However, in genome assembly long reads, even of poor quality, are the extremely important. Therefore, multiple methods of error-correcting long (i.e. non-CCS) reads have been implemented. Common to these methods is the alignment of shorter reads to the long reads, thus calculating a consensus sequence for the long reads.

Using self-correction of PacBio reads

PacBios smrtanalysis package seems to be moving towards supporting only self-correction of PacBio reads. This means using the PacBio shorter reads to error-correct longer PacBio reads. The RS_HGAP_Assembly.3 protocol is using self-correction before assembly; the P_PreAssemblerDagcon module is doing the actual read-correction.

In this case, the input.xml file only needs to contain the filenames of the bas.h5/bax.h5 files holding the PacBio reads. The pipeline itself splits the reads into short and long reads, based on the genome size specified by the user. Basically, the read dataset is split so as to assure a sufficient high average coverage of the whole genome by the long reads. Therefore, setting the genome size correctly is very important for this process.

The current version of smrtanalysis supports genome sizes of up to 130 Mb. Also, the default settings favour the assembly of larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes.

Using Illumina reads for read correction

Correcting PacBio long reads using Illumina short reads is another option for read-correction. This was first made possible by the pbToCA algorithm. Later on, error-correction using Illumina reads also became possible inside the smrtpipe pipeline. However, the latest versions of the smrtpipe fail when running this protocol (it is missing the “smrtanalysis-2.3.0/analysis/bin/align2layouts.py” script). I have been told smrtpipe (version 2.3.0) no longer supports read-correction using Illumina reads, even though the failing protocol was taken from the smrtpipe user guide version 2.3.0. Be this as it may, using smrtpipe to do Illumina error-correction only works in smrtanalysis verion 2.1.1 or earlier. Alternatively, other options also exist outside the smrtanalysis package (see later sections for this).

In the following example, the input.xml file needs to point to both PacBio reads (here in the form of one bax.h5 file), and Illumina reads (formatted as a fastq file):

<?xml version="1.0"?>

<pacbioAnalysisInputs>

   <dataReferences>

        <url ref="run:0000000-0001">

           <location>/path/to/bax.h5</location>

       </url>

       <url ref="fastq:/path/to/Fastq"/>

   </dataReferences>

</pacbioAnalysisInputs>

 The following settings.xml file will perform the read-correction (note the “useFastqAsShortReads” parameter):

<?xml version="1.0" ?>

<smrtpipeSettings>

  <module name="P_Fetch"/>

    <module name="P_Filter">

        <param name="filters">

               <value>MinRL=1000,MinReadScore=0.80</value>

        </param>

        <param name="artifact">

               <value>-1000</value>

        </param>

    </module>

    <module name="P_PreAssembler">

        <param name="useFastqAsShortReads">

            <value>True</value>

        </param>

        <param name="useFastaAsLongReads">

            <value>False</value>

        </param>

        <param name="useLongReadsInConsensus">

            <value>False</value>

        </param>

        <param name="useUnalignedReadsInConsensus">

            <value>False</value>

        </param>

        <param name="blasrOpts">

            <value>-minMatch 8 -minReadLength 30 -maxScore -100 -minPctIdentity 70 -bestn 100</value>

        </param>

        <param name="layoutOpts">

            <value>--overlapTolerance=25</value>

        </param>

        <param name="consensusOpts">

            <value>-w 2</value>

        </param>

   </module>

</smrtpipeSettings>

 Run this example with (remember NOT to load the newest smrtanalysis package!):

module load smrtanalysis/2.1.1

smrtpipe.py -D TMP=/path/to/outputFolder -D SHARED_DIR=/path/to/outputFolder -D NPROC=8 --output=/path/to/outputFolder --params=/path/to/settings.xml xml:/path/to/input.xml

The output...

/outputFolder/data/corrected.fasta  and fastq