Difference between revisions of "SMRT Analysis: Read correction"

From mn/ibv/bioinfwiki
Jump to: navigation, search
Line 114: Line 114:
 
</div>
 
</div>
 
The output of this protocol will be found in the "data" subfolder, in the form of both corrected fasta and fastq files:
 
The output of this protocol will be found in the "data" subfolder, in the form of both corrected fasta and fastq files:
 +
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
  
 
/outputFolder/data/corrected.fasta
 
/outputFolder/data/corrected.fasta
 
 
/outputFolder/data/corrected.fastq
 
/outputFolder/data/corrected.fastq
 +
</div>

Revision as of 18:44, 9 March 2015

Introduction

Due to the nature of real-time sequencing, PacBio reads are long, but of comparable poor quality. Since the PacBio library preparation results in circular DNA molecules, some of the shorter molecules are processed by the polymerase several times. This results in multiple reads of the same molecule, and this information then is used to produce high-quality circular consensus (CCS) reads. However, in genome assembly long reads, even of poor quality, are the extremely important. Therefore, multiple methods of error-correcting long (i.e. non-CCS) reads have been implemented. Common to these methods is the alignment of shorter reads to the long reads, thus calculating a consensus sequence for the long reads.

Using self-correction of PacBio reads

PacBios smrtanalysis package seems to be moving towards supporting only self-correction of PacBio reads. This means using the PacBio shorter reads to error-correct longer PacBio reads. The RS_HGAP_Assembly.3 protocol is using self-correction before assembly; the P_PreAssemblerDagcon module is doing the actual read-correction.

In this case, the input.xml file only needs to contain the filenames of the bas.h5/bax.h5 files holding the PacBio reads. The pipeline itself splits the reads into short and long reads, based on the genome size specified by the user. Basically, the read dataset is split so as to assure a sufficient high average coverage of the whole genome by the long reads. Therefore, setting the genome size correctly is very important for this process.

The current version of smrtanalysis supports genome sizes of up to 130 Mb. Also, the default settings favour the assembly of larger genomes with potentially lower coverage, with defaults that may not make sense for high-coverage bacterial genomes.

Using Illumina reads for read correction

Correcting PacBio long reads using Illumina short reads is another option for read-correction. This was first made possible by the pbToCA algorithm. Later on, error-correction using Illumina reads also became possible inside the smrtpipe pipeline. However, the latest versions of the smrtpipe fail when running this protocol (it is missing the “smrtanalysis-2.3.0/analysis/bin/align2layouts.py” script). I have been told smrtpipe (version 2.3.0) no longer supports read-correction using Illumina reads, even though the failing protocol was taken from the smrtpipe user guide version 2.3.0. Be this as it may, using smrtpipe to do Illumina error-correction only works in smrtanalysis verion 2.1.1 or earlier. Alternatively, other options also exist outside the smrtanalysis package (see later sections for this).

In the following example, the input.xml file needs to point to both PacBio reads (here in the form of one bax.h5 file), and Illumina reads (formatted as a fastq file):

<?xml version="1.0"?>

<pacbioAnalysisInputs>

   <dataReferences>

        <url ref="run:0000000-0001">

           <location>/path/to/bax.h5</location>

       </url>

       <url ref="fastq:/path/to/Fastq"/>

   </dataReferences>

</pacbioAnalysisInputs>

 The following settings.xml file will perform the read-correction (note the “useFastqAsShortReads” parameter):

<?xml version="1.0" ?>

<smrtpipeSettings>

  <module name="P_Fetch"/>

    <module name="P_Filter">

        <param name="filters">

               <value>MinRL=1000,MinReadScore=0.80</value>

        </param>

        <param name="artifact">

               <value>-1000</value>

        </param>

    </module>

    <module name="P_PreAssembler">

        <param name="useFastqAsShortReads">

            <value>True</value>

        </param>

        <param name="useFastaAsLongReads">

            <value>False</value>

        </param>

        <param name="useLongReadsInConsensus">

            <value>False</value>

        </param>

        <param name="useUnalignedReadsInConsensus">

            <value>False</value>

        </param>

        <param name="blasrOpts">

            <value>-minMatch 8 -minReadLength 30 -maxScore -100 -minPctIdentity 70 -bestn 100</value>

        </param>

        <param name="layoutOpts">

            <value>--overlapTolerance=25</value>

        </param>

        <param name="consensusOpts">

            <value>-w 2</value>

        </param>

   </module>

</smrtpipeSettings>

 Run this example with (remember NOT to load the newest smrtanalysis package!):

module load smrtanalysis/2.1.1

smrtpipe.py -D TMP=/path/to/outputFolder -D SHARED_DIR=/path/to/outputFolder -D NPROC=8 --output=/path/to/outputFolder --params=/path/to/settings.xml xml:/path/to/input.xml

The output of this protocol will be found in the "data" subfolder, in the form of both corrected fasta and fastq files:

/outputFolder/data/corrected.fasta /outputFolder/data/corrected.fastq