Difference between revisions of "SMRT Analysis: Read filtering"

From mn/ibv/bioinfwiki
Jump to: navigation, search
Line 41: Line 41:
 
For instance, the following command displays the individual sections present in a bas.h5 file:
 
For instance, the following command displays the individual sections present in a bas.h5 file:
  
h5ls -r /path/to/bas.h5 
+
<span style="font-family:courier new,courier,monospace;">h5ls -r /path/to/bas.h5&nbsp;</span>
  
 
= Filtering reads PacBio reads =
 
= Filtering reads PacBio reads =

Revision as of 16:29, 9 March 2015

Introduction

PacBio stores reads and the associated quality values in HDF5-encoded files, which is a binary file format. This means you cannot open these files using a text editor program, unlike for instance fastq or SAM files. If working only with the smrtpipe program, it often may not be necessary to manually inspect the content of HDF5 files. But if using external programs for quality assessment, assembly or other steps, it will become necessary to access these files. This also applies to some specialized usages of the smrtpipe.

Accessing PacBio HDF5 files

PacBio uses two main types of HDF5 files, the bas.h5/bax.h5 files and the cmp.h5 files. Reads are stored in the former format, while alignments (resulting from intermediate steps in the smrtpipe pipeline) are contained in cmp.h5 files.

Prior to the PacBio RS II technology, reads were encoded in one single bas.h5 file. The current version uses three bax.h5 files instead; the (single) bas.h5 still present only holds indices to the reads now contained in the bax.h5 files. The three bax.h5 file names will terminate with “.n.bax.h5”, n giving the part number (i.e. n = 1, 2 or 3). Using PacBio RS II, the bax.h5 files will serve as the input to the smrtpipe program; here, the bas.h5 file can often be ignored.

In the following, only the term bas.h5 will be used, meaning both bas.h5 and bax.h5

Extracting reads from the bas.h5/bax.h5 files

The bash5tools.py script, included in the smrtanalysis package, can be used to extract the reads contained in a bas.h5 file:

module load smrtanalysis/2.3.0

bash5tools.py --readType subreads --outType fasta /path/to/bas.h5

The resulting file will have the same name as the bas.h5 file, but ending with a “.fasta” extension (in case of using bax.h5 files, the resulting file will NOT contain the bax.h5 part number).

The bash5tools.py script can export reads both in the fasta and fastq format. In the latter case, quality values are encoded as phred33 quality scores. Also, three distinct types of PacBio reads can be exported: --readTYPE {ccs, subreads, unrolled}.

  • the “ccs” option exports circular consensus reads, i.e. sequences that have been read multiple times by one sequencing detector.
  • the “subreads” option exports all the subreads present in the raw reads, i.e. the raw reads split by the adaptor sequences.
  • the “unrolled” option exports all raw reads.

For more information about these read types, see the PacBio documentation.

More options for the bash5tools.py script are obtained by using the “-h” option, including for instance quality filtering of the exported reads.

Working with cmp.h5 files

The cmp.h5 files contain alignment information produced by certain steps in the smrtpipe program, and serve themselves as input files for subsequent smrtpipe steps. As for the bas.h5 files, a cmph5tools.py script is available for manipulating cmp.h5 files, included in the smrtanalysis package. Use the “-h” option for further details about this script.

General tools for reading HDF5 files:

The HDF group has made available a number of tools used to manipulate HDF5 files. The HDFView program (http://www.hdfgroup.org/products/java) is a Java program providing a graphical user interface to the content of HDF5 files. Additionally, a number of command line tools are available (http://www.hdfgroup.org/products/hdf5_tools/#tools), most of which are installed on Abel.

For instance, the following command displays the individual sections present in a bas.h5 file:

h5ls -r /path/to/bas.h5 

Filtering reads PacBio reads

PacBio quality values

When exporting PacBio reads to a fastq file, the resulting quality values are phred33-encoded quality values. However, these values are calculated from a number of different quality scores encoded within the bas.h5 file. Thus, the precise meaning of the PacBio values may differ from for instance Illumina quality values. For example, the main errors in PacBio sequencing are indels, not incorrect base calls. The inverse of a quality value for a given base gives the probability of this base in reality being something else, including a gap. But it is unclear where to look for the probability that certain base is missing in the PacBio sequence (i.e. a quality value for a gap in the PacBio sequence). When looking at unfiltered PacBio reads, I seeing mean quality scores of around 80%, to be expected for PacBio reads. Thus, the exported phred33 quality values are probably precise enough to allow a meaningful assessment of sequence qualities using tools developed for other sequencing techniques.

FastQC and PacBio reads

The more recent versions of the FastQC program allow the assessment of PacBio read qualities. The FastQC results are presented as for any other sequencing technology. For instance, the display of per-base quality values uses the standard ordering into good (green), mediocre (yellow) and bad (red) categories. I.e. FastQC does not take into account that PacBio reads are by definition of poorer quality than for instance Illumina reads. This of course means that a “fail” status on the per base sequence quality is not meaningful for these reads.

Using the smrtpipe P_Filter module

It is easy to use the filtering module included in the smrtpipe program on its own. Most (maybe all) protocols would include the usage of this module anyway, so an isolated usage of this module makes only sense if subsequent processing is done with tools not included in the smrtpipe. A protocol using the P_Filter module basically stops after the filtering step.

The following protocol produces filtered reads from the bas.h5 files included in the input.xml file:

<?xml version="1.0" encoding="utf-8"?>

<smrtpipeSettings>

  <protocol id="RS_HGAP_Assembly.3" version="2.3.0" editable="false">

  </protocol>

  <module id="P_Fetch">

  </module>

  <module id="P_Filter">

   <param name="minSubReadLength">

     <value>500</value>

   </param>

   <param name="readScore">

     <value>0.80</value>

   </param>

   <param name="minLength">

    <value>100</value>

   </param>

  </module>

  <module id="P_FilterReports">

  </module>

</smrtpipeSettings>

 

The P_Fetch module is always required in smrtpipe protocols; the P_FilterReports module generates a HTML-based report (it can be excluded if not necessary). In this example, the P_Filter module accepts reads with a mean read score of better than 0.80. This corresponds to a phred33 quality score of around 6, as can be verified if running FastQC on the produced output.

For details of the P_Filter module, such as filtering parameters, input and output file names, see the smrtpipe user guide (https://github.com/PacificBiosciences/SMRT-Analysis/wiki/SMRT-Pipe-Reference-Guide-v2.3.0#P_Filter).

The functionality of the P_Filter module is partially overlapping with that of the bash5tools.py script. Using identical readscore parameters, these two programs give identical results. But the P_Filter module provides more additional options than the bash5tools.py script, such as for instance a defined signal-to-noise ratio.

Filtering reads based on sequence identity (whitelisting)

In addition to excluding reads below a given quality score, often one wants to exclude certain reads based on their sequence. For instance, it may be necessary to filter out bacterial reads from a eukaryotic sample. In this case, the problematic reads may be identified by aligning all reads to a bacterial genome, keeping only the non-aligning reads.

As a first step, the alignment is done using the BLASR program. This program is capable of aligning long PacBio reads to genomic sequences. It is included in the smrtanalysis package, and is also used internally in the smrtpipe, for instance during read correction.

Next, the ids of all aligned reads are recorded (constituting in effect a “blacklist”).

Finally, going through all reads in the bas.h5 input file, all reads not found to be blacklisted are written to a text file. This text file constitutes the whitelist, and must be included in the protocol.xml file (P_Filter module):

<param name="whiteList" label="Read IDs to whitelist">

<value>/path/to/whitelist</value>

</param>

 

Using the whitelisting procedure includes using HDF5 commandline tools to extract read ids, and running a custom script for generating the whitelist. See https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-Whitelisting-Tutorial for details.