SMRT Analysis: Mapping reads to a reference
The smrtpipe can be used to map PacBio reads to reference sequences, useful for instance in variant calling. As always, the input.xml file must specify the filenames of the files containing the reads to be mapped. These files will often be the original bas.h5 or bax.h5 files retrieved from the PacBio instrument, but may also be FASTA or FASTQ files (see the description of the input.xml file).
The filename pointing to the reference sequences must be included in the protocol.xml file, rather than in the input.xml file. However, you cannot directly include the filename of the FASTA file containing the reference sequences. Rather, this file must be added to a valid PacBio reference depository. This is done using the referenceUploader program that is part of the smrtanalysis package.
Uploading the reference sequences
The reference sequences must be formatted as a FASTA file. The FASTA headers should not contain any spaces, as this may interfere with creating the *.fai index file (see below). Spaces can be removed using the sed command:
sed -i 's/ //g' <filename>
The FASTA file can now be converted into a valid PacBio reference repository. This repository should be located in a folder of its own - let us assume the chosen folder name is ‘pacbioReferenceRepository’. The referenceUploader tool will create a subfolder which will contain the reference sequence file (and additional files). Specify a short name that will not cause conflicts later on. For instance, using “ecoli” is a bad choice if multiple strains of e. coli are later added to the repository - “ecoliK12” may be a better name.
We are now ready to use the referenceUploader tool (using the values given above):
If using large reference files (i.e. larger than bacterial genomes), add the following argument:
This creates an additional index that will help in the mapping step.
The referenceUploader tool creates the following folders and files:
The ”/path/to/pacbioReferenceRepository/ecoliK12/sequence/” folder contains the actual FASTA sequence file, together with additional index files. The ”/path/to/pacbioReferenceRepository/ecoliK12” folder name is the name you need to specify as the reference file name in the protocol.xml file (i.e. NOT the filename to the actual FASTA sequence file). You do this by adding a parameter to the <global> section of the protocol.xml file:
Additionally, some smrtpipe protocols demand the presence of a samtools faidx index file for the reference sequence FASTA file. You create this index using samtools:
After running this, you should see an index file with the same filename as the FASTA file, but with an ”.fai” extension.
In this example, we have created a new reference depository, adding our reference file to it. Using the referenceUploader tool, it is also possible to add new reference sequences to an existing repository, as well as updating existing repositories. For details, see the referenceUploader help section:
Running a mapping protocol
In addition to the reference repository, we need to create the input.xml file pointing to the read data. The following is an example for a bax.h5 file (see here for input.xml file details):
The protocol.xml file specifies which programs the smrtpipe executes:
(This file is adopted from the “/cluster/software/VERSIONS/smrtanalysis-2.3.0/doc/examples/smrtpipe_resequencing/params.xml” examples file. See this file for further explanations of the modules and parameters involved).
We can now run this protocol.xml file as follows: