Difference between revisions of "PacBio smrtpipe commandline"

From mn/bio/cees-bioinf
Jump to: navigation, search
(Created page with "=== HOW TO run smrtpipe.py on the command line. === smrtpipe.py is the PacBio pipeline for filtering and mapping PacBio runs, as well as running HGAP. For filtering-only jobs, ...")
 
(How to obtain the settings.xml file)
 
(11 intermediate revisions by the same user not shown)
Line 9: Line 9:
 
*a settings.xml file, see below, specifies the job. See the end of this page for how to obtain such a xml file
 
*a settings.xml file, see below, specifies the job. See the end of this page for how to obtain such a xml file
  
<br> '''STEPS'''
+
==== Common steps  ====
  
set up environment  
+
Set up environment  
 
<pre>module load  smrtanalysis/x.y.z # at the time of writing, 'x.y.z' is 2.0.1</pre>  
 
<pre>module load  smrtanalysis/x.y.z # at the time of writing, 'x.y.z' is 2.0.1</pre>  
generate smrtcells.fofn file  
+
Generate smrtcells.fofn file  
 
<pre>find path/to/runfile|grep bax.h5 &gt;smrtcells.fofn #NOTE for pre 2,0 versions, use 'bas.h5'!</pre>  
 
<pre>find path/to/runfile|grep bax.h5 &gt;smrtcells.fofn #NOTE for pre 2,0 versions, use 'bas.h5'!</pre>  
convert this file to correct xml format:  
+
Convert this file to correct xml format:  
 
<pre>fofnToSmrtpipeInput.py smrtcells.fofn &gt;input.xml</pre>  
 
<pre>fofnToSmrtpipeInput.py smrtcells.fofn &gt;input.xml</pre>  
 
copy a settings xml file, e.g.&nbsp;filter_only_settings.xml,&nbsp;HGAP_settings.xml, ...  
 
copy a settings xml file, e.g.&nbsp;filter_only_settings.xml,&nbsp;HGAP_settings.xml, ...  
Line 22: Line 22:
  
 
Actual filtering&nbsp;(NPROC is the number of CPUs the job gets):  
 
Actual filtering&nbsp;(NPROC is the number of CPUs the job gets):  
<pre>smrtpipe.py -D TMP=./ -D SHARED_DIR=./ -D NPROC=24 --params=settings.xml xml:input.xml &amp;&gt; smrtpipe.err</pre>  
+
<pre>smrtpipe.py -D TMP=./ -D SHARED_DIR=./ -D NPROC=24 --params=desired_settings.xml xml:input.xml &amp;&gt; smrtpipe.err</pre>
=== How to obtain the settings.xml file  ===
 
  
After a smrtportal upgrade, if it is not yet there in the folder /projects/nscdata/scripts/pacbio/smrtpipe_xml_files
+
==== How to obtain the settings.xml file ====
  
*In SMRTportal on cod2, set up a job using the protocol you want to run, e.g. RS_Filter_Only, RS_HGAP_Assembly, you ''don't have to execute ''it  
+
After a smrtportal upgrade, if it is not yet there in the folder /projects/nscdata/scripts/pacbio/smrtpipe_xml_files
*note down the job number  
+
 
*from the folder&nbsp;/projects/nscdata/smrtportal/userdata/jobs/016/, find the folder with your job number  
+
*In SMRTportal on cod5, set up a job using the protocol you want to run, e.g. RS_Filter_Only, RS_HGAP_Assembly, you ''don't have to execute ''it
*copy the settings.xml file over to&nbsp;/projects/nscdata/scripts/pacbio/smrtpipe_xml_files/x.y.x/  
+
*note down the job number (if you can't find it, see below)
 +
*from the folder&nbsp;/projects/nscdata/cod5-smrtportal/userdata/jobs/016/, find the folder with your job number (it is most likely the job with the highest number)
 +
*copy the settings.xml file over to&nbsp;/projects/nscdata/scripts/pacbio/smrtpipe_xml_files/x.y.x/ (where x.y.z. is the version of smrtportal)
 
*give it a smart name (check the other folders), e.g. filter_only_settings.xml,&nbsp;HGAP_settings.xml
 
*give it a smart name (check the other folders), e.g. filter_only_settings.xml,&nbsp;HGAP_settings.xml
 +
 +
==== Working with a reference for mapping  ====
 +
 +
This part is based on information from [https://github.com/PacificBiosciences/SMRT-Analysis/wiki/The-Reference-Repository Pacific Biosciences] (accessed November 2013).
 +
 +
Some applications require a reference sequence, e.g. mapping for SNP calling and detecting base modifications, or for running Quiver for assembly polishing. ''smrtanalysis'' needs a reference repository for this, which you can create with the following command:
 +
<pre>referenceUploader -c -p destination_folder -n short_name -f path/to/reference.fasta</pre>
 +
For large genomes (probably anything larger than a bacterial genome), modify the command as such, to create a BLASR suffix array to speed up the mapping:
 +
<pre>referenceUploader -c -p destination_folder -n short_name -f path/to/reference.fasta --saw='sawriter -welter'</pre>
 +
NOTES
 +
 +
*several references can be added to the same destination_folder
 +
*pick a convenient short name
 +
*for large fasta files, building the index file takes some time (a 611 Mbp cod assembly fasta file took almost 2 hours)
 +
*when I do this, I get warnings about 'SLF4J', it seems these can be ignored
 +
 +
'''Using the reference repository'''
 +
 +
In the ''settings.xml'' file, there is a part saying:
 +
<pre>&lt;param name="reference" hidden="true"&gt;
 +
  &lt;value&gt;common/references/lambda&lt;/value&gt;
 +
&lt;/param&gt;</pre>
 +
Change the '''common/references/lambda''' part to the full path to the destination_folder<br>

Latest revision as of 14:00, 14 April 2016

HOW TO run smrtpipe.py on the command line.

smrtpipe.py is the PacBio pipeline for filtering and mapping PacBio runs, as well as running HGAP. For filtering-only jobs, it splits the raw reads on the adaptor and generates all subreads in fasta or fastq format.

Input:

  • input.fofn, list of files to process (see below)
  • bax.h5 files (or, before 2.0 software version, bas.h5 files) from the run (one per movie), raw-output from the PacBio
  • a settings.xml file, see below, specifies the job. See the end of this page for how to obtain such a xml file

Common steps

Set up environment

module load  smrtanalysis/x.y.z # at the time of writing, 'x.y.z' is 2.0.1

Generate smrtcells.fofn file

find path/to/runfile|grep bax.h5 >smrtcells.fofn #NOTE for pre 2,0 versions, use 'bas.h5'!

Convert this file to correct xml format:

fofnToSmrtpipeInput.py smrtcells.fofn >input.xml

copy a settings xml file, e.g. filter_only_settings.xml, HGAP_settings.xml, ...

rsync /projects/nscdata/scripts/pacbio/smrtpipe_xml_files/x.y.z/desired_settings.xml . 

NOTE won't work within a screen ..?

Actual filtering (NPROC is the number of CPUs the job gets):

smrtpipe.py -D TMP=./ -D SHARED_DIR=./ -D NPROC=24 --params=desired_settings.xml xml:input.xml &> smrtpipe.err

How to obtain the settings.xml file

After a smrtportal upgrade, if it is not yet there in the folder /projects/nscdata/scripts/pacbio/smrtpipe_xml_files

  • In SMRTportal on cod5, set up a job using the protocol you want to run, e.g. RS_Filter_Only, RS_HGAP_Assembly, you don't have to execute it
  • note down the job number (if you can't find it, see below)
  • from the folder /projects/nscdata/cod5-smrtportal/userdata/jobs/016/, find the folder with your job number (it is most likely the job with the highest number)
  • copy the settings.xml file over to /projects/nscdata/scripts/pacbio/smrtpipe_xml_files/x.y.x/ (where x.y.z. is the version of smrtportal)
  • give it a smart name (check the other folders), e.g. filter_only_settings.xml, HGAP_settings.xml

Working with a reference for mapping

This part is based on information from Pacific Biosciences (accessed November 2013).

Some applications require a reference sequence, e.g. mapping for SNP calling and detecting base modifications, or for running Quiver for assembly polishing. smrtanalysis needs a reference repository for this, which you can create with the following command:

referenceUploader -c -p destination_folder -n short_name -f path/to/reference.fasta

For large genomes (probably anything larger than a bacterial genome), modify the command as such, to create a BLASR suffix array to speed up the mapping:

referenceUploader -c -p destination_folder -n short_name -f path/to/reference.fasta --saw='sawriter -welter'

NOTES

  • several references can be added to the same destination_folder
  • pick a convenient short name
  • for large fasta files, building the index file takes some time (a 611 Mbp cod assembly fasta file took almost 2 hours)
  • when I do this, I get warnings about 'SLF4J', it seems these can be ignored

Using the reference repository

In the settings.xml file, there is a part saying:

<param name="reference" hidden="true">
   <value>common/references/lambda</value>
</param>

Change the common/references/lambda part to the full path to the destination_folder