PBcR: Input file formats

From mn/ibv/bioinfwiki
Latest revision as of 20:49, 29 April 2015

PBcR is built on the Celera Assembler (CA), and its input file formats reflect this. The two main input file types are the files containing reads and the spec file containing input parameters for CA.

Input formats for read files

PacBio reads

The PacBio reads that are to be error-corrected (and subsequently used in the assembly) must be in the fastq file format. PacBio reads formatted as a fasta file can be converted to the fastq format using the convertFastaAndQualToFastq.jar program (see the PBcR: Installation page):

java -jar convertFastaAndQualToFastq.jar /path/to/reads.fasta > /path/to/reads.fastq

This command re-formats the fasta reads as fastq reads, using the character 'I' as the quality value (corresponding to a quality score of 40). If working with PacBio's HDF5 format (i.e. bas.h5 or bax.h5 files), reads can be extracted using the bash5tools.py script, which is part of PacBio's SMRT Analysis package (installed on Abel):

module load smrtanalysis/2.3.0

bash5tools.py --readType subreads --outType fastq /path/to/bas.h5

This will create a fastq file with the same file name as the bas.h5 file, but ending with a ".fastq" extension. For details, see the SMRT Analysis: Read filtering page.
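To make the 'I' quality value from the fasta conversion concrete, the following is a minimal shell sketch, not the implementation of convertFastaAndQualToFastq.jar. The function names phred33_score and fasta_to_fastq_sketch are hypothetical, and the sketch assumes the Sanger (Phred+33) quality encoding, in which a character's ASCII code minus 33 gives the quality score.

```shell
# Decode a single Phred+33 quality character to its numeric score.
phred33_score() {
    # A leading quote makes printf print the ASCII code of the character.
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

# Illustrative fasta-to-fastq conversion: every base gets the quality "I".
# Sequences wrapped over several lines are joined into one line.
fasta_to_fastq_sketch() {
    awk '
        function flush_record(   qual) {
            if (name == "") return
            qual = seq
            gsub(/./, "I", qual)      # one "I" per base
            print "@" name
            print seq
            print "+"
            print qual
        }
        /^>/ { flush_record(); name = substr($0, 2); seq = ""; next }
             { seq = seq $0 }
        END  { flush_record() }
    ' "$1"
}

phred33_score I    # "I" is ASCII 73, and 73 - 33 = 40
```

Calling fasta_to_fastq_sketch /path/to/reads.fasta prints the fastq records to standard output, so the result can be redirected to a file in the same way as with the java command.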

Illumina short reads

The Illumina short reads used to error-correct the long PacBio reads must be described in a CA fragment (*.frg) file. It is important to understand that a fragment file only contains a description of the read file it refers to; it does not itself contain any read data. The fragment file only points to the location where the actual read data is stored. This means that moving, renaming or deleting the Illumina file after including it in a fragment file will cause PBcR to fail, because the program cannot find the read file referred to.

Creating the fragment file can be accomplished using the fastqToCA script contained in the PBcR package (i.e. in the 'bin' subfolder):

PBcR/wgs-8.3rc1/Linux-amd64/bin/fastqToCA -libraryname illumina -type sanger -innie -reads /path/to/reads.fastq > /path/to/reads.frg

This example creates a fragment file for single-end reads. If using paired-end reads instead:

PBcR/wgs-8.3rc1/Linux-amd64/bin/fastqToCA -libraryname illumina -type sanger -innie -insertsize 200 50 -mates /path/to/fwdReads.fastq,/path/to/revReads.fastq > /path/to/reads.frg

(Notice that you need to specify the mean insert size together with its standard deviation, here 200 and 50, and that the two mate files are separated by a comma without a space.)

Run the fastqToCA script without any arguments to display the help text, which explains the input parameters:

PBcR/wgs-8.3rc1/Linux-amd64/bin/fastqToCA

A similar script is available for the fasta format: fastaToCA.

(Notice that the script pacBioToCA performs error correction, rather than re-formatting reads as a fragment file).
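Because a fragment file only points to the read files, it can be useful to check that those files still exist before starting PBcR. The following is a minimal sketch under a stated assumption: that the read paths appear in the .frg file on lines starting with "fastqReads=" or "fastqMates=" (mates separated by a comma). Verify this against a fragment file you created yourself. The function name missing_frg_reads is hypothetical.

```shell
# Hypothetical helper: print every read file referenced by a .frg file
# that can no longer be found on disk.
# ASSUMPTION: paths are stored on "fastqReads=" / "fastqMates=" lines.
missing_frg_reads() {
    grep -E '^fastq(Reads|Mates)=' "$1" |
        cut -d= -f2- |              # keep everything after the first "="
        tr ',' '\n' |               # one path per line for mate pairs
        while IFS= read -r path; do
            [ -e "$path" ] || echo "$path"
        done
}
```

Running missing_frg_reads /path/to/reads.frg prints nothing if all referenced files are in place, and one line per missing file otherwise.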

Input format of the 'spec' file

CA is a complicated algorithm that can be controlled by a plethora of input parameters. To make it easier to use, most input parameters can be included in a specifications ('spec') file. This file consists of pairs of input parameters and their values, written as parameter=value. On a single system, no parameter adjustment should be necessary: PBcR will detect your available resources and adjust parameters to fill the memory/CPUs on your system.

The following depicts a minimal spec file, which will run on Abel:

merSize=14

If performing read correction and assembly of large (>100 Mb) genomes, the following spec file is recommended:

merSize=14

maxGap=1500

ovlHashBlockLength=1000000000

ovlRefBlockLength=1000000000

blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24
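As a convenience, the recommended spec file for large genomes can be written from the shell with a here-document. This is only a sketch: the function name write_large_genome_spec and the file name pacbio.spec are hypothetical, while the parameter lines are copied unchanged from the listing above, one parameter=value pair per line.

```shell
# Write the large-genome spec file to the path given as first argument.
write_large_genome_spec() {
    # The quoted 'EOF' keeps the shell from expanding anything in the body.
    cat > "$1" <<'EOF'
merSize=14
maxGap=1500
ovlHashBlockLength=1000000000
ovlRefBlockLength=1000000000
blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24
EOF
}

write_large_genome_spec pacbio.spec   # "pacbio.spec" is a hypothetical file name
```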