Difference between revisions of "RNASeq: Mapping reads to a reference sequence"

From mn/ibv/bioinfwiki
Jump to: navigation, search
(Created page with "= Introduction = On of the most popular mapping algorithms is the Bowtie program. However, this program cannot map reads that are spanning two exons. In order to enable mappi...")
 
Line 5: Line 5:
 
= Mapping reads with Tophat =
 
= Mapping reads with Tophat =
  
Since Tophat uses the Bowtie algorithm for the mapping of reads to a reference sequences, both programs must be installed for Tophat to be working. Both of these program are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.  
+
Since Tophat uses the Bowtie algorithm for the mapping of reads to a reference sequences, both programs must be installed for Tophat to be working. Both of these program are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.
  
 
== Dealing with spaces in fasta sequence headers ==
 
== Dealing with spaces in fasta sequence headers ==
  
Before starting the mapping process, is is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason for this is that some programs treat the fasta header section before the first space as the identifier, the rest is treated as a comment. This seems to be the case on for instance in biopython, or in the samtools programs, which are used by Tophat. Confusingly, the Picard programs that often are used together with samtools treat spaces differently. In order to avoid this problem, it may be a good idea to remove spaces in fasta headers (see here SMRT_Analysis:_Mapping_reads_to_a_reference). Notice however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.  
+
Before starting the mapping process, is is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason for this is that some programs treat the fasta header section before the first space as the identifier, the rest is treated as a comment. This seems to be the case on for instance in biopython, or in the samtools programs, which are used by Tophat. Confusingly, the Picard programs that often are used together with samtools treat spaces differently. In order to avoid this problem, it may be a good idea to remove spaces in fasta headers (see here SMRT_Analysis:_Mapping_reads_to_a_reference). Notice however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.
  
 
== Creating the bowtie index ==
 
== Creating the bowtie index ==
  
Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension; rename the file if this is not the case. <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension:  
+
Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension; rename the file if this is not the case. <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension:
 +
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 +
module load bowtie2
  
module load bowtie2
+
bowtie2-build /path/to/refSequence.fasta /path/to/refSequence
 
 
bowtie2-build /path/to/refSequence.fasta /path/to/refSequence  
 
  
 
ls -l
 
ls -l
 
+
</div>
 
This will build the index, displaying the index files (files ending with a ".bt2" extension).
 
This will build the index, displaying the index files (files ending with a ".bt2" extension).
  
Line 26: Line 26:
  
 
A default exection of Tophat needs an output folder, the reference sequence index files, at least one read file:
 
A default exection of Tophat needs an output folder, the reference sequence index files, at least one read file:
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 
module load tophat/2.0.14
 
module load tophat/2.0.14
  
 
tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz
 
tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz
 
+
</div>
 
It is often necessary to include more files and options, for instance, a GTF annotation file can be included so as to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Especial care must be taken if using strand-specific sequencing data, see here for an overview.
 
It is often necessary to include more files and options, for instance, a GTF annotation file can be included so as to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Especial care must be taken if using strand-specific sequencing data, see here for an overview.
  
Line 50: Line 50:
  
 
In order to obtain the insert size values, do:
 
In order to obtain the insert size values, do:
 +
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 +
module load R
  
module load R
+
module load picard-tools
 
 
module load picard-tools  
 
  
 
java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt
 
java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt
  
 
more /path/to/outputFile.txt
 
more /path/to/outputFile.txt
 
+
</div>
 
<br/>This will display the insert size metrics, including the average insert size and the standard deviation. The <span style="font-family:courier new,courier,monospace;">/path/to/histogramFile.pdf</span> file can be transferred to the loacl machine and viewed in any PDF viewer.
 
<br/>This will display the insert size metrics, including the average insert size and the standard deviation. The <span style="font-family:courier new,courier,monospace;">/path/to/histogramFile.pdf</span> file can be transferred to the loacl machine and viewed in any PDF viewer.

Revision as of 11:01, 28 May 2015

Introduction

On of the most popular mapping algorithms is the Bowtie program. However, this program cannot map reads that are spanning two exons. In order to enable mapping of such reads, and with the added benefit of being able to confidently annotate exon-intron boundaries, the Tophat program has been created. It uses Bowtie as the underlying mapping algorithm, but is able to assign reads not mapped directly by Bowtie to distinct exons. For details, pleae see the Tophat manual (https://ccb.jhu.edu/software/tophat/manual.shtml).

Mapping reads with Tophat

Since Tophat uses the Bowtie algorithm for the mapping of reads to a reference sequences, both programs must be installed for Tophat to be working. Both of these program are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.

Dealing with spaces in fasta sequence headers

Before starting the mapping process, is is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason for this is that some programs treat the fasta header section before the first space as the identifier, the rest is treated as a comment. This seems to be the case on for instance in biopython, or in the samtools programs, which are used by Tophat. Confusingly, the Picard programs that often are used together with samtools treat spaces differently. In order to avoid this problem, it may be a good idea to remove spaces in fasta headers (see here SMRT_Analysis:_Mapping_reads_to_a_reference). Notice however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.

Creating the bowtie index

Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The bowtie2-build program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a ".fa" extension; rename the file if this is not the case. bowtie2-build also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the ".fa" extension:

module load bowtie2

bowtie2-build /path/to/refSequence.fasta /path/to/refSequence

ls -l

This will build the index, displaying the index files (files ending with a ".bt2" extension).

Running Tophat

A default exection of Tophat needs an output folder, the reference sequence index files, at least one read file:

module load tophat/2.0.14

tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz

It is often necessary to include more files and options, for instance, a GTF annotation file can be included so as to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Especial care must be taken if using strand-specific sequencing data, see here for an overview.

Estimating the insert size and standard deviation

If using paired-end sequencing, the --mate-inner-dist and the --mate-std-dev parameters should be specified. Your sequencing provider should be able to report these values. If not, they can be estimated by performing a mapping using Bowtie, and calculating the values using the CollectInsertSizeMetrics program in the Picard package. The Bowtie mapping must be done with the transcriptome, not the genome, as the reference sequence (in order to avoid that introns distort the insert size estimates). If a trancsriptome file is not available, it can be created from a GFF3 annotation file and the genomic sequences file using the gffread utility. This program is part of the cufflinks module which can be loaded as:

module load cufflinks

(cufflinks is not displayed when using the module avail statement, but is nevertheless installed on Abel!)

See the gffread manual for details: http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility

It is not necessary to use all reads for getting an insert size estimate. In order to extract 10000 reads from a fastq.gz read file, use:

zcat /path/to/readFile.fastq.gz | head -40000 > /path/to/outputFile.fastq

Use Bowtie together with the transcriptome and the resulting forward and reverse read files so as to create the BAM file.

In order to obtain the insert size values, do:

module load R

module load picard-tools

java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt

more /path/to/outputFile.txt


This will display the insert size metrics, including the average insert size and the standard deviation. The /path/to/histogramFile.pdf file can be transferred to the loacl machine and viewed in any PDF viewer.