RNASeq: Mapping reads to a reference sequence

From mn/ibv/bioinfwiki
Jump to: navigation, search


On of the most popular mapping algorithms is the Bowtie program. However, this program cannot map reads that are spanning two exons. In order to enable mapping of such reads, and with the added benefit of being able to confidently annotate exon-intron boundaries, the Tophat program has been created. It uses Bowtie as the underlying mapping algorithm, but is able to assign reads not mapped directly by Bowtie to distinct exons. For details, pleae see the Tophat manual (https://ccb.jhu.edu/software/tophat/manual.shtml).

Mapping reads with Tophat

Since Tophat uses the Bowtie algorithm for the mapping of reads to a reference sequences, both programs must be installed for Tophat to be working. Both of these program are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.

Dealing with spaces in fasta sequence headers

Before starting the mapping process, is is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason for this is that some programs treat the fasta header section before the first space as the identifier, the rest is treated as a comment. This seems to be the case on for instance in biopython, or in the samtools programs, which are used by Tophat. Confusingly, the Picard programs that often are used together with samtools treat spaces differently. In order to avoid this problem, it may be a good idea to remove spaces in fasta headers (see here). Notice however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.

Creating the bowtie index

Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The bowtie2-build program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a ".fa" extension; rename the file if this is not the case. bowtie2-build also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the ".fa" extension:

module load bowtie2

bowtie2-build /path/to/refSequence.fasta /path/to/refSequence

ls -l

This will build the index, displaying the index files (files ending with a ".bt2" extension).

Running Tophat

A default exection of Tophat needs an output folder, the reference sequence index files, at least one read file:

module load tophat/2.0.14

tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz

It is often necessary to include more files and options, for instance, a GTF annotation file can be included so as to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Especial care must be taken if using strand-specific sequencing data, see here for an overview.

Estimating the insert size and standard deviation

If using paired-end sequencing, the --mate-inner-dist and the --mate-std-dev parameters should be specified. Your sequencing provider should be able to report these values. If not, they can be estimated by performing a mapping using Bowtie, and calculating the values using the CollectInsertSizeMetrics program in the Picard package. The Bowtie mapping must be done with the transcriptome, not the genome, as the reference sequence (in order to avoid that introns distort the insert size estimates). If a trancsriptome file is not available, it can be created from a GFF3 annotation file and the genomic sequences file using the gffread utility. This program is part of the cufflinks module which can be loaded as:

module load cufflinks

(cufflinks is not displayed when using the module avail statement, but is nevertheless installed on Abel!)

See the gffread manual for details: http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility

It is not necessary to use all reads for getting an insert size estimate. In order to extract 10000 reads from a fastq.gz read file, use:

zcat /path/to/readFile.fastq.gz | head -40000 > /path/to/outputFile.fastq

Use Bowtie together with the transcriptome and the resulting forward and reverse read files so as to create the BAM file.

In order to obtain the insert size values, do:

module load R

module load picard-tools

java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt

more /path/to/outputFile.txt

This will display the insert size metrics, including the average insert size and the standard deviation. The /path/to/histogramFile.pdf file can be transferred to the loacl machine and viewed in any PDF viewer.