RNASeq: Dealing with stranded sequencing data

From mn/ibv/bioinfwiki
Revision as of 15:15, 26 May 2015 by Ralfne@uio.no (talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Introduction

If using strand-specific sequencing, it becomes very important to make sure the coverages for the features-of-interest (usually genes) are calculated along the correct strand. Getting this wrong will mean that the subsequent step of finding differentially expressed genes is finding differentially expressed anti-sense transcripts instead of the actual genes. It may not be easy to identify such a problem at a latter stage, since all subsequent information consists of abstract numbers and identifiers.

The correct way of counting strand-specific sequencing data depends on:

  • the type of molecule sequenced (for instance, cDNAs)
  • the sequencing technology used (for instance, Illumina)
  • the particular sequencing kit used (for instance, lluminaTruSeq Stranded Total RNA Sample Prep Kit with paired-end sequencing)
  • the settings used when mapping the reads to the reference. For the Tophat aligner, the two critical parameters are --library-type, and the order of passing in the forward and reverse read files.
  • the settings used when calculating the read counts.

Example

The following describes a situation where the Illumina TruSeq kit was used to obtain stranded paired-end RNASeq data genereted from cDNA molecules.

Mapping the RNASeq reads to the reference sequence

Since sequencing was done using the TruSeq kit, the Tophat --library-type parameter must be set to fr-firststrand. Also, the reverse read file (R2.fastq.gz) is passed into Tophat before the forward read file (R1.fastq.gz). This is according to the INFBIO9120 tutorial, which states:

Please note that we are providing R2 files first and R1 reads last, although the Tophat2 documentation says we should provide R1 first and R2 second. We do this because the protocol used to produce these data sequence the cDNA, and by switching R1 and R2 we will provide the correct strand orientation to the aligner.

The command to execute thus becomes:

tophat -o /path/to/outputFolder --library-type fr-firststrand --GTF /path/to/annotation.gtf /path/to/referenceSequenceBowtieIndex /path/to/R2.fastq.gz /path/to/R1.fastq.gz > /path/to/log.txt

Visualizing the mapped reads

Even though not strictly required, is is a good idea to visualize the mapped reads, so as to better understand the subsequent steps. In this example, the ordering of the read pairs along the reference sequence determines which strand the seuqenced fragment originated from. For genes on the sense strand (i.e. for genes on the "+" strand, with their 5' ends left and their 3' ends right), the forward read (R1) is the first (left-most) read of the read pair. In Tablet, this is visualized with the "green" R1 reads coming before the "blue" R2 reads (of the same read pair):

Tablet stranded reads.jpg

(In IGV, the read ordering will be designated as "F1R2"). The R1 reads will be identical to the reference sequence; the R2 reads will be the reverse-complement of the reference sequence. For genes on the anti-sense strand (the "-" strand), the read ordering will be the opposite. In Tablet, the "blue" R2 read of a given read pair will come before (i.e. be situated left of) the "green" R1 read. (In IGV, the read ordering will be designated as "F2R1").

Counting reads

Using R and the summarizeOverlaps function

If using R, the "summarizeOverlaps" method (used in the INFBIO9120 tutorial) will now give correct read counts. This can be verified as follows:

Using Tablet (or IGV), identify a region with only a few reads, where all the read pairs are present in the same orientation (i.e. transcription is only seen for one strand). Create a test annotation file that spans this region. In the following gff3 example, replace the <ContigName> with the name of the contig containing the selected region, and <from> and <to> with the numeric coordiantes defining the region:

<ContigName> testSource mRNA <from> <to> . + . ID=ID1;Parent=ID1;Name=ID1

<ContigName> testSource exon <from> <to> . + . Parent=ID1

(Use Tablet to make sure the annotation is created correctly)

The following R code will count the reads contained in the defined region:

library(ShortRead)

library(rtracklayer)

aln <- readGAlignmentPairsFromBam('/path/to/BAMfile.bam', use.names=T)

genes<-import('/path/to/testAnnotationFile.gff3', asRangedData=FALSE)

splitgenes<-split(genes,genes$ID)

genehits<-summarizeOverlaps(splitgenes,aln,mode="Union",singleEnd=FALSE, ignore.strand=FALSE)

counts<-assays(genehits)$counts

counts

This will print the number of reads (of rather, number of fragments) that map to the defined region (it will be zero if the reads stem from the anti-sense strand - the test annotation was created for the "+" strand).

Using HTSeq-count

If using the HTSeq-count python program for read counting, the --stranded parameter must be set to "yes" to count the stranded data in our example correctly. Alternatively, Tophat can be used with the R1 reads before the R2 reads, followed by import into HTSeq-count using the --stranded=reverse parameter (this seems to be the standard way of using HTSeq, according to the creators of HTSeq).