RNASeq: Quality control

From mn/ibv/bioinfwiki
Jump to: navigation, search

Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.

Viewing fastq read files

In order just to get a glimpse at the sequencing data, standard text commands can be used on Abel, such as less, more or cat. Since the read files often are compressed (usually ending in a ".gz" extension), special text commands can be used instead. Therefore, the read files do not have to be unzipped before usage (also, most bioinformatic programs working with fastq read files will accept gz-compressed files directly). Possible text commands include zmore, zcat and less (the latter can thus be used both for compressed and un-compressed text data).

Generating quality reports

The fastaqc program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:

module load fastqc

fastqc /path/to/nameOfReadFile.fastq

(Note: fastqc also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).

Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "/path/to" folder will now contain two new, fastqc-generated files: nameOfReadFile_fastq.html and nameOfReadFile_fastq.zip. The *.html file contains the quality report that can be viewed in a browser; the *.zip file contains the same information in a zipped format. To view this report, either copy the *.html file over to your local machine and open it in a borwse, alternatively you can log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide).

Trimming low-quality reads

If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the Trimmomatic program (http://www.usadellab.org/cms/?page=trimmomatic) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.

In order to download and install it on Abel, execute:

mkdir trimmomatic

cd trimmomatic

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip

unzip Trimmomatic-0.33.zip


In order to execute Trimmomatic, step into the "Trimmomatic-0.33" sub-folder and use Java to start the .jar file:

cd Trimmomatic-0.33

java -jar trimmomatic-0.33.jar


See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: http://en.wikipedia.org/wiki/Phred_quality_score.

Assessing the effects of quality trimming

After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If loosing a great amount of data, possibly the quality trimming should be repeated with less stringent values (ideally, obviously, one should decide a priori on acceptable quality values, and stick with these...)

One simple thing to do is to compare the read file sizes before and after trimming. Use the "ls -lh" command so as to get better human-readable file size numbers.

This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:

zcat /path/to/readFile.fastq.gz | grep "^+$" | wc

(Here, the grep command, reading the un-compressed fastq file, is looking for line consisting only of the "+" character, used in the fastq format to separate the sequence from the quality values. These occurencies are passed into the word-count command wc.)