Difference between revisions of "RNASeq: Quality control"

From mn/ibv/bioinfwiki
Jump to: navigation, search
(Created page with "Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads. = Viewing fastq read files = In order...")
 
Line 1: Line 1:
Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.  
+
Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.
 +
 
 +
 
  
 
= Viewing fastq read files =
 
= Viewing fastq read files =
  
In order just to get a glimpse at the sequencing data, standard text commands can be used on Abel, such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span>. Since the read files often are compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), special text commands can be used instead. Therefore, the read files do not have to be unzipped before usage (also, most bioinformatic programs working with fastq read files will accept gz-compressed files directly). Possible text commands include <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> (the latter can thus be used both for compressed and un-compressed text data).  
+
In order just to get a glimpse at the sequencing data, standard text commands can be used on Abel, such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span>. Since the read files often are compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), special text commands can be used instead. Therefore, the read files do not have to be unzipped before usage (also, most bioinformatic programs working with fastq read files will accept gz-compressed files directly). Possible text commands include <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> (the latter can thus be used both for compressed and un-compressed text data).
  
 
= Generating quality reports =
 
= Generating quality reports =
  
 
The <span style="font-family:courier new,courier,monospace;">fastaqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:
 
The <span style="font-family:courier new,courier,monospace;">fastaqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
module load fastqc  
+
module load fastqc
  
 
fastqc /path/to/nameOfReadFile.fastq
 
fastqc /path/to/nameOfReadFile.fastq
 +
</div>
 +
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).
  
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).
+
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastq.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastq.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a borwse, alternatively you can log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide).
 
 
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastq.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastq.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a borwse, alternatively you can log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide).  
 
  
 
= Trimming low-quality reads =
 
= Trimming low-quality reads =
  
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.  
+
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.
  
In order to download and install it on Abel, execute:  
+
In order to download and install it on Abel, execute:
 +
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
 +
mkdir trimmomatic
  
mkdir trimmomatic  
+
cd trimmomatic
  
cd trimmomatic
+
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip]
 
 
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip]  
 
  
 
unzip Trimmomatic-0.33.zip
 
unzip Trimmomatic-0.33.zip
 
+
</div>
 
+
<br/>In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file:
 
+
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;">
In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file:  
+
cd Trimmomatic-0.33
 
 
 
 
 
 
cd Trimmomatic-0.33  
 
  
 
java -jar trimmomatic-0.33.jar
 
java -jar trimmomatic-0.33.jar
 
+
</div>
 
+
<br/>See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].
 
 
See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].
 
  
 
= Assessing the effects of quality trimming =
 
= Assessing the effects of quality trimming =

Revision as of 14:03, 27 May 2015

Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.


Viewing fastq read files

In order just to get a glimpse at the sequencing data, standard text commands can be used on Abel, such as less, more or cat. Since the read files often are compressed (usually ending in a ".gz" extension), special text commands can be used instead. Therefore, the read files do not have to be unzipped before usage (also, most bioinformatic programs working with fastq read files will accept gz-compressed files directly). Possible text commands include zmore, zcat and less (the latter can thus be used both for compressed and un-compressed text data).

Generating quality reports

The fastaqc program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:

module load fastqc

fastqc /path/to/nameOfReadFile.fastq

(Note: fastqc also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).

Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "/path/to" folder will now contain two new, fastqc-generated files: nameOfReadFile_fastq.html and nameOfReadFile_fastq.zip. The *.html file contains the quality report that can be viewed in a browser; the *.zip file contains the same information in a zipped format. To view this report, either copy the *.html file over to your local machine and open it in a borwse, alternatively you can log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide).

Trimming low-quality reads

If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the Trimmomatic program (http://www.usadellab.org/cms/?page=trimmomatic) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.

In order to download and install it on Abel, execute:

mkdir trimmomatic

cd trimmomatic

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip

unzip Trimmomatic-0.33.zip


In order to execute Trimmomatic, step into the "Trimmomatic-0.33" sub-folder and use Java to start the .jar file:

cd Trimmomatic-0.33

java -jar trimmomatic-0.33.jar


See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: http://en.wikipedia.org/wiki/Phred_quality_score.

Assessing the effects of quality trimming

After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If loosing a great amount of data, possibly the quality trimming should be repeated with less stringent values (ideally, obviously, one should decide a priori on acceptable quality values, and stick with these...)

One simple thing to do is to compare the read file sizes before and after trimming. Use the "ls -lh" command so as to get better human-readable file size numbers.

This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:

zcat /path/to/readFile.fastq.gz | grep "^+$" | wc

(Here, the grep command, reading the un-compressed fastq file, is looking for line consisting only of the "+" character, used in the fastq format to separate the sequence from the quality values. These occurencies are passed into the word-count command wc.)