https://wiki.uio.no/mn/ibv/bioinfwiki/api.php?action=feedcontributions&user=Ralfne%40uio.no&feedformat=atommn/ibv/bioinfwiki - User contributions [en]2024-03-28T16:57:53ZUser contributionsMediaWiki 1.27.4https://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Removing_singletons&diff=265Removing singletons2015-08-24T12:20:48Z<p>Ralfne@uio.no: </p>
<hr />
<div>It is often desirable to remove clusters with few sequences (either singletons, i.e. clusters with only one sequence, or clusters with a low number of sequences). This can be done using USEARCH.<br />
<br />
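As an illustration of the filtering idea itself, independent of any specific tool, the following minimal Python sketch removes clusters below a size threshold. It is not the implementation used by USEARCH or OTUClusterSizeFiltering; the function name, the dictionary layout (cluster ID mapped to a list of member sequence IDs) and the example data are assumptions made for this example.

```python
def filter_clusters_by_size(clusters, min_size=2):
    """Return only the clusters that contain at least min_size sequences.

    clusters is assumed to map a cluster ID to a list of member sequence IDs
    (a hypothetical in-memory layout, not a real file format).
    """
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        if len(members) >= min_size
    }


# Hypothetical example data: OTU_2 is a singleton.
clusters = {
    "OTU_1": ["seqA", "seqB", "seqC"],
    "OTU_2": ["seqD"],
    "OTU_3": ["seqE", "seqF"],
}

# min_size=2 removes singletons only; a larger value removes more small clusters.
kept = filter_clusters_by_size(clusters, min_size=2)
print(sorted(kept))
```

Raising min_size removes all clusters with fewer sequences than that value, which corresponds to removing "clusters with a low number of sequences" rather than singletons only.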
Alternatively, our in-house Java program 'OTUClusterSizeFiltering' can be used to remove such clusters. This program runs on Linux, Windows or MacOS. Download it [http://www.mn.uio.no/ibv/bioportal/software/otuclustersizefiltering/ here], and run it without arguments to get usage information:<br />
java -Xmx2G -jar OTUClusterSizeFiltering.jar<br />
(The -Xmx2G option allows the program to use 2 gigabytes of memory; increase it if necessary.)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Removing_singletons&diff=264Removing singletons2015-08-24T12:20:00Z<p>Ralfne@uio.no: </p>
<hr />
<div>It is often desirable to remove clusters with few sequences (either singletons, i.e. clusters with only one sequence, or clusters with a low number of sequences). This can be done using USEARCH.<br />
<br />
Alternatively, our in-house Java program 'OTUClusterSizeFiltering' can be used to remove such clusters. This program runs on Linux, Windows or MacOS. Download it [[here]], and run it without arguments to get usage information:<br />
java -Xmx2G -jar OTUClusterSizeFiltering.jar<br />
(The -Xmx2G option allows the program to use 2 gigabytes of memory; increase it if necessary.)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Removing_singletons&diff=263Removing singletons2015-08-24T12:12:27Z<p>Ralfne@uio.no: </p>
<hr />
<div>It is often desirable to remove clusters with few sequences (either singletons, i.e. clusters with only one sequence, or clusters with a low number of sequences). This can be done using USEARCH.<br />
<br />
Alternatively, our in-house Java program 'OTUClusterSizeFiltering' can be used to remove such clusters. This program runs on Linux, Windows or MacOS. Download it here, and run it without arguments to get usage information:<br />
java -Xmx2G -jar OTUClusterSizeFiltering.jar<br />
(The -Xmx2G option allows the program to use 2 gigabytes of memory; increase it if necessary.)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Removing_singletons&diff=262Removing singletons2015-08-24T12:09:54Z<p>Ralfne@uio.no: Created page with "It is often desirable to remove clusters with few sequences (either singletons, i.e. clusters with only one sequence, or clusters with a low number of sequences) . This can be..."</p>
<hr />
<div>It is often desirable to remove clusters with few sequences (either singletons, i.e. clusters with only one sequence, or clusters with a low number of sequences). This can be done using USEARCH.<br />
<br />
Alternatively, our in-house Java program 'OTUClusterSizeFiltering' can be used to remove such clusters. This program runs on Linux, Windows or MacOS. Download it here, and run it without arguments to get usage information:<br />
<br />
java -Xmx2G -jar OTUClusterSizeFiltering.jar<br />
<br />
(The -Xmx2G option allows the program to use 2 gigabytes of memory; increase it if necessary.)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=261Amplicon sequencing2015-08-24T12:09:37Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
[[Quality control of fastq read files]]<br />
<br />
[[Paired-end read merging]]<br />
<br />
Removal of PCR primer sequences<br />
<br />
Converting fastq files into fasta files<br />
<br />
De-replication of reads<br />
<br />
[[Clustering of reads]]<br />
<br />
[[Removing singletons]]<br />
<br />
Selecting representative sequences for clusters<br />
<br />
Chimera-checking<br />
<br />
Assigning taxonomical identifiers<br />
<br />
Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Clustering_of_reads&diff=260Clustering of reads2015-08-24T12:02:19Z<p>Ralfne@uio.no: </p>
<hr />
<div>Qiime does not contain an internal clustering algorithm. Rather, an external program must be specified; Qiime sends the data to be clustered to the specified program and subsequently imports the resulting clusters.<br />
<br />
Here, the user might experience problems with high-volume sequencing data. The free 32-bit version of USEARCH, which is often used as the external program, cannot deal with massive sequencing data (the 64-bit version of USEARCH is not free and is not installed on Abel).<br />
<br />
The vsearch program may be used for clustering instead. See<br />
<br />
https://github.com/torognes/vsearch<br />
<br />
for information about this program. It is not installed on Abel; to use it, download it and grant it execute permission. Execute the program without arguments to display the usage information:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
wget https://github.com/torognes/vsearch/releases/download/v1.1.3/vsearch-1.1.3-linux-x86_64<br />
<br />
chmod 755 vsearch-1.1.3-linux-x86_64<br />
<br />
./vsearch-1.1.3-linux-x86_64<br />
</div><br />
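Once the binary has been downloaded, a clustering run can be assembled along the following lines. This is only a sketch under stated assumptions: the input and output file names and the 0.97 identity threshold are hypothetical, and the options used (--cluster_fast, --id, --centroids, --uc) should be checked against the usage information printed by the downloaded vsearch version. The Python helper only builds and prints the command; it does not run it.

```python
import shlex


def build_vsearch_cluster_command(binary, reads_fasta, identity,
                                  centroids_fasta, uc_file):
    """Build (but do not run) a vsearch clustering command as an argument list.

    All file names are supplied by the caller; the identity threshold is a
    fraction between 0 and 1.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [
        binary,
        "--cluster_fast", reads_fasta,    # input sequences to cluster
        "--id", str(identity),            # pairwise identity threshold
        "--centroids", centroids_fasta,   # output: one centroid per cluster
        "--uc", uc_file,                  # output: cluster membership table
    ]


# Hypothetical file names and threshold, chosen for illustration only.
cmd = build_vsearch_cluster_command(
    "./vsearch-1.1.3-linux-x86_64",
    "reads.fasta",
    0.97,
    "centroids.fasta",
    "clusters.uc",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell or a job script; passing the list to subprocess.run would execute it directly.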
<br />
The latest version of Qiime (v1.9.1) allows the usage of the SWARM algorithm instead of USEARCH. This program, however, is not installed on Abel yet.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Clustering_of_reads&diff=259Clustering of reads2015-08-24T12:01:26Z<p>Ralfne@uio.no: Created page with "Qiime does not contain an internal clustering algorithm. Rather, an external program must be specified; Qiime will send the data to be clustered to the sepcified program and s..."</p>
<hr />
<div>Qiime does not contain an internal clustering algorithm. Rather, an external program must be specified; Qiime sends the data to be clustered to the specified program and subsequently imports the resulting clusters.<br />
<br />
Here, the user might experience problems with high-volume sequencing data. The free 32-bit version of USEARCH, which is often used as the external program, cannot deal with massive sequencing data (the 64-bit version of USEARCH is not free and is not installed on Abel).<br />
<br />
The vsearch program may be used for clustering instead. See<br />
<br />
https://github.com/torognes/vsearch<br />
<br />
for information about this program. It is not installed on Abel; to use it, download it and grant it execute permission. Execute the program without arguments to display the usage information:<br />
<br />
wget https://github.com/torognes/vsearch/releases/download/v1.1.3/vsearch-1.1.3-linux-x86_64<br />
<br />
chmod 755 vsearch-1.1.3-linux-x86_64<br />
<br />
./vsearch-1.1.3-linux-x86_64<br />
<br />
<br />
The latest version of Qiime (v1.9.1) allows the usage of the SWARM algorithm instead of USEARCH. This program, however, is not installed on Abel yet.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=258Amplicon sequencing2015-08-24T12:01:07Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
[[Quality control of fastq read files]]<br />
<br />
[[Paired-end read merging]]<br />
<br />
Removal of PCR primer sequences<br />
<br />
Converting fastq files into fasta files<br />
<br />
De-replication of reads<br />
<br />
[[Clustering of reads]]<br />
<br />
Removing singletons<br />
<br />
Selecting representative sequences for clusters<br />
<br />
Chimera-checking<br />
<br />
Assigning taxonomical identifiers<br />
<br />
Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Paired-end_read_merging&diff=257Paired-end read merging2015-08-24T11:59:29Z<p>Ralfne@uio.no: </p>
<hr />
<div>Paired-end read merging can be performed with the Paired-End reAd mergeR (PEAR) program, see <br />
<br />
https://github.com/xflouris/PEAR<br />
<br />
for more information.<br />
<br />
In order to use PEAR on Abel, download the precompiled 64-bit program archive, unpack it and run the program without arguments to display usage information:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz<br />
<br />
tar -xzvf pear-0.9.6-bin-64.tar.gz<br />
<br />
./pear-0.9.6-bin-64<br />
</div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Paired-end_read_merging&diff=256Paired-end read merging2015-08-24T11:58:51Z<p>Ralfne@uio.no: Created page with "Paired-end read merging can be performed with the Paired-End reAd mergeR (PEAR) program, see https://github.com/xflouris/PEAR for more information. In order to use PEAR on..."</p>
<hr />
<div>Paired-end read merging can be performed with the Paired-End reAd mergeR (PEAR) program, see <br />
<br />
https://github.com/xflouris/PEAR<br />
<br />
for more information.<br />
<br />
In order to use PEAR on Abel, download the precompiled 64-bit program archive, unpack it and run the program without arguments to display usage information:<br />
<br />
wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz<br />
<br />
tar -xzvf pear-0.9.6-bin-64.tar.gz<br />
<br />
./pear-0.9.6-bin-64</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=255Amplicon sequencing2015-08-24T11:58:25Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
[[Quality control of fastq read files]]<br />
<br />
[[Paired-end read merging]]<br />
<br />
-Removal of PCR primer sequences<br />
<br />
-Converting fastq files into fasta files<br />
<br />
-De-replication of reads<br />
<br />
-Clustering of reads<br />
<br />
-Removing singletons<br />
<br />
-Selecting representative sequences for clusters<br />
<br />
-Chimera-checking<br />
<br />
-Assigning taxonomical identifiers<br />
<br />
-Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Quality_control_of_fastq_read_files&diff=254Quality control of fastq read files2015-08-24T11:57:34Z<p>Ralfne@uio.no: </p>
<hr />
<div>The first step of any analysis involving high-throughput sequencing data should consist of assessing the quality of the read data, possibly followed by the removal of low-quality reads.<br />
<br />
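To make the idea of removing low-quality reads concrete, here is a minimal Python sketch that keeps only reads whose mean Phred score reaches a threshold. It is not how Qiime, FastQC or Trimmomatic work internally; it assumes plain four-line FASTQ records with Phred+33 encoded qualities, and the function names, threshold and example reads are made up for illustration.

```python
import io


def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoded qualities."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def filter_fastq(handle, min_mean_quality=20.0):
    """Yield (header, sequence, plus, quality) for reads passing the threshold.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line) ends the input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Skip records without quality values instead of dividing by zero.
        if quality and mean_phred(quality) >= min_mean_quality:
            yield header, sequence, plus, quality


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = [record[0] for record in filter_fastq(example)]
print(kept)
```

Dedicated tools additionally trim adapters and low-quality read ends, which a whole-read mean filter like this one does not attempt.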
It is possible to filter sequencing reads using Qiime, but a comprehensive report is not produced in the process. The FastQC program may be used to produce such a report, optionally followed by read filtering using the Trimmomatic program.<br />
<br />
See [https://wiki.uio.no/mn/ibv/bioinfwiki/index.php/RNASeq:_Quality_control here] for an overview.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Quality_control_of_fastq_read_files&diff=253Quality control of fastq read files2015-08-24T11:57:23Z<p>Ralfne@uio.no: </p>
<hr />
<div>The first step of any analysis involving high-throughput sequencing data should consist of assessing the quality of the read data, possibly followed by the removal of low-quality reads.<br />
<br />
<br />
It is possible to filter sequencing reads using Qiime, but a comprehensive report is not produced in the process. The FastQC program may be used to produce such a report, optionally followed by read filtering using the Trimmomatic program.<br />
<br />
See [https://wiki.uio.no/mn/ibv/bioinfwiki/index.php/RNASeq:_Quality_control here] for an overview.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Quality_control_of_fastq_read_files&diff=252Quality control of fastq read files2015-08-24T11:57:00Z<p>Ralfne@uio.no: </p>
<hr />
<div>The first step of any analysis involving high-throughput sequencing data should consist of assessing the quality of the read data, possibly followed by the removal of low-quality reads.<br />
<br />
<br />
It is possible to filter sequencing reads using Qiime, but a comprehensive report is not produced in the process. The FastQC program may be used to produce such a report, optionally followed by read filtering using the Trimmomatic program.<br />
<br />
See [https://wiki.uio.no/mn/ibv/bioinfwiki/index.php/RNASeq:_Quality_control here] for an overview.<br />
<br />
<br />
https://wiki.uio.no/mn/ibv/bioinfwiki/index.php/RNASeq:_Quality_control</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Quality_control_of_fastq_read_files&diff=251Quality control of fastq read files2015-08-24T11:56:14Z<p>Ralfne@uio.no: Created page with "The first step of any analysis involving high-throughput sequencing data should consist of assessing the quality of the read data, possibly followed by the removal of low-qua..."</p>
<hr />
<div>The first step of any analysis involving high-throughput sequencing data should consist of assessing the quality of the read data, possibly followed by the removal of low-quality reads.<br />
<br />
<br />
It is possible to filter sequencing reads using Qiime, but a comprehensive report is not produced in the process. The FastQC program may be used to produce such a report, optionally followed by read filtering using the Trimmomatic program.<br />
<br />
See here for an overview.<br />
<br />
<br />
https://wiki.uio.no/mn/ibv/bioinfwiki/index.php/RNASeq:_Quality_control</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=250Amplicon sequencing2015-08-24T11:55:51Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
[[Quality control of fastq read files]]<br />
<br />
-Paired-end read merging<br />
<br />
-Removal of PCR primer sequences<br />
<br />
-Converting fastq files into fasta files<br />
<br />
-De-replication of reads<br />
<br />
-Clustering of reads<br />
<br />
-Removing singletons<br />
<br />
-Selecting representative sequences for clusters<br />
<br />
-Chimera-checking<br />
<br />
-Assigning taxonomical identifiers<br />
<br />
-Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=249Amplicon sequencing2015-08-24T11:54:27Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
[[Quality check of fastq read files]]<br />
<br />
-Paired-end read merging<br />
<br />
-Removal of PCR primer sequences<br />
<br />
-Converting fastq files into fasta files<br />
<br />
-De-replication of reads<br />
<br />
-Clustering of reads<br />
<br />
-Removing singletons<br />
<br />
-Selecting representative sequences for clusters<br />
<br />
-Chimera-checking<br />
<br />
-Assigning taxonomical identifiers<br />
<br />
-Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=248Amplicon sequencing2015-08-24T11:53:35Z<p>Ralfne@uio.no: </p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
</div><br />
<br />
<br />
-Quality check of fastq read files<br />
<br />
-Paired-end read merging<br />
<br />
-Removal of PCR primer sequences<br />
<br />
-Converting fastq files into fasta files<br />
<br />
-De-replication of reads<br />
<br />
-Clustering of reads<br />
<br />
-Removing singletons<br />
<br />
-Selecting representative sequences for clusters<br />
<br />
-Chimera-checking<br />
<br />
-Assigning taxonomical identifiers<br />
<br />
-Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Amplicon_sequencing&diff=247Amplicon sequencing2015-08-24T11:52:33Z<p>Ralfne@uio.no: Created page with "Amplicon sequencing and downstream metagenetic analysis Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many..."</p>
<hr />
<div>Amplicon sequencing and downstream metagenetic analysis<br />
<br />
Often, Illumina amplicon sequencing is used to assess the diversity and composition of microbial populations. For many purposes, the entire downstream bioinformatic analysis may be carried out using only the Qiime pipeline. However, for unusual or very large data, some steps will not work to satisfaction. Hence, alternative bioinformatic programs may have to be found.<br />
<br />
This wiki contains a quick overview of the typical steps in the analysis of amplicon sequencing data. Some alternative programs are also listed, together with reasons to use them instead of Qiime.<br />
<br />
If using Qiime, consult the documentation at http://qiime.org . Qiime is installed on Abel; the default version is v1.5.0, but versions v1.8.0 and v1.9.1 are available as well. Load the default version of Qiime as<br />
<br />
module load qiime <br />
<br />
or the newest version as<br />
<br />
module load qiime/1.9.1<br />
<br />
<br />
<br />
<br />
-Quality check of fastq read files<br />
<br />
-Paired-end read merging<br />
<br />
-Removal of PCR primer sequences<br />
<br />
-Converting fastq files into fasta files<br />
<br />
-De-replication of reads<br />
<br />
-Clustering of reads<br />
<br />
-Removing singletons<br />
<br />
-Selecting representative sequences for clusters<br />
<br />
-Chimera-checking<br />
<br />
-Assigning taxonomical identifiers<br />
<br />
-Analysing OTU tables</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Main_Page&diff=246Main Page2015-08-24T11:51:55Z<p>Ralfne@uio.no: </p>
<hr />
<div>#<span style="font-size:large;">[[BLAST tutorial|BLAST tutorial]]</span><br />
#<span style="font-size:large;">[[PacBio sequencing|PacBio sequencing]]</span><br />
#[[RNASeq_and_differential_gene_expression_analysis|<span style="font-size:large;">RNASeq and differential gene expression analysis</span>]]<br />
#<span style="font-size:large;">[[Amplicon sequencing|Amplicon sequencing]]</span></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=Main_Page&diff=245Main Page2015-08-24T11:50:33Z<p>Ralfne@uio.no: </p>
<hr />
<div>#<span style="font-size:large;">[[BLAST tutorial|BLAST tutorial]]</span><br />
#<span style="font-size:large;">[[PacBio sequencing|PacBio sequencing]]</span><br />
#[[RNASeq_and_differential_gene_expression_analysis|<span style="font-size:large;">RNASeq and differential gene expression analysis</span>]]<br />
#Amplicon sequencing</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Differential_gene_expression_analysis&diff=244RNASeq: Differential gene expression analysis2015-05-28T11:45:41Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
After mapping reads to a reference sequence and obtaining the count data, the differential gene expression analysis determines whether differences in count data are likely to reflect true differences in sample conditions. The details of performing this analysis are described in the two tutorials listed [[RNASeq and differential gene expression analysis|here]].<br />
<br />
= Using R on Abel =<br />
<br />
The INFBIO9120 tutorial uses the R library DESeq for gene expression analysis. The newer library DESeq2 is used in the second tutorial. Both of these libraries are available on Abel.<br />
<br />
In order to start R and load the DESeq (or DESeq2) library on Abel, type:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load R<br />
<br />
R<br />
<br />
library(DESeq)<br />
</div><br />
The second tutorial uses the pasilla dataset. This dataset is not part of the R installation on Abel. To install it, use:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
source("[http://bioconductor.org/biocLite.R http://bioconductor.org/biocLite.R]")<br />
<br />
biocLite("pasilla")<br />
</div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Differential_gene_expression_analysis&diff=243RNASeq: Differential gene expression analysis2015-05-28T11:45:02Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
After mapping reads to a reference sequence and obtaining the count data, the differential gene expression analysis determines whether differences in count data are likely to reflect true differences in sample conditions. The details of performing this analysis are described in the two tutorials listed [[RNASeq and differential gene expression analysis|here]].<br />
<br />
= Using R on Abel =<br />
<br />
The INFBIO9120 tutorial uses the R library DESeq for gene expression analysis. The newer library DESeq2 is used in the second tutorial. Both of these libraries are available on Abel.<br />
<br />
In order to start R and load the DESeq (or DESeq2) library on Abel, type:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
<br/>module load R<br />
<br />
R<br />
<br />
library(DESeq)<br />
</div><br />
<br/>The second tutorial uses the pasilla dataset. This dataset is not part of the R installation on Abel. To install it, use:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
source("[http://bioconductor.org/biocLite.R http://bioconductor.org/biocLite.R]")<br />
<br />
biocLite("pasilla")<br />
</div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Differential_gene_expression_analysis&diff=242RNASeq: Differential gene expression analysis2015-05-28T11:44:12Z<p>Ralfne@uio.no: Created page with "= Introduction = After mapping reads to a reference sequence and obtaining the count data, the differential gene expression analysis will determine whether differences in co..."</p>
<hr />
<div>= Introduction =<br />
<br />
After mapping reads to a reference sequence and obtaining the count data, the differential gene expression analysis determines whether differences in count data are likely to reflect true differences in sample conditions. The details of performing this analysis are described in the two tutorials listed [[RNASeq_and_differential_gene_expression_analysis|here]].<br />
<br />
= Using R on Abel =<br />
<br />
The INFBIO9120 tutorial uses the R library DESeq for gene expression analysis. The newer library DESeq2 is used in the second tutorial. Both of these libraries are available on Abel.<br />
<br />
In order to start R and load the DESeq (or DESeq2) library on Abel, type:<br />
<br />
<br/>module load R <br />
<br />
R <br />
<br />
library(DESeq)<br />
<br />
<br />
<br />
The second tutorial uses the pasilla dataset. This dataset is not part of the R installation on Abel. To install it, use:<br />
<br />
source("[http://bioconductor.org/biocLite.R http://bioconductor.org/biocLite.R]") <br />
<br />
biocLite("pasilla")</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=241RNASeq and differential gene expression analysis2015-05-28T11:42:46Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features of interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UiO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UiO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*[[RNASeq: Quality control|Quality control of sequencing data]]<br />
*[[RNASeq: Mapping reads to a reference sequence|Mapping reads to a reference sequence]]<br />
*[[RNASeq: Visualizing mapped reads|Visualizing mapped reads]]<br />
*[[RNASeq: Obtaining read counts|Obtaining the read counts]]<br />
*[[RNASeq:_Differential_gene_expression_analysis|Differential gene expression analysis]]<br />
*[[RNASeq: Dealing with stranded sequencing data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Obtaining_read_counts&diff=240RNASeq: Obtaining read counts2015-05-28T11:41:37Z<p>Ralfne@uio.no: </p>
<hr />
<div>Read counting means counting the number of reads that map inside a specific annotation feature. The tutorials listed [[RNASeq and differential gene expression analysis|here]] demonstrate read counting as part of differential gene expression analysis using the R library DESeq/DESeq2. Alternatively, reads may be counted with the Python program HTSeq-count; see the manual for instructions ([http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html]).<br />
<br />
Read counting may be CPU-intensive, depending on the size of the BAM file(s) used. It is thus recommended to run this process as a job script on Abel. Such a job script must first load the R module on Abel, subsequently executing an R script containing the read-counting R code. Such a job script may look like:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
<nowiki>#</nowiki>!/bin/bash<br />
<br />
<nowiki>#</nowiki>SBATCH --job-name=my_R_script_name<br />
<br />
<nowiki>#</nowiki>SBATCH --account=myAccountName<br />
<br />
<nowiki>#</nowiki>SBATCH --time=48:00:00<br />
<br />
<nowiki>#</nowiki>SBATCH --mem-per-cpu=3500M<br />
<br />
<nowiki>#</nowiki>SBATCH --nodes=1<br />
<br />
<nowiki>#</nowiki>SBATCH --ntasks-per-node=1<br />
<br />
<br />
<br />
source /cluster/bin/jobsetup<br />
<br />
<br />
<br />
module load R<br />
<br />
R CMD BATCH /path/to/Rscript.R<br />
</div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Obtaining_read_counts&diff=239RNASeq: Obtaining read counts2015-05-28T11:39:52Z<p>Ralfne@uio.no: </p>
<hr />
<div>Read counting means counting the number of reads that map inside a specific annotation feature. The tutorials listed [[RNASeq and differential gene expression analysis|here]] demonstrate read counting as part of differential gene expression analysis using the R library DESeq/DESeq2. Alternatively, reads may be counted with the Python program HTSeq-count; see the manual for instructions ([http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html]).<br />
<br />
Read counting may be CPU-intensive, depending on the size of the BAM file(s) used. It is thus recommended to run this process as a job script on Abel. Such a job script must first load the R module on Abel, subsequently executing an R script containing the read-counting R code. Such a job script may look like:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
<nowiki>#</nowiki>!/bin/bash<br />
<br />
<nowiki>#</nowiki>SBATCH --job-name=my_R_script_name<br />
<br />
<nowiki>#</nowiki>SBATCH --account=myAccountName<br />
<br />
<nowiki>#</nowiki>SBATCH --time=48:00:00<br />
<br />
<nowiki>#</nowiki>SBATCH --mem-per-cpu=3500M<br />
<br />
<nowiki>#</nowiki>SBATCH --nodes=1<br />
<br />
<nowiki>#</nowiki>SBATCH --ntasks-per-node=1<br />
<br />
<br />
<br />
source /cluster/bin/jobsetup<br />
<br />
<br />
<br />
module load R <br />
<br />
R CMD BATCH /path/to/Rscript.R<br />
</div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Obtaining_read_counts&diff=238RNASeq: Obtaining read counts2015-05-28T11:37:46Z<p>Ralfne@uio.no: Created page with "Read counting implies counting the number of reads that map inside a specific annotation feature. The tutorials listed [[RNASeq_and_differential_gene_expression_analysis|here]..."</p>
<hr />
<div>Read counting means counting the number of reads that map inside a specific annotation feature. The tutorials listed [[RNASeq_and_differential_gene_expression_analysis|here]] demonstrate read counting as part of differential gene expression analysis using the R library DESeq/DESeq2. Alternatively, reads may be counted with the Python program HTSeq-count; see the manual for instructions ([http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html]).<br />
<br />
Read counting may be CPU-intensive, depending on the size of the BAM file(s) used. It is thus recommended to run this process as a job script on Abel. Such a job script must first load the R module on Abel, subsequently executing an R script containing the read-counting R code. Such a job script may look like:<br />
<br />
<nowiki>#</nowiki>!/bin/bash<br />
<br />
<nowiki>#</nowiki>SBATCH --job-name=my_R_script_name<br />
<br />
<nowiki>#</nowiki>SBATCH --account=myAccountName<br />
<br />
<nowiki>#</nowiki>SBATCH --time=48:00:00<br />
<br />
<nowiki>#</nowiki>SBATCH --mem-per-cpu=3500M<br />
<br />
<nowiki>#</nowiki>SBATCH --nodes=1<br />
<br />
<nowiki>#</nowiki>SBATCH --ntasks-per-node=1<br />
<br />
source /cluster/bin/jobsetup<br />
<br />
module purge<br />
<br />
module load R<br />
<br />
R CMD BATCH /path/to/Rscript.R</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=237RNASeq and differential gene expression analysis2015-05-28T11:36:11Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*[[RNASeq: Quality control|Quality control of sequencing data]]<br />
*[[RNASeq: Mapping reads to a reference sequence|Mapping reads to a reference sequence]]<br />
*[[RNASeq: Visualizing mapped reads|Visualizing mapped reads]]<br />
*[[RNASeq:_Obtaining_read_counts|Obtaining the read counts]]<br />
*Gene expression analysis<br />
*[[RNASeq: Dealing with stranded sequencing data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Mapping_reads_to_a_reference_sequence&diff=236RNASeq: Mapping reads to a reference sequence2015-05-28T10:01:54Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
One of the most popular read-mapping programs is Bowtie. However, Bowtie cannot map reads that span two exons. To enable mapping of such reads, with the added benefit of being able to confidently annotate exon-intron boundaries, the Tophat program was created. It uses Bowtie as the underlying mapper, but is able to assign reads not mapped directly by Bowtie to distinct exons. For details, please see the Tophat manual ([https://ccb.jhu.edu/software/tophat/manual.shtml https://ccb.jhu.edu/software/tophat/manual.shtml]).<br />
<br />
= Mapping reads with Tophat =<br />
<br />
Since Tophat uses Bowtie to map reads to a reference sequence, both programs must be installed for Tophat to work. Both programs are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.<br />
<br />
== Dealing with spaces in fasta sequence headers ==<br />
<br />
Before starting the mapping process, it is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason is that some programs treat the part of the fasta header before the first space as the identifier and the rest as a comment. This seems to be the case in, for instance, biopython and the samtools programs, which are used by Tophat. Confusingly, the Picard programs that are often used together with samtools treat spaces differently. To avoid this problem, it may be a good idea to remove spaces in fasta headers (see [[SMRT_Analysis:_Mapping_reads_to_a_reference|here]]). Notice, however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.<br />
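The check described above can be sketched with standard shell tools. This is a minimal sketch, not the procedure from the linked page: the function names are hypothetical, and replacing each space with an underscore is only one possible convention (the same renaming would then have to be applied to any GTF file that uses the original names).<br />

```shell
# Count the fasta header lines (lines starting with ">") that contain a space.
count_headers_with_spaces() {
    grep -c '^>.* ' "$1"
}

# Write a copy of the fasta file ($1) to a new file ($2) in which every space
# in a header line is replaced by an underscore. Sequence lines are untouched.
# The underscore is an assumed convention, not something prescribed by Tophat.
replace_header_spaces() {
    sed '/^>/ s/ /_/g' "$1" > "$2"
}
```

A possible use is <span style="font-family:courier new,courier,monospace;">count_headers_with_spaces /path/to/refSequence.fa</span>, followed by <span style="font-family:courier new,courier,monospace;">replace_header_spaces</span> only if the reported count is above zero.<br />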
<br />
== Creating the bowtie index ==<br />
<br />
Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension; rename the file if this is not the case. <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load bowtie2<br />
<br />
bowtie2-build /path/to/refSequence.fa /path/to/refSequence<br />
<br />
ls -l /path/to/<br />
</div><br />
This will build the index, displaying the index files (files ending with a ".bt2" extension).<br />
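Whether the index is complete can also be checked from a script. The following is a minimal sketch with a hypothetical function name; it assumes the six small-index files that <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> normally writes (very large genomes get a ".bt2l" extension instead, which this sketch does not handle) and also checks for the ".fa" file next to the index.<br />

```shell
# Succeed only if all expected Bowtie2 index files, plus the matching .fa
# file, exist for the index base name given as $1 (e.g. /path/to/refSequence).
check_bowtie2_index() {
    cbi_missing=0
    for cbi_suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2 fa; do
        if [ ! -e "$1.$cbi_suffix" ]; then
            # Report each missing file on standard error.
            echo "missing: $1.$cbi_suffix" >&2
            cbi_missing=1
        fi
    done
    return "$cbi_missing"
}
```

For example, <span style="font-family:courier new,courier,monospace;">check_bowtie2_index /path/to/refSequence && echo "index complete"</span>.<br />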
<br />
== Running Tophat ==<br />
<br />
A default execution of Tophat needs an output folder, the reference sequence index files, and at least one read file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load tophat/2.0.14<br />
<br />
tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz<br />
</div><br />
It is often necessary to include more files and options. For instance, a GTF annotation file can be included to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Special care must be taken if using strand-specific sequencing data; see [[RNASeq: Dealing with stranded sequencing data|here]] for an overview.<br />
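As an illustration of such an extended call, the sketch below only prints a Tophat command line for paired-end reads with a GTF annotation, using Tophat's <span style="font-family:courier new,courier,monospace;">-G</span> option; it does not run anything. The function name and all file names are hypothetical, and paths containing spaces are not handled.<br />

```shell
# Print a Tophat command for paired-end data guided by a GTF annotation.
# Arguments: output folder, index base name, GTF file, forward reads, reverse reads.
build_tophat_command() {
    out_dir="$1"; index_base="$2"; gtf="$3"; reads_fwd="$4"; reads_rev="$5"
    # -o sets the output folder, -G supplies the annotation; the index base
    # name is followed by the forward and then the reverse read file.
    echo "tophat -o $out_dir -G $gtf $index_base $reads_fwd $reads_rev"
}
```

The printed line can be reviewed and then pasted into a job script after <span style="font-family:courier new,courier,monospace;">module load tophat/2.0.14</span>.<br />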
<br />
== Estimating the insert size and standard deviation ==<br />
<br />
If using paired-end sequencing, the <span style="font-family:courier new,courier,monospace;">--mate-inner-dist</span> and the <span style="font-family:courier new,courier,monospace;">--mate-std-dev</span> parameters should be specified. Your sequencing provider should be able to report these values. If not, they can be estimated by performing a mapping using Bowtie, and calculating the values using the <span style="font-family:courier new,courier,monospace;">CollectInsertSizeMetrics</span> program in the Picard package. The Bowtie mapping must be done with the transcriptome, not the genome, as the reference sequence (so that introns do not distort the insert size estimates). If a transcriptome file is not available, it can be created from a GFF3 annotation file and the genomic sequence file using the gffread utility. This program is part of the cufflinks module, which can be loaded as:<br />
<br />
<span style="font-family:courier new,courier,monospace;">module load cufflinks</span><br />
<br />
(cufflinks is not displayed when using the module avail statement, but is nevertheless installed on Abel!)<br />
<br />
See the gffread manual for details: [http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility]<br />
<br />
It is not necessary to use all reads to get an insert size estimate. Since each read occupies four lines in a fastq file, the first 10000 reads of a fastq.gz read file can be extracted with:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | head -40000 > /path/to/outputFile.fastq</span><br />
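For paired-end data the same number of reads has to be taken from the forward and the reverse file so that the mates stay in the same order. A minimal sketch of a reusable helper follows; the function name is hypothetical, and <span style="font-family:courier new,courier,monospace;">gzip -dc</span> is used as an equivalent of <span style="font-family:courier new,courier,monospace;">zcat</span>.<br />

```shell
# Write the first N reads of a gz-compressed fastq file to an uncompressed file.
# Arguments: number of reads, input .fastq.gz file, output .fastq file.
first_reads() {
    # A fastq record spans four lines, so N reads correspond to 4*N lines.
    gzip -dc "$2" | head -n "$(($1 * 4))" > "$3"
}
```

Calling <span style="font-family:courier new,courier,monospace;">first_reads 10000</span> once for the forward file and once for the reverse file yields two matching subsets.<br />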
<br />
Use Bowtie together with the transcriptome and the resulting forward and reverse read files so as to create the BAM file.<br />
<br />
In order to obtain the insert size values, do:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load R<br />
<br />
module load picard-tools<br />
<br />
java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt<br />
<br />
more /path/to/outputFile.txt<br />
</div><br />
<br/>This will display the insert size metrics, including the average insert size and the standard deviation. The <span style="font-family:courier new,courier,monospace;">/path/to/histogramFile.pdf</span> file can be transferred to the local machine and viewed in any PDF viewer.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Mapping_reads_to_a_reference_sequence&diff=235RNASeq: Mapping reads to a reference sequence2015-05-28T10:01:02Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
One of the most popular read-mapping programs is Bowtie. However, Bowtie cannot map reads that span two exons. To enable mapping of such reads, with the added benefit of being able to confidently annotate exon-intron boundaries, the Tophat program was created. It uses Bowtie as the underlying mapper, but is able to assign reads not mapped directly by Bowtie to distinct exons. For details, please see the Tophat manual ([https://ccb.jhu.edu/software/tophat/manual.shtml https://ccb.jhu.edu/software/tophat/manual.shtml]).<br />
<br />
= Mapping reads with Tophat =<br />
<br />
Since Tophat uses Bowtie to map reads to a reference sequence, both programs must be installed for Tophat to work. Both programs are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.<br />
<br />
== Dealing with spaces in fasta sequence headers ==<br />
<br />
Before starting the mapping process, it is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason is that some programs treat the part of the fasta header before the first space as the identifier and the rest as a comment. This seems to be the case in, for instance, biopython and the samtools programs, which are used by Tophat. Confusingly, the Picard programs that are often used together with samtools treat spaces differently. To avoid this problem, it may be a good idea to remove spaces in fasta headers (see [[SMRT_Analysis:_Mapping_reads_to_a_reference|here]]). Notice, however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.<br />
<br />
== Creating the bowtie index ==<br />
<br />
Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension; rename the file if this is not the case. <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load bowtie2<br />
<br />
bowtie2-build /path/to/refSequence.fa /path/to/refSequence<br />
<br />
ls -l /path/to/<br />
</div><br />
This will build the index, displaying the index files (files ending with a ".bt2" extension).<br />
<br />
== Running Tophat ==<br />
<br />
A default execution of Tophat needs an output folder, the reference sequence index files, and at least one read file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load tophat/2.0.14<br />
<br />
tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz<br />
</div><br />
It is often necessary to include more files and options. For instance, a GTF annotation file can be included to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Special care must be taken if using strand-specific sequencing data; see [[RNASeq: Dealing with stranded sequencing data|here]] for an overview.<br />
<br />
== Estimating the insert size and standard deviation ==<br />
<br />
If using paired-end sequencing, the <span style="font-family:courier new,courier,monospace;">--mate-inner-dist</span> and the <span style="font-family:courier new,courier,monospace;">--mate-std-dev</span> parameters should be specified. Your sequencing provider should be able to report these values. If not, they can be estimated by performing a mapping using Bowtie, and calculating the values using the <span style="font-family:courier new,courier,monospace;">CollectInsertSizeMetrics</span> program in the Picard package. The Bowtie mapping must be done with the transcriptome, not the genome, as the reference sequence (so that introns do not distort the insert size estimates). If a transcriptome file is not available, it can be created from a GFF3 annotation file and the genomic sequence file using the gffread utility. This program is part of the cufflinks module, which can be loaded as:<br />
<br />
<span style="font-family:courier new,courier,monospace;">module load cufflinks</span><br />
<br />
(cufflinks is not displayed when using the module avail statement, but is nevertheless installed on Abel!)<br />
<br />
See the gffread manual for details: [http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility]<br />
<br />
It is not necessary to use all reads to get an insert size estimate. Since each read occupies four lines in a fastq file, the first 10000 reads of a fastq.gz read file can be extracted with:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | head -40000 > /path/to/outputFile.fastq</span><br />
<br />
Use Bowtie together with the transcriptome and the resulting forward and reverse read files so as to create the BAM file.<br />
<br />
In order to obtain the insert size values, do:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load R<br />
<br />
module load picard-tools<br />
<br />
java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt<br />
<br />
more /path/to/outputFile.txt<br />
</div><br />
<br/>This will display the insert size metrics, including the average insert size and the standard deviation. The <span style="font-family:courier new,courier,monospace;">/path/to/histogramFile.pdf</span> file can be transferred to the local machine and viewed in any PDF viewer.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Mapping_reads_to_a_reference_sequence&diff=234RNASeq: Mapping reads to a reference sequence2015-05-28T09:59:41Z<p>Ralfne@uio.no: Created page with "= Introduction = On of the most popular mapping algorithms is the Bowtie program. However, this program cannot map reads that are spanning two exons. In order to enable mappi..."</p>
<hr />
<div>= Introduction =<br />
<br />
One of the most popular read-mapping programs is Bowtie. However, Bowtie cannot map reads that span two exons. To enable mapping of such reads, with the added benefit of being able to confidently annotate exon-intron boundaries, the Tophat program was created. It uses Bowtie as the underlying mapper, but is able to assign reads not mapped directly by Bowtie to distinct exons. For details, please see the Tophat manual ([https://ccb.jhu.edu/software/tophat/manual.shtml https://ccb.jhu.edu/software/tophat/manual.shtml]).<br />
<br />
= Mapping reads with Tophat =<br />
<br />
Since Tophat uses Bowtie to map reads to a reference sequence, both programs must be installed for Tophat to work. Both programs are present on Abel; previous problems in letting the programs communicate have now been solved. However, the user must run the newest Tophat installation on Abel, not the default one.<br />
<br />
== Dealing with spaces in fasta sequence headers ==<br />
<br />
Before starting the mapping process, it is important to check whether the reference sequence fasta headers contain spaces. If so, this could lead to problems later in the mapping and expression analysis process. The reason is that some programs treat the part of the fasta header before the first space as the identifier and the rest as a comment. This seems to be the case in, for instance, biopython and the samtools programs, which are used by Tophat. Confusingly, the Picard programs that are often used together with samtools treat spaces differently. To avoid this problem, it may be a good idea to remove spaces in fasta headers (see [[SMRT_Analysis:_Mapping_reads_to_a_reference|here]]). Notice, however, that this may break other naming relationships. For instance, the annotation contained in a GTF file may contain fasta header names with spaces. If so, spaces must be removed from the annotation files as well.<br />
<br />
== Creating the bowtie index ==<br />
<br />
Before starting to map reads with Tophat, an index for the reference sequence must be created. Tophat can use both Bowtie and the newer Bowtie2 programs; it is recommended to use Bowtie2. This means also using Bowtie2 in order to create the sequence index. The <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> program must be given the filename of a fasta file containing the reference sequence. It is important that this file ends with a "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension; rename the file if this is not the case. <span style="font-family:courier new,courier,monospace;">bowtie2-build</span> also must be given the output name. In order to avoid problems later on, the output name should be equal to the input sequence file name, excluding the "<span style="font-family:courier new,courier,monospace;">.fa</span>" extension: <br />
<br />
module load bowtie2 <br />
<br />
bowtie2-build /path/to/refSequence.fa /path/to/refSequence<br />
<br />
ls -l /path/to/<br />
<br />
This will build the index, displaying the index files (files ending with a ".bt2" extension).<br />
<br />
== Running Tophat ==<br />
<br />
A default execution of Tophat needs an output folder, the reference sequence index files, and at least one read file:<br />
<br />
module load tophat/2.0.14<br />
<br />
tophat -o /path/to/outputFolder /path/to/refSequence /path/to/readFile.fastq.gz<br />
<br />
It is often necessary to include more files and options. For instance, a GTF annotation file can be included to guide Tophat in the mapping process. Also, if using paired-end sequencing, the forward and reverse read files must both be provided. See the Tophat manual and the tutorials for details. Special care must be taken if using strand-specific sequencing data; see [[RNASeq: Dealing with stranded sequencing data|here]] for an overview.<br />
<br />
== Estimating the insert size and standard deviation ==<br />
<br />
If using paired-end sequencing, the <span style="font-family:courier new,courier,monospace;">--mate-inner-dist</span> and the <span style="font-family:courier new,courier,monospace;">--mate-std-dev</span> parameters should be specified. Your sequencing provider should be able to report these values. If not, they can be estimated by performing a mapping using Bowtie, and calculating the values using the <span style="font-family:courier new,courier,monospace;">CollectInsertSizeMetrics</span> program in the Picard package. The Bowtie mapping must be done with the transcriptome, not the genome, as the reference sequence (so that introns do not distort the insert size estimates). If a transcriptome file is not available, it can be created from a GFF3 annotation file and the genomic sequence file using the gffread utility. This program is part of the cufflinks module, which can be loaded as:<br />
<br />
<span style="font-family:courier new,courier,monospace;">module load cufflinks</span><br />
<br />
(cufflinks is not displayed when using the module avail statement, but is nevertheless installed on Abel!)<br />
<br />
See the gffread manual for details: [http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility]<br />
<br />
It is not necessary to use all reads to get an insert size estimate. Since each read occupies four lines in a fastq file, the first 10000 reads of a fastq.gz read file can be extracted with:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | head -40000 > /path/to/outputFile.fastq</span><br />
<br />
Use Bowtie together with the transcriptome and the resulting forward and reverse read files so as to create the BAM file.<br />
<br />
In order to obtain the insert size values, do:<br />
<br />
module load R <br />
<br />
module load picard-tools <br />
<br />
java -Xmx1G -jar /cluster/software/VERSIONS/picard-tools-1.119/bin/CollectInsertSizeMetrics.jar H=/path/to/histogramFile.pdf I=/path/to/bamFile.bam O=/path/to/outputFile.txt<br />
<br />
more /path/to/outputFile.txt<br />
<br />
<br/>This will display the insert size metrics, including the average insert size and the standard deviation. The <span style="font-family:courier new,courier,monospace;">/path/to/histogramFile.pdf</span> file can be transferred to the local machine and viewed in any PDF viewer.</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=233RNASeq and differential gene expression analysis2015-05-28T09:54:20Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*[[RNASeq: Quality control|Quality control of sequencing data]]<br />
*[[RNASeq:_Mapping_reads_to_a_reference_sequence|Mapping reads to a reference sequence]]<br />
*[[RNASeq: Visualizing mapped reads|Visualizing mapped reads]]<br />
*Obtaining the read counts<br />
*Gene expression analysis<br />
*[[RNASeq: Dealing with stranded sequencing data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Quality_control&diff=232RNASeq: Quality control2015-05-27T13:06:35Z<p>Ralfne@uio.no: </p>
<hr />
<div>Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.<br />
<br />
= Viewing fastq read files =<br />
<br />
In order just to get a glimpse at the sequencing data, standard text commands can be used on Abel, such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span>. Since the read files often are compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), special text commands can be used instead. Therefore, the read files do not have to be unzipped before usage (also, most bioinformatic programs working with fastq read files will accept gz-compressed files directly). Possible text commands include <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> (the latter can thus be used both for compressed and un-compressed text data).<br />
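Beyond viewing the data, the same tools can report how many reads a compressed file holds. The following is a minimal sketch with a hypothetical function name; it assumes a well-formed fastq file with exactly four lines per read and uses <span style="font-family:courier new,courier,monospace;">gzip -dc</span> as an equivalent of <span style="font-family:courier new,courier,monospace;">zcat</span>.<br />

```shell
# Print the number of reads in a gz-compressed fastq file ($1).
count_reads() {
    cr_lines=$(gzip -dc "$1" | wc -l)
    # Every fastq record consists of four lines.
    echo $((cr_lines / 4))
}
```

For example, <span style="font-family:courier new,courier,monospace;">count_reads /path/to/nameOfReadFile.fastq.gz</span>.<br />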
<br />
= Generating quality reports =<br />
<br />
The <span style="font-family:courier new,courier,monospace;">fastqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load fastqc<br />
<br />
fastqc /path/to/nameOfReadFile.fastq<br />
</div><br />
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).<br />
<br />
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a browser, or log onto Abel using the X11 system and open a browser on Abel directly (see [[SMRT Analysis: Viewing HTML reports on Abel|here]] for a guide).<br />
<br />
= Trimming low-quality reads =<br />
<br />
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.<br />
<br />
In order to download and install it on Abel, execute:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
mkdir trimmomatic<br />
<br />
cd trimmomatic<br />
<br />
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip]<br />
<br />
unzip Trimmomatic-0.33.zip<br />
</div><br />
<br/>In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
cd Trimmomatic-0.33<br />
<br />
java -jar trimmomatic-0.33.jar<br />
</div><br />
<br/>See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].<br />
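<br />
To make the meaning of a quality cut-off concrete: a phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so a score of 20 means 1 error in 100 and a score of 30 means 1 error in 1000. The sketch below converts a few scores to error probabilities; the function name <span style="font-family:courier new,courier,monospace;">phred_to_error</span> and the scores chosen are only illustrative.<br />

```shell
# Convert a phred quality score Q into the estimated base-call error
# probability P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Illustrative scores, from a very permissive cut-off to a strict one.
for q in 3 15 20 30; do
    echo "Q$q -> $(phred_to_error "$q")"
done
```

Comparing the error probability of a candidate cut-off with what you are willing to accept can help when choosing trimming values.<br />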
<br />
= Assessing the effects of quality trimming =<br />
<br />
After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If a large amount of data is lost, the quality trimming should possibly be repeated with less stringent values (ideally, of course, one should decide a priori on acceptable quality values and stick with them).<br />
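<br />
One possible way to get both numbers (reads and characters) at once is sketched below. It assumes the common fastq layout with exactly four lines per record, so that every second line of a record is the sequence; the function name <span style="font-family:courier new,courier,monospace;">fastq_stats</span> and the tiny "before" and "after" files are made up for illustration.<br />

```shell
# Print "<reads> <bases>" for a gz-compressed fastq file.
# Assumes 4 lines per record (sequences are not wrapped over several lines).
fastq_stats() {
    zcat "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                     END { print reads + 0, bases + 0 }'
}

# Made-up "before" and "after" files standing in for real read files.
work_dir=$(mktemp -d)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nGGCCTT\n+\nIIII##\n@r3\nTTTTTT\n+\n######\n' | gzip > "$work_dir/before.fastq.gz"
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$work_dir/after.fastq.gz"

before_stats=$(fastq_stats "$work_dir/before.fastq.gz")
after_stats=$(fastq_stats "$work_dir/after.fastq.gz")
echo "before trimming: $before_stats (reads bases)"
echo "after trimming:  $after_stats (reads bases)"
```

A drop in the read count shows that entire reads were discarded, while a drop in bases per read shows that reads were shortened.<br />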
<br />
One simple thing to do is to compare the read file sizes before and after trimming. Use the "<span style="font-family:courier new,courier,monospace;">ls -lh</span>" command so as to get better human-readable file size numbers.<br />
<br />
This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | grep "^+$" | wc -l</span><br />
<br />
(Here, the grep command reads the uncompressed fastq stream and selects the lines consisting only of the "+" character, which the fastq format uses to separate the sequence from the quality values. These lines are passed to the wc command, whose line count equals the number of fastq records. This assumes that the separator lines carry no read name after the "+".)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Visualizing_mapped_reads&diff=231RNASeq: Visualizing mapped reads2015-05-27T13:05:47Z<p>Ralfne@uio.no: </p>
<hr />
<div>After mapping reads to a reference sequence, it is often informative to look at the resulting BAM file. Software used for this kind of visualization must display the reads along the reference sequence, together with relevant annotation (typically exons and genes). Two popular programs used for this are Tablet ([http://ics.hutton.ac.uk/tablet http://ics.hutton.ac.uk/tablet]) and the Integrated Genome Viewer (IGV - [https://www.broadinstitute.org/igv https://www.broadinstitute.org/igv]). Tablet is a light-weight, user-friendly program that does the above and not much more. IGV has lots of additional features, but is not as intuitive to use as Tablet is.<br />
<br />
A HPC cluster such as Abel must be used for mapping with Tophat; visualization however typically will take place on the local machine. This means that a large BAM file (possibly many gigabytes) must be downloaded from Abel. This may take hours, depending on the network conditions. There are two possible shortcuts that can make this process easier.<br />
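<br />
To put the transfer time in perspective, the sketch below estimates how long a download takes for a given file size and sustained network throughput. The function name <span style="font-family:courier new,courier,monospace;">transfer_hours</span>, the 20 GB file size and the 10 Mbit/s throughput are hypothetical values chosen for illustration, not measurements from Abel.<br />

```shell
# Estimate the transfer time in hours for a file of a given size (gigabytes)
# over a link with a given sustained throughput (megabits per second).
transfer_hours() {
    awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%.1f\n", gb * 8 * 1000 / mbit / 3600 }'
}

# Hypothetical example: a 20 GB BAM file over a 10 Mbit/s connection.
transfer_hours 20 10
```

With these assumed numbers the download takes well over four hours, which is why reducing the file size, or avoiding the transfer altogether, is worthwhile.<br />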
<br />
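If you are only interested in manually checking the expression of a few genes that all lie on the same contig, you can extract only the reads mapping to this contig from the BAM file (for this region query to work, the input BAM file must be coordinate-sorted and indexed with <span style="font-family:courier new,courier,monospace;">samtools index</span>):<br />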
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load samtools<br />
<br />
samtools view -b /path/to/inputBamfile.bam nameOfContig > /path/to/outputBamfile.bam<br />
</div><br />
Before continuing, remember to create an index for the new BAM file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
samtools index /path/to/outputBamfile.bam<br />
</div><br />
Depending on the number of contigs in the reference sequence, the <span style="font-family:courier new,courier,monospace;">outputBamfile.bam</span> may be less than 1% of the size of the original BAM file. When displaying the new BAM file on the local machine, the original reference fasta file and annotation file can be used (these are often much smaller than the BAM file; it is not worthwhile extracting only the contig in question from these files).<br />
<br />
The second option is to install a visualization program such as Tablet on Abel, using the X11 system to run it directly from Abel. In this way, the BAM file does not have to be transferred at all. This requires installing X11 software such as Xming on the local machine; see [[SMRT_Analysis:_Viewing_HTML_reports_on_Abel|here]] for a brief walkthrough. Installing and using Tablet via X11 from Abel works and does not require special permissions on Abel; it is however quite slow to use. Whether other visualization programs can also be installed on Abel has not been investigated.<br />
<br />
The procedure for installing Tablet on Abel is as follows:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
mkdir tabletInstallation<br />
<br />
cd tabletInstallation<br />
<br />
wget [http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh]<br />
<br />
chmod u+x tablet_linux_x64_1_14_10_21.sh<br />
<br />
./tablet_linux_x64_1_14_10_21.sh<br />
</div><br />
The installation script will suggest installing Tablet into the "Tablet" folder. If you accept this, Tablet is started by stepping into this folder and executing "<span style="font-family:courier new,courier,monospace;">./tablet</span>".</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Quality_control&diff=230RNASeq: Quality control2015-05-27T13:04:47Z<p>Ralfne@uio.no: </p>
<hr />
<div>Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.<br />
<br />
<br />
<br />
= Viewing fastq read files =<br />
<br />
To get a quick look at the sequencing data, standard text commands such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span> can be used on Abel. Read files are often compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), but they do not have to be unzipped before viewing: commands such as <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> read gz-compressed text directly (the latter can thus be used for both compressed and uncompressed text data). Most bioinformatics programs working with fastq read files will also accept gz-compressed files directly.<br />
<br />
= Generating quality reports =<br />
<br />
The <span style="font-family:courier new,courier,monospace;">fastqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load fastqc<br />
<br />
fastqc /path/to/nameOfReadFile.fastq<br />
</div><br />
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).<br />
<br />
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a browser, or log onto Abel using the X11 system and open a browser on Abel directly (see [[SMRT_Analysis:_Viewing_HTML_reports_on_Abel|here]] for a guide).<br />
<br />
= Trimming low-quality reads =<br />
<br />
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.<br />
<br />
In order to download and install it on Abel, execute:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
mkdir trimmomatic<br />
<br />
cd trimmomatic<br />
<br />
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip]<br />
<br />
unzip Trimmomatic-0.33.zip<br />
</div><br />
<br/>In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
cd Trimmomatic-0.33<br />
<br />
java -jar trimmomatic-0.33.jar<br />
</div><br />
<br/>See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].<br />
<br />
= Assessing the effects of quality trimming =<br />
<br />
After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If a large amount of data is lost, the quality trimming should possibly be repeated with less stringent values (ideally, of course, one should decide a priori on acceptable quality values and stick with them).<br />
<br />
One simple thing to do is to compare the read file sizes before and after trimming. Use the "<span style="font-family:courier new,courier,monospace;">ls -lh</span>" command so as to get better human-readable file size numbers.<br />
<br />
This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | grep "^+$" | wc -l</span><br />
<br />
(Here, the grep command reads the uncompressed fastq stream and selects the lines consisting only of the "+" character, which the fastq format uses to separate the sequence from the quality values. These lines are passed to the wc command, whose line count equals the number of fastq records. This assumes that the separator lines carry no read name after the "+".)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Quality_control&diff=229RNASeq: Quality control2015-05-27T13:03:22Z<p>Ralfne@uio.no: </p>
<hr />
<div>Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads.<br />
<br />
<br />
<br />
= Viewing fastq read files =<br />
<br />
To get a quick look at the sequencing data, standard text commands such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span> can be used on Abel. Read files are often compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), but they do not have to be unzipped before viewing: commands such as <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> read gz-compressed text directly (the latter can thus be used for both compressed and uncompressed text data). Most bioinformatics programs working with fastq read files will also accept gz-compressed files directly.<br />
<br />
= Generating quality reports =<br />
<br />
The <span style="font-family:courier new,courier,monospace;">fastqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load fastqc<br />
<br />
fastqc /path/to/nameOfReadFile.fastq<br />
</div><br />
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use).<br />
<br />
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a browser, or log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide).<br />
<br />
= Trimming low-quality reads =<br />
<br />
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel.<br />
<br />
In order to download and install it on Abel, execute:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
mkdir trimmomatic<br />
<br />
cd trimmomatic<br />
<br />
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip]<br />
<br />
unzip Trimmomatic-0.33.zip<br />
</div><br />
<br/>In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
cd Trimmomatic-0.33<br />
<br />
java -jar trimmomatic-0.33.jar<br />
</div><br />
<br/>See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].<br />
<br />
= Assessing the effects of quality trimming =<br />
<br />
After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If a large amount of data is lost, the quality trimming should possibly be repeated with less stringent values (ideally, of course, one should decide a priori on acceptable quality values and stick with them).<br />
<br />
One simple thing to do is to compare the read file sizes before and after trimming. Use the "<span style="font-family:courier new,courier,monospace;">ls -lh</span>" command so as to get better human-readable file size numbers.<br />
<br />
This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | grep "^+$" | wc -l</span><br />
<br />
(Here, the grep command reads the uncompressed fastq stream and selects the lines consisting only of the "+" character, which the fastq format uses to separate the sequence from the quality values. These lines are passed to the wc command, whose line count equals the number of fastq records. This assumes that the separator lines carry no read name after the "+".)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Quality_control&diff=228RNASeq: Quality control2015-05-27T12:58:51Z<p>Ralfne@uio.no: Created page with "Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads. = Viewing fastq read files = In order..."</p>
<hr />
<div>Before mapping reads to a reference sequence, it is important to assess the quality of the reads, possibly removing low-quality reads. <br />
<br />
= Viewing fastq read files =<br />
<br />
To get a quick look at the sequencing data, standard text commands such as <span style="font-family:courier new,courier,monospace;">less</span>, <span style="font-family:courier new,courier,monospace;">more</span> or <span style="font-family:courier new,courier,monospace;">cat</span> can be used on Abel. Read files are often compressed (usually ending in a "<span style="font-family:courier new,courier,monospace;">.gz</span>" extension), but they do not have to be unzipped before viewing: commands such as <span style="font-family:courier new,courier,monospace;">zmore</span>, <span style="font-family:courier new,courier,monospace;">zcat</span> and <span style="font-family:courier new,courier,monospace;">less</span> read gz-compressed text directly (the latter can thus be used for both compressed and uncompressed text data). Most bioinformatics programs working with fastq read files will also accept gz-compressed files directly. <br />
<br />
= Generating quality reports =<br />
<br />
The <span style="font-family:courier new,courier,monospace;">fastqc</span> program can be used to generate quality reports for fastq sequencing data. This program is installed on Abel, and can be used as follows:<br />
<br />
module load fastqc <br />
<br />
fastqc /path/to/nameOfReadFile.fastq<br />
<br />
(Note: <span style="font-family:courier new,courier,monospace;">fastqc</span> also accepts gz-compressed read files; i.e. the read file does not have to be unzipped before use). <br />
<br />
Executing the fastqc program only takes seconds even for large fastq read files; it is not necessary to start a job on Abel for this. Doing this directly from the Abel front-end should not pose any problems; alternatively you can use freebee. The "<span style="font-family:courier new,courier,monospace;">/path/to</span>" folder will now contain two new, fastqc-generated files: <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.html</span> and <span style="font-family:courier new,courier,monospace;">nameOfReadFile_fastqc.zip</span>. The <span style="font-family:courier new,courier,monospace;">*.html</span> file contains the quality report that can be viewed in a browser; the <span style="font-family:courier new,courier,monospace;">*.zip</span> file contains the same information in a zipped format. To view this report, either copy the <span style="font-family:courier new,courier,monospace;">*.html</span> file over to your local machine and open it in a browser, or log onto Abel using the X11 system and open a browser on Abel directly (see here for a guide). <br />
<br />
= Trimming low-quality reads =<br />
<br />
If the fastqc quality report has revealed low-quality reads (or sections of poor qualities in otherwise good reads), the <span style="font-family:courier new,courier,monospace;">Trimmomatic</span> program ([http://www.usadellab.org/cms/?page=trimmomatic http://www.usadellab.org/cms/?page=trimmomatic]) will remove entire low-quality reads, or delete the low-quality part of reads. This program is not installed on Abel. <br />
<br />
In order to download and install it on Abel, execute: <br />
<br />
mkdir trimmomatic <br />
<br />
cd trimmomatic <br />
<br />
wget [http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip] <br />
<br />
unzip Trimmomatic-0.33.zip<br />
<br />
<br />
<br />
In order to execute <span style="font-family:courier new,courier,monospace;">Trimmomatic</span>, step into the "<span style="font-family:courier new,courier,monospace;">Trimmomatic-0.33</span>" sub-folder and use Java to start the .jar file: <br />
<br />
<br />
<br />
cd Trimmomatic-0.33 <br />
<br />
java -jar trimmomatic-0.33.jar<br />
<br />
<br />
<br />
See the Trimmomatic web page for examples of how to use the program. Note that you can pass in several read files at once - you do not have to run Trimmomatic repeatedly if you have many read files. It is interesting to see the choice of quality cut-off values used on that web page - these example values are much lower than what fastqc suggests are acceptable scores. If in doubt about which values to use, consult the wikipedia page for the precise meaning of phred quality scores: [http://en.wikipedia.org/wiki/Phred_quality_score http://en.wikipedia.org/wiki/Phred_quality_score].<br />
<br />
= Assessing the effects of quality trimming =<br />
<br />
After removing low-quality reads, it is a good idea to assess how many reads and characters were removed from your read files. If a large amount of data is lost, the quality trimming should possibly be repeated with less stringent values (ideally, of course, one should decide a priori on acceptable quality values and stick with them).<br />
<br />
One simple thing to do is to compare the read file sizes before and after trimming. Use the "<span style="font-family:courier new,courier,monospace;">ls -lh</span>" command so as to get better human-readable file size numbers.<br />
<br />
This will tell you how much data was lost. However, losses could occur either by removing (relatively few) entire reads, or alternatively the trimming of relatively many reads. In order to understand exactly what has happened, the number of fastq records before and after quality trimming can be compared:<br />
<br />
<span style="font-family:courier new,courier,monospace;">zcat /path/to/readFile.fastq.gz | grep "^+$" | wc -l</span><br />
<br />
(Here, the grep command reads the uncompressed fastq stream and selects the lines consisting only of the "+" character, which the fastq format uses to separate the sequence from the quality values. These lines are passed to the wc command, whose line count equals the number of fastq records. This assumes that the separator lines carry no read name after the "+".)</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=227RNASeq and differential gene expression analysis2015-05-27T12:52:17Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq implies obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UiO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UiO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*[[RNASeq:_Quality_control|Quality control of sequencing data]]<br />
*Mapping reads to a genome<br />
*[[RNASeq: Visualizing mapped reads|Visualizing mapped reads]]<br />
*Obtaining the read counts<br />
*Gene expression analysis<br />
*[[RNASeq: Dealing with stranded sequencing data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Visualizing_mapped_reads&diff=226RNASeq: Visualizing mapped reads2015-05-27T11:38:23Z<p>Ralfne@uio.no: </p>
<hr />
<div>After mapping reads to a reference sequence, it is often informative to look at the resulting BAM file. Software used for this kind of visualization must display the reads along the reference sequence, together with relevant annotation (typically exons and genes). Two popular programs used for this are Tablet ([http://ics.hutton.ac.uk/tablet http://ics.hutton.ac.uk/tablet]) and the Integrated Genome Viewer (IGV - [https://www.broadinstitute.org/igv https://www.broadinstitute.org/igv]). Tablet is a light-weight, user-friendly program that does the above and not much more. IGV has lots of additional features, but is not as intuitive to use as Tablet is.<br />
<br />
A HPC cluster such as Abel must be used for mapping with Tophat; visualization however typically will take place on the local machine. This means that a large BAM file (possibly many gigabytes) must be downloaded from Abel. This may take hours, depending on the network conditions. There are two possible shortcuts that can make this process easier.<br />
<br />
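If you are only interested in manually checking the expression of a few genes that all lie on the same contig, you can extract only the reads mapping to this contig from the BAM file (for this region query to work, the input BAM file must be coordinate-sorted and indexed with <span style="font-family:courier new,courier,monospace;">samtools index</span>):<br />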
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
module load samtools<br />
<br />
samtools view -b /path/to/inputBamfile.bam nameOfContig > /path/to/outputBamfile.bam<br />
</div><br />
Before continuing, remember to create an index for the new BAM file:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
samtools index /path/to/outputBamfile.bam<br />
</div><br />
Depending on the number of contigs in the reference sequence, the <span style="font-family:courier new,courier,monospace;">outputBamfile.bam</span> may be less than 1% of the size of the original BAM file. When displaying the new BAM file on the local machine, the original reference fasta file and annotation file can be used (these are often much smaller than the BAM file; it is not worthwhile extracting only the contig in question from these files).<br />
<br />
The second option is to install a visualization program such as Tablet on Abel, using the X11 system to run it directly from Abel. In this way, the BAM file does not have to be transferred at all. This requires installing X11 software such as Xming on the local machine; see here for a brief walkthrough. Installing and using Tablet via X11 from Abel works and does not require special permissions on Abel; it is however quite slow to use. Whether other visualization programs can also be installed on Abel has not been investigated.<br />
<br />
The procedure for installing Tablet on Abel is as follows:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
mkdir tabletInstallation<br />
<br />
cd tabletInstallation<br />
<br />
wget [http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh]<br />
<br />
chmod u+x tablet_linux_x64_1_14_10_21.sh<br />
<br />
./tablet_linux_x64_1_14_10_21.sh<br />
</div><br />
The installation script will suggest installing Tablet into the "Tablet" folder. If you accept this, Tablet is started by stepping into this folder and executing "<span style="font-family:courier new,courier,monospace;">./tablet</span>".</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Visualizing_mapped_reads&diff=225RNASeq: Visualizing mapped reads2015-05-27T11:36:31Z<p>Ralfne@uio.no: Created page with "After mapping reads to a reference sequence, it is often informative to look at the resulting BAM file. Software used for this kind of visualization must display the reads alo..."</p>
<hr />
<div>After mapping reads to a reference sequence, it is often informative to look at the resulting BAM file. Software used for this kind of visualization must display the reads along the reference sequence, together with relevant annotation (typically exons and genes). Two popular programs used for this are Tablet ([http://ics.hutton.ac.uk/tablet http://ics.hutton.ac.uk/tablet]) and the Integrative Genomics Viewer (IGV - [https://www.broadinstitute.org/igv https://www.broadinstitute.org/igv]). Tablet is a light-weight, user-friendly program that does the above and not much more. IGV has many additional features, but is not as intuitive to use as Tablet. <br />
<br />
An HPC cluster such as Abel must be used for mapping with Tophat; visualization, however, typically takes place on the local machine. This means that a large BAM file (possibly many gigabytes) must be downloaded from Abel. This may take hours, depending on the network conditions. There are two possible shortcuts that can make this process easier. <br />
<br />
If you are only interested in manually checking the expression of a few genes that are all contained in the same contig, it is possible to extract only the reads mapping to this contig from the BAM file (this requires the input BAM file to be coordinate-sorted and indexed): <br />
<br />
module load samtools <br />
<br />
samtools view -b /path/to/inputBamfile.bam nameOfContig > /path/to/outputBamfile.bam <br />
<br />
Before continuing, remember to create an index for the new BAM file: <br />
<br />
samtools index /path/to/outputBamfile.bam <br />
<br />
Depending on the number of contigs in the reference sequence, the <span style="font-family:courier new,courier,monospace;">outputBamfile.bam</span> may be less than 1% of the size of the original BAM file. When displaying the new BAM file on the local machine, the original reference fasta file and annotation file can be used (these are often much smaller than the BAM file; it is not worthwhile extracting only the contig in question from these files). <br />
<br />
The second option is to install visualization software such as Tablet on Abel, using the X11 system to run it directly from Abel. In this way, the BAM file does not have to be transferred at all. This means installing X11 software such as Xming; see here for a brief walkthrough. Installing and using Tablet via X11 from Abel works and does not require special permissions on Abel; it is, however, quite slow to use. Whether other visualization programs can also be installed on Abel has not been investigated. <br />
<br />
The procedure for installing Tablet on Abel is as follows:<br />
<br />
mkdir tabletInstallation <br />
<br />
cd tabletInstallation <br />
<br />
wget [http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh http://bioinf.hutton.ac.uk/tablet/installers/tablet_linux_x64_1_14_10_21.sh] <br />
<br />
chmod u+x tablet_linux_x64_1_14_10_21.sh <br />
<br />
./tablet_linux_x64_1_14_10_21.sh<br />
<br />
The installation script will suggest installing Tablet into the "Tablet" folder. If you accept this, Tablet is started by stepping into this folder and executing "<span style="font-family:courier new,courier,monospace;">./tablet</span>".</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=224RNASeq and differential gene expression analysis2015-05-27T11:33:02Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq implies obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*Quality control of sequencing data<br />
*Mapping reads to a genome<br />
*[[RNASeq:_Visualizing_mapped_reads|Visualizing mapped reads]]<br />
*Obtaining the read counts<br />
*Gene expression analysis<br />
*[[RNASeq: Dealing with stranded sequencing data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Dealing_with_stranded_sequencing_data&diff=223RNASeq: Dealing with stranded sequencing data2015-05-26T14:15:26Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
When using strand-specific sequencing, it is very important to make sure the coverages for the features-of-interest (usually genes) are calculated along the correct strand. Getting this wrong means that the subsequent step of finding differentially expressed genes will instead find differentially expressed anti-sense transcripts. It may not be easy to identify such a problem at a later stage, since all subsequent information consists of abstract numbers and identifiers.<br />
<br />
The correct way of counting strand-specific sequencing data depends on:<br />
<br />
*the type of molecule sequenced (for instance, cDNAs)<br />
*the sequencing technology used (for instance, Illumina)<br />
*the particular sequencing kit used (for instance, Illumina TruSeq Stranded Total RNA Sample Prep Kit with paired-end sequencing)<br />
*the settings used when mapping the reads to the reference. For the Tophat aligner, the two critical parameters are --library-type, and the order of passing in the forward and reverse read files.<br />
*the settings used when calculating the read counts.<br />
<br />
= Example =<br />
<br />
The following describes a situation where the Illumina TruSeq kit was used to obtain stranded paired-end RNASeq data generated from cDNA molecules.<br />
<br />
== Mapping the RNASeq reads to the reference sequence ==<br />
<br />
Since sequencing was done using the TruSeq kit, the Tophat <span style="font-family:courier new,courier,monospace;">--library-type</span> parameter must be set to <span style="font-family:courier new,courier,monospace;">fr-firststrand</span>. Also, the reverse read file (<span style="font-family:courier new,courier,monospace;">R2.fastq.gz</span>) is passed into Tophat before the forward read file (<span style="font-family:courier new,courier,monospace;">R1.fastq.gz</span>). This is according to the INFBIO9120 tutorial, which states:<br />
<br />
Please note that we are providing R2 files first and R1 reads last, although the Tophat2 documentation says we should provide R1 first and R2 second. We do this because the protocol used to produce these data sequence the cDNA, and by switching R1 and R2 we will provide the correct strand orientation to the aligner.<br />
<br />
The command to execute thus becomes:<br />
<br />
<span style="font-family:courier new,courier,monospace;">tophat -o /path/to/outputFolder --library-type fr-firststrand --GTF /path/to/annotation.gtf /path/to/referenceSequenceBowtieIndex /path/to/R2.fastq.gz /path/to/R1.fastq.gz > /path/to/log.txt</span><br />
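The read-order swap is easy to get wrong when the command is typed by hand for many samples. As a minimal sketch (the function name <code>tophat_command</code> and its <code>swap_reads</code> parameter are hypothetical; only the Tophat options shown in the command above are used), the argument list could be built like this:

```python
def tophat_command(output_dir, gtf, bowtie_index, r1, r2,
                   library_type="fr-firststrand", swap_reads=True):
    """Build the Tophat argument list used in this example.

    With swap_reads=True the R2 file is passed before the R1 file,
    as described in the INFBIO9120 tutorial quoted above.
    """
    first, second = (r2, r1) if swap_reads else (r1, r2)
    return [
        "tophat",
        "-o", output_dir,
        "--library-type", library_type,
        "--GTF", gtf,
        bowtie_index,
        first,
        second,
    ]


# Prints the same command as above, without the redirection of the log output.
print(" ".join(tophat_command(
    "/path/to/outputFolder",
    "/path/to/annotation.gtf",
    "/path/to/referenceSequenceBowtieIndex",
    "/path/to/R1.fastq.gz",
    "/path/to/R2.fastq.gz",
)))
```

Keeping the swap behind a single named parameter makes it explicit which read order a given run used.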
<br />
== Visualizing the mapped reads ==<br />
<br />
Even though not strictly required, it is a good idea to visualize the mapped reads, so as to better understand the subsequent steps. In this example, the ordering of the read pairs along the reference sequence determines which strand the sequenced fragment originated from. For genes on the sense strand (i.e. for genes on the "+" strand, with their 5' ends left and their 3' ends right), the forward read (R1) is the first (left-most) read of the read pair. In Tablet, this is visualized with the "green" R1 reads coming before the "blue" R2 reads (of the same read pair):<br />
<br />
[[File:Tablet stranded reads.jpg]]<br />
<br />
(In IGV, the read ordering will be designated as "F1R2"). The R1 reads will be identical to the reference sequence; the R2 reads will be the reverse-complement of the reference sequence. For genes on the anti-sense strand (the "-" strand), the read ordering will be the opposite. In Tablet, the "blue" R2 read of a given read pair will come before (i.e. be situated left of) the "green" R1 read. (In IGV, the read ordering will be designated as "F2R1").<br />
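The rule described above can be written down as a small function. This is only an illustrative sketch for the setup in this example (TruSeq data, <code>fr-firststrand</code>, R2 passed to Tophat before R1); the function name and its arguments are hypothetical, and the start positions are assumed to be the left-most reference coordinates of the two reads of one pair.

```python
def strand_from_pair_order(r1_start, r2_start):
    """Infer the strand of the sequenced fragment from the order of a read pair.

    Returns "+" when R1 is the left-most read (designated "F1R2" in IGV),
    "-" when R2 is the left-most read ("F2R1"), and None when both reads
    start at the same position and the order cannot be decided.
    """
    if r1_start < r2_start:
        return "+"
    if r2_start < r1_start:
        return "-"
    return None


# R1 starts left of R2: the fragment stems from a gene on the "+" strand.
print(strand_from_pair_order(1000, 1150))
# R2 starts left of R1: the fragment stems from a gene on the "-" strand.
print(strand_from_pair_order(2300, 2100))
```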
<br />
== Counting reads ==<br />
<br />
=== Using R and the summarizeOverlaps function ===<br />
<br />
If using R, the "<span style="font-family:courier new,courier,monospace;">summarizeOverlaps</span>" method (used in the INFBIO9120 tutorial) will now give correct read counts. This can be verified as follows:<br />
<br />
Using Tablet (or IGV), identify a region with only a few reads, where all the read pairs are present in the same orientation (i.e. transcription is only seen for one strand). Create a test annotation file that spans this region. In the following gff3 example, replace <span style="font-family:courier new,courier,monospace;">&lt;ContigName&gt;</span> with the name of the contig containing the selected region, and <span style="font-family:courier new,courier,monospace;">&lt;from&gt;</span> and <span style="font-family:courier new,courier,monospace;">&lt;to&gt;</span> with the numeric coordinates defining the region:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
&lt;ContigName&gt; testSource mRNA &lt;from&gt; &lt;to&gt; . + . ID=ID1;Parent=ID1;Name=ID1<br />
<br />
&lt;ContigName&gt; testSource exon &lt;from&gt; &lt;to&gt; . + . Parent=ID1<br />
</div><br />
(Use Tablet to make sure the annotation is created correctly)<br />
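GFF3 columns are separated by tab characters, which are easily lost when copying the example from a web page. As a minimal sketch, the two records could be generated instead of typed; the function name <code>gff3_test_annotation</code> and the values used in the call are hypothetical, and the attribute column simply mirrors the example above.

```python
def gff3_test_annotation(contig, start, end, strand="+",
                         feature_id="ID1", source="testSource"):
    """Return the mRNA and exon records of the test annotation as GFF3 text."""
    if start < 1 or end < start:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    mrna = [contig, source, "mRNA", str(start), str(end), ".", strand, ".",
            f"ID={feature_id};Parent={feature_id};Name={feature_id}"]
    exon = [contig, source, "exon", str(start), str(end), ".", strand, ".",
            f"Parent={feature_id}"]
    return "\t".join(mrna) + "\n" + "\t".join(exon) + "\n"


# Example with made-up values; write the text to a .gff3 file to use it.
print(gff3_test_annotation("contig1", 100, 500), end="")
```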
<br />
The following R code will count the reads contained in the defined region:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
library(ShortRead)<br />
<br />
library(rtracklayer)<br />
<br />
aln <- readGAlignmentPairsFromBam('/path/to/BAMfile.bam', use.names=T)<br />
<br />
genes<-import('/path/to/testAnnotationFile.gff3', asRangedData=FALSE)<br />
<br />
splitgenes<-split(genes,genes$ID)<br />
<br />
genehits<-summarizeOverlaps(splitgenes,aln,mode="Union",singleEnd=FALSE, ignore.strand=FALSE)<br />
<br />
counts<-assays(genehits)$counts<br />
<br />
counts<br />
</div><br />
This will print the number of reads (or rather, the number of fragments) that map to the defined region (it will be zero if the reads stem from the anti-sense strand, since the test annotation was created for the "+" strand).<br />
<br />
=== Using HTSeq-count ===<br />
<br />
If using the HTSeq-count python program for read counting, the <span style="font-family:courier new,courier,monospace;">--stranded</span> parameter must be set to "<span style="font-family:courier new,courier,monospace;">yes</span>" to count the stranded data in our example correctly. Alternatively, Tophat can be used with the R1 reads before the R2 reads, followed by import into HTSeq-count using the <span style="font-family:courier new,courier,monospace;">--stranded=reverse</span> parameter (this seems to be the standard way of using HTSeq, according to the creators of HTSeq).</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Dealing_with_stranded_sequencing_data&diff=222RNASeq: Dealing with stranded sequencing data2015-05-26T14:14:12Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
When using strand-specific sequencing, it is very important to make sure the coverages for the features-of-interest (usually genes) are calculated along the correct strand. Getting this wrong means that the subsequent step of finding differentially expressed genes will instead find differentially expressed anti-sense transcripts. It may not be easy to identify such a problem at a later stage, since all subsequent information consists of abstract numbers and identifiers.<br />
<br />
The correct way of counting strand-specific sequencing data depends on:<br />
<br />
*the type of molecule sequenced (for instance, cDNAs)<br />
*the sequencing technology used (for instance, Illumina)<br />
*the particular sequencing kit used (for instance, Illumina TruSeq Stranded Total RNA Sample Prep Kit with paired-end sequencing)<br />
*the settings used when mapping the reads to the reference. For the Tophat aligner, the two critical parameters are --library-type, and the order of passing in the forward and reverse read files.<br />
*the settings used when calculating the read counts.<br />
<br />
= Example =<br />
<br />
The following describes a situation where the Illumina TruSeq kit was used to obtain stranded paired-end RNASeq data generated from cDNA molecules.<br />
<br />
== Mapping the RNASeq reads to the reference sequence ==<br />
<br />
Since sequencing was done using the TruSeq kit, the Tophat <span style="font-family:courier new,courier,monospace;">--library-type</span> parameter must be set to <span style="font-family:courier new,courier,monospace;">fr-firststrand</span>. Also, the reverse read file (<span style="font-family:courier new,courier,monospace;">R2.fastq.gz</span>) is passed into Tophat before the forward read file (<span style="font-family:courier new,courier,monospace;">R1.fastq.gz</span>). This is according to the INFBIO9120 tutorial, which states:<br />
<br />
Please note that we are providing R2 files first and R1 reads last, although the Tophat2 documentation says we should provide R1 first and R2 second. We do this because the protocol used to produce these data sequence the cDNA, and by switching R1 and R2 we will provide the correct strand orientation to the aligner.<br />
<br />
The command to execute thus becomes:<br />
<br />
<span style="font-family:courier new,courier,monospace;">tophat -o /path/to/outputFolder --library-type fr-firststrand --GTF /path/to/annotation.gtf /path/to/referenceSequenceBowtieIndex /path/to/R2.fastq.gz /path/to/R1.fastq.gz > /path/to/log.txt</span><br />
<br />
== Visualizing the mapped reads ==<br />
<br />
Even though not strictly required, it is a good idea to visualize the mapped reads, so as to better understand the subsequent steps. In this example, the ordering of the read pairs along the reference sequence determines which strand the sequenced fragment originated from. For genes on the sense strand (i.e. for genes on the "+" strand, with their 5' ends left and their 3' ends right), the forward read (R1) is the first (left-most) read of the read pair. In Tablet, this is visualized with the "green" R1 reads coming before the "blue" R2 reads (of the same read pair):<br />
<br />
[[File:Tablet stranded reads.jpg]]<br />
<br />
(In IGV, the read ordering will be designated as "F1R2"). The R1 reads will be identical to the reference sequence; the R2 reads will be the reverse-complement of the reference sequence. For genes on the anti-sense strand (the "-" strand), the read ordering will be the opposite. In Tablet, the "blue" R2 read of a given read pair will come before (i.e. be situated left of) the "green" R1 read. (In IGV, the read ordering will be designated as "F2R1").<br />
<br />
== Counting reads ==<br />
<br />
=== Using R and the summarizeOverlaps function ===<br />
<br />
If using R, the "<span style="font-family:courier new,courier,monospace;">summarizeOverlaps</span>" method (used in the INFBIO9120 tutorial) will now give correct read counts. This can be verified as follows:<br />
<br />
Using Tablet (or IGV), identify a region with only a few reads, where all the read pairs are present in the same orientation (i.e. transcription is only seen for one strand). Create a test annotation file that spans this region. In the following gff3 example, replace &lt;ContigName&gt; with the name of the contig containing the selected region, and &lt;from&gt; and &lt;to&gt; with the numeric coordinates defining the region:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
&lt;ContigName&gt; testSource mRNA &lt;from&gt; &lt;to&gt; . + . ID=ID1;Parent=ID1;Name=ID1<br />
<br />
&lt;ContigName&gt; testSource exon &lt;from&gt; &lt;to&gt; . + . Parent=ID1<br />
</div><br />
(Use Tablet to make sure the annotation is created correctly)<br />
<br />
The following R code will count the reads contained in the defined region:<br />
<div style="line-height:90%; background-color: LightGray; border-style: solid; border-width:1px; font-family:courier new,courier,monospace;"><br />
library(ShortRead)<br />
<br />
library(rtracklayer)<br />
<br />
aln <- readGAlignmentPairsFromBam('/path/to/BAMfile.bam', use.names=T)<br />
<br />
genes<-import('/path/to/testAnnotationFile.gff3', asRangedData=FALSE)<br />
<br />
splitgenes<-split(genes,genes$ID)<br />
<br />
genehits<-summarizeOverlaps(splitgenes,aln,mode="Union",singleEnd=FALSE, ignore.strand=FALSE)<br />
<br />
counts<-assays(genehits)$counts<br />
<br />
counts<br />
</div><br />
This will print the number of reads (or rather, the number of fragments) that map to the defined region (it will be zero if the reads stem from the anti-sense strand, since the test annotation was created for the "+" strand).<br />
<br />
=== Using HTSeq-count ===<br />
<br />
If using the HTSeq-count python program for read counting, the <span style="font-family:courier new,courier,monospace;">--stranded</span> parameter must be set to "<span style="font-family:courier new,courier,monospace;">yes</span>" to count the stranded data in our example correctly. Alternatively, Tophat can be used with the R1 reads before the R2 reads, followed by import into HTSeq-count using the <span style="font-family:courier new,courier,monospace;">--stranded=reverse</span> parameter (this seems to be the standard way of using HTSeq, according to the creators of HTSeq).</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Dealing_with_stranded_sequencing_data&diff=221RNASeq: Dealing with stranded sequencing data2015-05-26T14:11:57Z<p>Ralfne@uio.no: </p>
<hr />
<div>= Introduction =<br />
<br />
When using strand-specific sequencing, it is very important to make sure the coverages for the features-of-interest (usually genes) are calculated along the correct strand. Getting this wrong means that the subsequent step of finding differentially expressed genes will instead find differentially expressed anti-sense transcripts. It may not be easy to identify such a problem at a later stage, since all subsequent information consists of abstract numbers and identifiers.<br />
<br />
The correct way of counting strand-specific sequencing data depends on:<br />
<br />
*the type of molecule sequenced (for instance, cDNAs)<br />
*the sequencing technology used (for instance, Illumina)<br />
*the particular sequencing kit used (for instance, Illumina TruSeq Stranded Total RNA Sample Prep Kit with paired-end sequencing)<br />
*the settings used when mapping the reads to the reference. For the Tophat aligner, the two critical parameters are --library-type, and the order of passing in the forward and reverse read files.<br />
*the settings used when calculating the read counts.<br />
<br />
= Example =<br />
<br />
The following describes a situation where the Illumina TruSeq kit was used to obtain stranded paired-end RNASeq data generated from cDNA molecules.<br />
<br />
== Mapping the RNASeq reads to the reference sequence ==<br />
<br />
Since sequencing was done using the TruSeq kit, the Tophat <span style="font-family:courier new,courier,monospace;">--library-type</span> parameter must be set to <span style="font-family:courier new,courier,monospace;">fr-firststrand</span>. Also, the reverse read file (<span style="font-family:courier new,courier,monospace;">R2.fastq.gz</span>) is passed into Tophat before the forward read file (<span style="font-family:courier new,courier,monospace;">R1.fastq.gz</span>). This is according to the INFBIO9120 tutorial, which states:<br />
<br />
Please note that we are providing R2 files first and R1 reads last, although the Tophat2 documentation says we should provide R1 first and R2 second. We do this because the protocol used to produce these data sequence the cDNA, and by switching R1 and R2 we will provide the correct strand orientation to the aligner.<br />
<br />
The command to execute thus becomes:<br />
<br />
<span style="font-family:courier new,courier,monospace;">tophat -o /path/to/outputFolder --library-type fr-firststrand --GTF /path/to/annotation.gtf /path/to/referenceSequenceBowtieIndex /path/to/R2.fastq.gz /path/to/R1.fastq.gz > /path/to/log.txt</span><br />
<br />
== Visualizing the mapped reads ==<br />
<br />
Even though not strictly required, it is a good idea to visualize the mapped reads, so as to better understand the subsequent steps. In this example, the ordering of the read pairs along the reference sequence determines which strand the sequenced fragment originated from. For genes on the sense strand (i.e. for genes on the "+" strand, with their 5' ends left and their 3' ends right), the forward read (R1) is the first (left-most) read of the read pair. In Tablet, this is visualized with the "green" R1 reads coming before the "blue" R2 reads (of the same read pair):<br />
<br />
[[File:Tablet stranded reads.jpg]]<br />
<br />
(In IGV, the read ordering will be designated as "F1R2"). The R1 reads will be identical to the reference sequence; the R2 reads will be the reverse-complement of the reference sequence. For genes on the anti-sense strand (the "-" strand), the read ordering will be the opposite. In Tablet, the "blue" R2 read of a given read pair will come before (i.e. be situated left of) the "green" R1 read. (In IGV, the read ordering will be designated as "F2R1").<br />
<br />
== Counting reads ==<br />
<br />
=== Using R and the summarizeOverlaps function ===<br />
<br />
If using R, the "<span style="font-family:courier new,courier,monospace;">summarizeOverlaps</span>" method (used in the INFBIO9120 tutorial) will now give correct read counts. This can be verified as follows:<br />
<br />
Using Tablet (or IGV), identify a region with only a few reads, where all the read pairs are present in the same orientation (i.e. transcription is only seen for one strand). Create a test annotation file that spans this region. In the following gff3 example, replace &lt;ContigName&gt; with the name of the contig containing the selected region, and &lt;from&gt; and &lt;to&gt; with the numeric coordinates defining the region:<br />
<br />
&lt;ContigName&gt; testSource mRNA &lt;from&gt; &lt;to&gt; . + . ID=ID1;Parent=ID1;Name=ID1<br />
<br />
&lt;ContigName&gt; testSource exon &lt;from&gt; &lt;to&gt; . + . Parent=ID1<br />
<br />
(Use Tablet to make sure the annotation is created correctly)<br />
<br />
The following R code will count the reads contained in the defined region:<br />
<br />
library(ShortRead)<br />
<br />
library(rtracklayer)<br />
<br />
aln <- readGAlignmentPairsFromBam('/path/to/BAMfile.bam', use.names=T)<br />
<br />
genes<-import('/path/to/testAnnotationFile.gff3', asRangedData=FALSE)<br />
<br />
splitgenes<-split(genes,genes$ID)<br />
<br />
genehits<-summarizeOverlaps(splitgenes,aln,mode="Union",singleEnd=FALSE, ignore.strand=FALSE)<br />
<br />
counts<-assays(genehits)$counts<br />
<br />
counts<br />
<br />
This will print the number of reads (or rather, the number of fragments) that map to the defined region (it will be zero if the reads stem from the anti-sense strand, since the test annotation was created for the "+" strand).<br />
<br />
=== Using HTSeq-count ===<br />
<br />
If using the HTSeq-count python program for read counting, the <span style="font-family:courier new,courier,monospace;">--stranded</span> parameter must be set to "<span style="font-family:courier new,courier,monospace;">yes</span>" to count the stranded data in our example correctly. Alternatively, Tophat can be used with the R1 reads before the R2 reads, followed by import into HTSeq-count using the <span style="font-family:courier new,courier,monospace;">--stranded=reverse</span> parameter (this seems to be the standard way of using HTSeq, according to the creators of HTSeq).</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=File:Tablet_stranded_reads.jpg&diff=220File:Tablet stranded reads.jpg2015-05-26T14:11:13Z<p>Ralfne@uio.no: </p>
<hr />
<div></div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq:_Dealing_with_stranded_sequencing_data&diff=219RNASeq: Dealing with stranded sequencing data2015-05-26T14:09:52Z<p>Ralfne@uio.no: Created page with "= Introduction = If using strand-specific sequencing, it becomes very important to make sure the coverages for the features-of-interest (usually genes) are calculated along..."</p>
<hr />
<div>= Introduction =<br />
<br />
When using strand-specific sequencing, it is very important to make sure the coverages for the features-of-interest (usually genes) are calculated along the correct strand. Getting this wrong means that the subsequent step of finding differentially expressed genes will instead find differentially expressed anti-sense transcripts. It may not be easy to identify such a problem at a later stage, since all subsequent information consists of abstract numbers and identifiers. <br />
<br />
The correct way of counting strand-specific sequencing data depends on: <br />
<br />
*the type of molecule sequenced (for instance, cDNAs) <br />
*the sequencing technology used (for instance, Illumina) <br />
*the particular sequencing kit used (for instance, Illumina TruSeq Stranded Total RNA Sample Prep Kit with paired-end sequencing) <br />
*the settings used when mapping the reads to the reference. For the Tophat aligner, the two critical parameters are --library-type, and the order of passing in the forward and reverse read files. <br />
*the settings used when calculating the read counts.<br />
<br />
= Example =<br />
<br />
The following describes a situation where the Illumina TruSeq kit was used to obtain stranded paired-end RNASeq data generated from cDNA molecules. <br />
<br />
== Mapping the RNASeq reads to the reference sequence ==<br />
<br />
Since sequencing was done using the TruSeq kit, the Tophat <span style="font-family:courier new,courier,monospace;">--library-type</span> parameter must be set to <span style="font-family:courier new,courier,monospace;">fr-firststrand</span>. Also, the reverse read file (<span style="font-family:courier new,courier,monospace;">R2.fastq.gz</span>) is passed into Tophat before the forward read file (<span style="font-family:courier new,courier,monospace;">R1.fastq.gz</span>). This is according to the INFBIO9120 tutorial, which states: <br />
<br />
Please note that we are providing R2 files first and R1 reads last, although the Tophat2 documentation says we should provide R1 first and R2 second. We do this because the protocol used to produce these data sequence the cDNA, and by switching R1 and R2 we will provide the correct strand orientation to the aligner.<br />
<br />
The command to execute thus becomes:<br />
<br />
<span style="font-family:courier new,courier,monospace;">tophat -o /path/to/outputFolder --library-type fr-firststrand --GTF /path/to/annotation.gtf /path/to/referenceSequenceBowtieIndex /path/to/R2.fastq.gz /path/to/R1.fastq.gz > /path/to/log.txt</span><br />
<br />
== Visualizing the mapped reads ==<br />
<br />
Even though not strictly required, it is a good idea to visualize the mapped reads, so as to better understand the subsequent steps. In this example, the ordering of the read pairs along the reference sequence determines which strand the sequenced fragment originated from. For genes on the sense strand (i.e. for genes on the "+" strand, with their 5' ends left and their 3' ends right), the forward read (R1) is the first (left-most) read of the read pair. In Tablet, this is visualized with the "green" R1 reads coming before the "blue" R2 reads (of the same read pair):<br />
<br />
<br />
<br />
(In IGV, the read ordering will be designated as "F1R2"). The R1 reads will be identical to the reference sequence; the R2 reads will be the reverse-complement of the reference sequence. For genes on the anti-sense strand (the "-" strand), the read ordering will be the opposite. In Tablet, the "blue" R2 read of a given read pair will come before (i.e. be situated left of) the "green" R1 read. (In IGV, the read ordering will be designated as "F2R1").<br />
<br />
== Counting reads ==<br />
<br />
=== Using R and the summarizeOverlaps function ===<br />
<br />
If using R, the "<span style="font-family:courier new,courier,monospace;">summarizeOverlaps</span>" method (used in the INFBIO9120 tutorial) will now give correct read counts. This can be verified as follows: <br />
<br />
Using Tablet (or IGV), identify a region with only a few reads, where all the read pairs are present in the same orientation (i.e. transcription is only seen for one strand). Create a test annotation file that spans this region. In the following gff3 example, replace &lt;ContigName&gt; with the name of the contig containing the selected region, and &lt;from&gt; and &lt;to&gt; with the numeric coordinates defining the region: <br />
<br />
&lt;ContigName&gt; testSource mRNA &lt;from&gt; &lt;to&gt; . + . ID=ID1;Parent=ID1;Name=ID1 <br />
<br />
&lt;ContigName&gt; testSource exon &lt;from&gt; &lt;to&gt; . + . Parent=ID1<br />
<br />
(Use Tablet to make sure the annotation is created correctly) <br />
<br />
The following R code will count the reads contained in the defined region: <br />
<br />
library(ShortRead) <br />
<br />
library(rtracklayer) <br />
<br />
aln <- readGAlignmentPairsFromBam('/path/to/BAMfile.bam', use.names=T) <br />
<br />
genes<-import('/path/to/testAnnotationFile.gff3', asRangedData=FALSE) <br />
<br />
splitgenes<-split(genes,genes$ID) <br />
<br />
genehits<-summarizeOverlaps(splitgenes,aln,mode="Union",singleEnd=FALSE, ignore.strand=FALSE) <br />
<br />
counts<-assays(genehits)$counts <br />
<br />
counts <br />
<br />
This will print the number of reads (or rather, the number of fragments) that map to the defined region (it will be zero if the reads stem from the anti-sense strand, since the test annotation was created for the "+" strand). <br />
<br />
=== Using HTSeq-count ===<br />
<br />
If using the HTSeq-count Python program for read counting, the <span style="font-family:courier new,courier,monospace;">--stranded</span> parameter must be set to "<span style="font-family:courier new,courier,monospace;">yes</span>" to count the stranded data in our example correctly. Alternatively, TopHat can be run with the R1 reads before the R2 reads, followed by import into HTSeq-count using the <span style="font-family:courier new,courier,monospace;">--stranded=reverse</span> parameter (this seems to be the standard way of using HTSeq, according to the creators of HTSeq).</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=218RNASeq and differential gene expression analysis2015-05-26T14:03:03Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*Quality control of sequencing data<br />
*Mapping reads to a genome<br />
*Visualizing mapped reads<br />
*Obtaining the read counts<br />
*Gene expression analysis<br />
*[[RNASeq:_Dealing_with_stranded_sequencing_data|Dealing with stranded sequencing data]]</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=217RNASeq and differential gene expression analysis2015-05-26T14:02:23Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*Quality control of sequencing data<br />
*Mapping reads to a genome<br />
*Visualizing mapped reads<br />
*Obtaining the read counts<br />
*Gene expression analysis<br />
*Dealing with stranded sequencing data</div>Ralfne@uio.nohttps://wiki.uio.no/mn/ibv/bioinfwiki/index.php?title=RNASeq_and_differential_gene_expression_analysis&diff=216RNASeq and differential gene expression analysis2015-05-26T13:59:28Z<p>Ralfne@uio.no: </p>
<hr />
<div><div>Differential gene expression analysis using RNASeq involves obtaining RNA sequencing data for the conditions to be compared, mapping the RNA reads to the relevant genome (or transcriptome), counting the read coverage for features-of-interest, and using statistical procedures to infer whether the coverages vary in a systematic and statistically significant manner.</div><div><br/></div><div>This section contains some technical information for the users of Abel, the UoO high-performance computing cluster. It is not in itself a gene expression analysis tutorial. However, such a tutorial (taken from the UoO course INFBIO9120) is available for download [[Media:INF-BIOx120 RNASeq Analysis.pdf|here]]. This tutorial uses the older "DESeq" R package to do the statistical analysis. The newer "DESeq2" package is used in the following tutorial:</div><div><br/></div><div>[http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/mar_practical.pdf]</div><div><br/></div><br />
*Quality control of sequencing data<br />
*Mapping reads to a genome<br />
*Visualizing mapped reads<br />
*Obtaining the read counts<br />
*Gene expression analysis</div>Ralfne@uio.no