Difference between revisions of "Variant calling on Abel"

(Created page with "==Software installation== Programs that are used by a large number of users at CEES are generally maintained by our IT support, who also creates 'modules' for each program ins...")
 

Revision as of 19:24, 31 July 2014

Intro

This page describes a set of shell scripts that we have developed to run a SNP calling pipeline for population-level resequencing data. The pipeline includes GATK's HaplotypeCaller, FreeBayes, and Samtools' mpileup. It is tailored to Abel in the sense that it submits a large number of Slurm scripts in parallel to make efficient use of Abel's parallel computing power. The aim is to be able to start the entire pipeline with just a few clicks, but this is still work in progress. Here is what we have so far.
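The idea of submitting one Slurm job per input file can be sketched as follows. This is a minimal sketch and not the pipeline's actual code: the function name submit_jobs, the job script name map.slurm in the usage line, and the DRY_RUN switch are all hypothetical and only serve as illustration. The only real command used is sbatch, Slurm's standard job submission command.

```shell
# Hypothetical helper: submit one Slurm job per input file.
# Unless DRY_RUN is set to something other than 1, the sbatch
# commands are only printed, so they can be inspected first.
submit_jobs() {
    job_script=$1
    shift
    for input in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sbatch ${job_script} ${input}"
        else
            # sbatch returns right after queueing the job, so all jobs
            # can run in parallel once the scheduler has free resources.
            sbatch "${job_script}" "${input}"
        fi
    done
}
```

Calling submit_jobs map.slurm a.fastq.gz b.fastq.gz would print two sbatch commands, one per file; setting DRY_RUN=0 would submit them instead.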

Preparation

To allow as much automation as possible, the pipeline expects input sequence files to follow a standardized naming format, SAMPLE_LIBRARY_REP.fastq.gz, where 'SAMPLE' is an identifier for the sampled individual, 'LIBRARY' is an identifier for the DNA library extracted from that individual, and 'REP' is an identifier for the mate pair replicate, if applicable. All fastq.gz files should sit in the same directory somewhere on /work/users/. For example, /work/users/michaelm/aqua_genome/Working/analysis/data/ looks like this:

ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1856424663 Jul 30 13:47 L01Y007_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1837673868 Jul 30 13:47 L01Y007_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1839282994 Jul 30 13:47 L01Y007_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1278239970 Jul 30 13:47 L01Y009_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1266669523 Jul 30 13:47 L01Y009_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1267661234 Jul 30 13:47 L01Y009_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397796506 Jul 30 13:47 L01Y013_L001_R2.fastq.gz
...
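
Before starting the pipeline, it can help to check that every file name really follows the SAMPLE_LIBRARY_REP.fastq.gz format. The following is a minimal sketch of such a check. The function name parse_fastq_name is hypothetical and not part of the pipeline, and the sketch assumes that none of the three fields contains an underscore itself.

```shell
# Hypothetical helper: split SAMPLE_LIBRARY_REP.fastq.gz into its fields.
# Prints "SAMPLE LIBRARY REP" on success; returns 1 if the name does not
# consist of exactly three non-empty fields followed by .fastq.gz.
parse_fastq_name() {
    name=${1##*/}                  # strip any leading directory
    case "$name" in
        *.fastq.gz) stem=${name%.fastq.gz} ;;
        *) return 1 ;;
    esac
    case "$stem" in
        *_*_*_*) return 1 ;;       # more than three fields
        ?*_?*_?*) ;;               # exactly three non-empty fields
        *) return 1 ;;             # fewer fields, or an empty one
    esac
    sample=${stem%%_*}             # text before the first underscore
    rest=${stem#*_}                # text after the first underscore
    library=${rest%%_*}
    rep=${rest#*_}
    echo "$sample $library $rep"
}
```

For the first file in the listing above, parse_fastq_name L01Y007_L001_R1.fastq.gz prints "L01Y007 L001 R1". Running the function over all files in the data directory and reporting those for which it returns 1 gives a quick list of names that need fixing.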