Variant calling on Abel

Intro

This page describes a number of shell scripts that we have developed to run a SNP calling pipeline for population-level resequencing data, a pipeline that includes GATK's HaplotypeCaller, FreeBayes, and Samtools' mpileup. The pipeline is tailored to Abel in the sense that it submits a large number of Slurm jobs in parallel in order to make efficient use of Abel's parallel computing power. The aim is to be able to start the entire pipeline with just a few commands, but this is still work in progress. Here's what we've got so far.

Preparation

To allow the greatest possible degree of automation, the pipeline expects standardized file names for the input sequence files, following the format SAMPLE_LIBRARY_REP.fastq.gz: 'SAMPLE' should be an identifier for the sampled individual, 'LIBRARY' an identifier for the DNA library extracted from that individual, and 'REP' an identifier for the mate pair replicate (e.g. R1 or R2), if applicable. The three identifiers should be separated by underscores, and there should be no additional underscores in the filename (underscores in the path are fine). All fastq.gz files should sit in the same directory somewhere on /work/users/. For example, /work/users/michaelm/aqua_genome/Working/analysis/data/ looks like this:

$ ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1856424663 Jul 30 13:47 L01Y007_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1837673868 Jul 30 13:47 L01Y007_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1839282994 Jul 30 13:47 L01Y007_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1278239970 Jul 30 13:47 L01Y009_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1266669523 Jul 30 13:47 L01Y009_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1267661234 Jul 30 13:47 L01Y009_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397796506 Jul 30 13:47 L01Y013_L001_R2.fastq.gz
...

In this case, two libraries were prepared for each sample, and mate pairs were sequenced. If you have only a single library per sample, or only one sequence file per library, you should still use file names following the SAMPLE_LIBRARY_REP.fastq.gz format, but you can simply reuse the same identifier for every 'REP', or for every 'LIBRARY'. When the pipeline is run, all fastq.gz files in the specified directory will be analysed, so make sure to remove any fastq.gz files from the directory that should not be analysed (a small snippet to check your file names against the naming convention is given at the end of this section). We also need a reference genome fasta file, which may sit somewhere else, and of course the collection of pipeline scripts, which sit in /projects/cees/scripts/snp_calling/ and are discussed below:

$ ls -l /projects/cees/scripts/snp_calling/
-rw-r--r-- 1 michaelm seq454 9957 Jul 31 19:34 map.sh
-rwxr-xr-x 1 michaelm seq454 1679 Jul 31 19:34 map.start
-rw-r--r-- 1 michaelm seq454 8768 Jul 31 19:34 merge.sh
-rwxr-xr-x 1 michaelm seq454 1525 Jul 31 19:34 merge.start
-rw-r--r-- 1 michaelm seq454 3355 Jul 31 19:34 prepare.sh
-rwxr-xr-x 1 michaelm seq454  386 Jul 31 19:34 prepare.start
...

I recommend copying the collection of scripts into a user-specific directory, since you will need to make a few small changes to them. Note: Be aware that all these scripts by default use Notur account nn9244k, which is specified in each .sh file. If you have access to a different account, please change this in the scripts!
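If you want to verify that your file names follow the naming convention before starting the pipeline, the three identifiers can be split out of each name with plain bash. The following loop is only a hypothetical helper for checking, not part of the pipeline scripts; adjust the data directory to your own:

for fastq in ../data/*.fastq.gz
do
    # strip the path and the .fastq.gz suffix
    name=$(basename "$fastq" .fastq.gz)
    # split the remaining name on underscores into the three identifiers
    sample=$(echo "$name" | cut -d '_' -f 1)
    library=$(echo "$name" | cut -d '_' -f 2)
    rep=$(echo "$name" | cut -d '_' -f 3)
    echo "$fastq -> sample: $sample, library: $library, rep: $rep"
done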

Reference genome indexing: prepare.sh

The reference genome needs to be indexed for mapping with bwa, for picard-tools, and for GATK, and all of this is done automatically by the scripts prepare.sh and prepare.start. After you've copied both scripts to somewhere in your personal user directory, edit prepare.start with your favorite unix editor (emacs, nano, etc.) to specify the path to your reference genome fasta file on line 9:

$ cat prepare.start
#!/bin/bash

# Michael Matschiner, 2014-07-31
#
# This script is a wrapper around prepare.sh.

# Replace the next variable with the path plus filename of the reference genome
# sequence in fasta format.
ref_with_relative_path=../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta

# Start preparing the reference.
sbatch -o prepare.out prepare.sh $ref_with_relative_path

You can then start the script by typing

$ ./prepare.start

Index files will be written to the same directory as the reference genome fasta file, and script output will be recorded in the file prepare.out, which is written to the directory from which you started the script.
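prepare.sh itself is not reproduced here; the indexing it performs corresponds to the standard index files that bwa, samtools/GATK, and picard-tools expect. A rough sketch of the equivalent commands (the exact picard-tools invocation depends on the module version loaded on Abel):

# index for bwa mem mapping
bwa index reference.fasta
# fasta index (.fai), needed by samtools and GATK
samtools faidx reference.fasta
# sequence dictionary (.dict), needed by picard-tools and GATK
java -jar CreateSequenceDictionary.jar R=reference.fasta O=reference.dict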

Mapping with bwa: map.sh

Each sampled individual and library will be mapped against the reference genome with bwa mem by the script map.sh, and all these runs are started in parallel with map.start. If mate pair data are available, the two read files of each pair will be passed jointly to bwa mem. Again, you'll have to modify map.start by changing the reference file and data directory specifications near the top of the script, on lines 9 and 13:

$ cat map.start
#!/bin/bash

# Michael Matschiner, 2014-07-31
#
# This script is a wrapper around map.sh.

# Replace the next variable with the path plus filename of the reference genome
# sequence in fasta format.
ref_with_relative_path=../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta

# Replace the next variable with the name of the directory that contains your
# fastq.gz sequence files.
data_dir=../data/
...

No further changes should be necessary. You can then start parallel mapping by typing

$ ./map.start

Depending on how many fastq.gz files you have in your data directory, this may trigger a large number of parallel runs of map.sh at the same time. Subsequent to mapping with bwa mem, this script will also perform coordinate sorting with samtools, duplicate marking with picard-tools, indel realignment with GATK, and read group addition based on the standardized input file names, once again using picard-tools.
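The exact commands and flags are defined in map.sh; conceptually, the processing of one library corresponds roughly to the steps below. File names, thread counts, and the read group fields are placeholders, and the order in which map.sh applies the post-processing steps may differ slightly:

# map both mates jointly with bwa mem and convert to bam
bwa mem -M -t 8 reference.fasta SAMPLE_LIB_R1.fastq.gz SAMPLE_LIB_R2.fastq.gz \
    | samtools view -bS - > SAMPLE_LIB.bam

# coordinate sorting with samtools (old samtools syntax: output prefix)
samtools sort SAMPLE_LIB.bam SAMPLE_LIB.sorted

# duplicate marking with picard-tools
java -jar MarkDuplicates.jar INPUT=SAMPLE_LIB.sorted.bam \
    OUTPUT=SAMPLE_LIB.sorted.dedup.bam METRICS_FILE=SAMPLE_LIB.dedup.metrics.txt

# read group addition derived from the standardized file name (picard-tools)
java -jar AddOrReplaceReadGroups.jar INPUT=SAMPLE_LIB.sorted.dedup.bam \
    OUTPUT=SAMPLE_LIB.sorted.dedup.rg.bam \
    RGID=SAMPLE_LIB RGSM=SAMPLE RGLB=LIB RGPL=illumina RGPU=SAMPLE_LIB

# indel realignment with GATK (requires an indexed bam)
samtools index SAMPLE_LIB.sorted.dedup.rg.bam
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta \
    -I SAMPLE_LIB.sorted.dedup.rg.bam -o SAMPLE_LIB.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta \
    -I SAMPLE_LIB.sorted.dedup.rg.bam -targetIntervals SAMPLE_LIB.intervals \
    -o SAMPLE_LIB.sorted.dedup.rg.realn.bam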

Merging per-library bam files: merge.sh

This step is only needed if you have data from more than one library per individual. The script merge.sh combines the per-library bam files with picard-tools and then performs a new round of samtools sorting, picard-tools deduplication, and GATK indel realignment on the merged file. Once again, merge.sh can be started in parallel for all samples using the wrapper script merge.start, after changing the same two parameters in merge.start as in map.start (see above), on lines 9 and 13:

$ ./merge.start
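The authoritative commands are in merge.sh; a rough sketch of what the merging step boils down to, assuming two libraries per sample and judging from the output name used in the next section:

# combine the per-library bam files of one sample (picard-tools)
java -jar MergeSamFiles.jar INPUT=SAMPLE_LIB1.sorted.dedup.rg.realn.bam \
    INPUT=SAMPLE_LIB2.sorted.dedup.rg.realn.bam OUTPUT=SAMPLE.merged.bam

# a new round of sorting and deduplication on the merged file
samtools sort SAMPLE.merged.bam SAMPLE.merged.sorted
java -jar MarkDuplicates.jar INPUT=SAMPLE.merged.sorted.bam \
    OUTPUT=SAMPLE.merged.sorted.dedup.bam METRICS_FILE=SAMPLE.merged.dedup.metrics.txt

# GATK indel realignment then follows as in the mapping step, producing
# SAMPLE.merged.sorted.dedup.realn.bam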

Per-individual SNP calling with GATK: gatki.sh

Since GATK version 3, variant calling with this tool is a two-step process, with single-sample calling in gVCF mode followed by a joint genotyping analysis. To exploit Abel's parallel processing capacity as much as possible, GATK variant calling is therefore split into two scripts: gatki.sh for individual calling, which can be run in parallel for all merged bam files, and gatkj.sh for the joint analysis, which cannot be parallelized. Again, gatki.sh is started via the wrapper script gatki.start, after the reference and data directory have been specified correctly on lines 9 and 13 of that file:

$ ./gatki.start

This script produces a vcf file (in gVCF format) named after the input bam file, as SAMPLE.merged.sorted.dedup.realn.vcf, and this file is copied to the data directory.
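Stripped of the Slurm boilerplate, the core of what gatki.sh runs is GATK 3's HaplotypeCaller in gVCF mode, presumably something along these lines (paths and any additional options are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I SAMPLE.merged.sorted.dedup.realn.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o SAMPLE.merged.sorted.dedup.realn.vcf

The two --variant_index_* options are required by some GATK 3 versions when the gVCF output file name does not end in .g.vcf.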

Joint SNP calling with GATK: gatkj.sh

TO BE WRITTEN
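Until this section is written, for orientation: the joint step that gatkj.sh will wrap is, in GATK 3, the GenotypeGVCFs tool, which combines all per-individual gVCFs into one jointly genotyped vcf. A minimal sketch (sample and output names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    --variant SAMPLE1.merged.sorted.dedup.realn.vcf \
    --variant SAMPLE2.merged.sorted.dedup.realn.vcf \
    --variant SAMPLE3.merged.sorted.dedup.realn.vcf \
    -o all_samples.gatk.vcf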

Joint SNP calling with FreeBayes: freebayes.sh

TO BE WRITTEN
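Until this section is written, for orientation: FreeBayes performs joint calling directly on the merged bam files of all samples, so freebayes.sh will presumably wrap a single command of roughly this form (file names are placeholders):

freebayes -f reference.fasta \
    SAMPLE1.merged.sorted.dedup.realn.bam \
    SAMPLE2.merged.sorted.dedup.realn.bam \
    SAMPLE3.merged.sorted.dedup.realn.bam \
    > all_samples.freebayes.vcf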

Joint SNP calling with Samtools' mpileup: mpileup.sh

TO BE WRITTEN
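Until this section is written, for orientation: joint calling with Samtools' mpileup typically pipes the likelihoods into bcftools, so mpileup.sh will presumably wrap something like the command below. The exact flags depend on the samtools and bcftools versions installed on Abel (older versions use bcftools view -vcg instead of bcftools call); file names are placeholders:

samtools mpileup -ugf reference.fasta \
    SAMPLE1.merged.sorted.dedup.realn.bam \
    SAMPLE2.merged.sorted.dedup.realn.bam \
    SAMPLE3.merged.sorted.dedup.realn.bam \
    | bcftools call -vm -O v -o all_samples.mpileup.vcf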