Variant calling on Abel

From mn/bio/cees-bioinf
Revision as of 11:01, 1 August 2014 by Michamat@uio.no (talk | contribs) (Preparation)

Jump to: navigation, search

Intro

This site describes a number of shell scripts that we have developed to run a variant calling pipeline for population-level resequencing data, which includes GATK's HaplotypeCaller, FreeBayes, and Samtools' mpileup. The pipeline is tailored for use with Abel in the sense that it triggers a large number of slurm scripts in parallel in order to efficiently use Abel's parallel computing power. The aim is to be able to start the entire pipeline with just a number of clicks, but this is still work in progress. Here's what we've got so far.

Preparation

In order to allow the greatest extent of automatization, the pipeline expects standardized names for input sequence files according to the format SAMPLE_LIBRARY_REP.fastq.gz, where 'SAMPLE' should be an identifier for the sampled individual, 'LIBRARY' should be an identifier for the DNA library extracted from that individual, and REP should be an identifier for the mate pair replicate. These three identifiers should be separated with underscores, and there should be no additional underscores in the filename (but can be in the path). All fastq.gz files should sit in the same directory somewhere on /work/users/. For example, /work/users/michaelm/aqua_genome/Working/analysis/data/ looks like this:

ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1856424663 Jul 30 13:47 L01Y007_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1837673868 Jul 30 13:47 L01Y007_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1839282994 Jul 30 13:47 L01Y007_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1278239970 Jul 30 13:47 L01Y009_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1266669523 Jul 30 13:47 L01Y009_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1267661234 Jul 30 13:47 L01Y009_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397796506 Jul 30 13:47 L01Y013_L001_R2.fastq.gz
...

In this case, two libraries were prepared for each sample, and mate pairs were sequenced. If you should have for each sample only a single library, or for each library only one sequence file, you should still use file names according to the SAMPLE_LIBRARY_REP.fastq.gz, but you could just use the same identifier for all 'REP's, or for each 'LIBRARY', like this:

ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
...

By running the pipeline, all fastq.gz files in the specified directory will be analysed, so make sure to remove fastq.gz files from the directory if they should not be analysed. Of course, we also need a reference genome fasta file that may sit somewhere else. The collection of pipeline scripts are sitting in /projects/cees/scripts/variant_calling/:

ls -l /projects/cees/scripts/variant_calling/
-rw-r--r-- 1 michaelm seq454  2651 Aug  1 10:50 gatki.sh
-rwxr-xr-x 1 michaelm seq454  1100 Aug  1 10:50 gatki.start
-rw-r--r-- 1 michaelm seq454  3052 Aug  1 10:51 gatkj.sh
-rwxr-xr-x 1 michaelm seq454   908 Aug  1 10:51 gatkj.start
-rw-r--r-- 1 michaelm seq454 10144 Aug  1 10:40 map.sh
-rwxr-xr-x 1 michaelm seq454  1744 Aug  1 10:45 map.start
-rw-r--r-- 1 michaelm seq454  8717 Aug  1 10:45 merge.sh
-rwxr-xr-x 1 michaelm seq454  2167 Aug  1 10:46 merge.start
-rw-r--r-- 1 michaelm seq454  3316 Aug  1 10:40 prepare.sh
-rwxr-xr-x 1 michaelm seq454   561 Aug  1 10:38 prepare.start
...

The easiest way to use these script is by loading the variant_calling module:

module load variant_calling

Reference genome indexing: prepare.sh

The reference genome will need to be indexed for mapping with bwa, for picard-tools, and for GATK, and all this is automatically done with scripts prepare.sh and prepare.start. After you've copied both scripts to somewhere in your personal user directory, you should edit prepare.start with your favorite unix editor (emacs, nano, etc) to specify a path to your reference genome fasta file on line 9:

$ cat prepare.start
#!/bin/bash

# Michael Matschiner, 2014-07-31
#
# This script is a wrapper around prepare.sh.

# Replace the next variable with the path plus filename of the reference genome
# sequence in fasta format.
ref_with_relative_path=../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta

# Start preparing the reference.
sbatch -o prepare.out prepare.sh $ref_with_relative_path

You can then start the script by typing

$ ./prepare.start

Index files will be written to the same directory as the reference genome fasta file, and script output will be recorded in file prepare.out, which will be written just where you started the script.

Mapping with bwa: map.sh

Each sampled individual and library will be mapped against the reference genome with bwa mem by script map.sh, and all these runs are started in parallel with map.start. If mate pair data is available, these will be used jointly by bwa mem. Again, you'll have to modify map.start, by changing the reference file and data directory specifications near the top of the script, on lines 9 and 13:

$ cat map.start
#!/bin/bash

# Michael Matschiner, 2014-07-31
#
# This script is a wrapper around map.sh.

# Replace the next variable with the path plus filename of the reference genome
# sequence in fasta format.
ref_with_relative_path=../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta

# Replace the next variable with the name of the directory that contains your
# fastq.gz sequence files.
data_dir=../data/
...

No further changes should be necessary. You can then start parallel mapping by typing

$ ./map.start

Depending on how many fastq.gz files you have in your data directory, this may trigger a large number of parallel runs of map.sh at the same time. Subsequent to mapping with bwa mem, this script will also perform coordinate sorting with samtools, duplicate marking with picard-tools, indel realignment with GATK, and read group addition based on the standardized input file names, once again using picard-tools.

Merging per-library bam files: merge.sh

This step is only needed if you have data for more than one library per individual. The script merge.sh combines these with picard-tools, and then does a new round of samtools sorting, picard-tools deduplication, and GATK indel realignment on the merged file. Once again, merge.sh can be started in parallel for all samples, using the wrapper script merge.start after changing the same two parameters in merge.start as in map.start (see above), on lines 9 and 13.

$ ./merge.start

Variant calling with GATK: gatki.sh and gatkj.sh

Since GATK version 3, variant calling with this tool is a two-step process, with single-sample calling in gVCF mode followed by joint genotyping analysis. In order to exploit the parallel processing capacities of Abel as much as we can, GATK variant calling is therefore split into two scripts: gatki.sh for individual calling, which can be run in parallel for all merged bam files, and gatkj for the joint analysis, which can not be parallelized. Again, gatki.sh is started via the wrapper script gatki.start, after reference and data directory are correctly specified on lines 9 and 13 of that file:

$ ./gatki.start

This script will produce a vcf file for each input bam file, named just like the input bam file, but with ending 'vcf' instead of 'bam'. This file will be copied to the data directory.

The second step of GATK variant calling runs with gatkj.sh, which as usual has its wrapper script gatkj.start, in which the reference and data dir specifications must be changed.

$ ./gatkj.start

This produces a single file called gatk.vcf which will be copied to the data directory. This is the unfiltered variant call result, which can then be used for filtering and extraction of either SNPs or indels.

Variant calling with FreeBayes: freebayes.sh

TO BE WRITTEN

Variant calling with Samtools' mpileup: mpileup.sh

TO BE WRITTEN