Variant calling on Abel

Intro

This page describes a number of shell scripts that we have developed to run a variant calling pipeline for population-level resequencing data, which includes GATK's HaplotypeCaller, FreeBayes, and Samtools' mpileup. The pipeline is tailored to Abel in the sense that it submits a large number of slurm jobs in parallel in order to make efficient use of Abel's parallel computing power. The aim is to be able to start the entire pipeline with just a handful of commands, but this is still work in progress. Here's what we've got so far.

Preparation

To allow the greatest degree of automation, the pipeline expects standardized names for input sequence files according to the format SAMPLE_LIBRARY_REP.fastq.gz, where 'SAMPLE' should be an identifier for the sampled individual, 'LIBRARY' should be an identifier for the DNA library extracted from that individual, and 'REP' should be an identifier for the mate pair replicate. These three identifiers should be separated with underscores, and there should be no additional underscores in the filename (underscores in the path are fine). All fastq.gz files should sit in the same directory somewhere on /work/users/. For example, /work/users/michaelm/aqua_genome/Working/analysis/data/ looks like this:

$ ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1856424663 Jul 30 13:47 L01Y007_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1837673868 Jul 30 13:47 L01Y007_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1839282994 Jul 30 13:47 L01Y007_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1278239970 Jul 30 13:47 L01Y009_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1266669523 Jul 30 13:47 L01Y009_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1267661234 Jul 30 13:47 L01Y009_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397796506 Jul 30 13:47 L01Y013_L001_R2.fastq.gz
...

In this case, two libraries were prepared for each sample, and mate pairs were sequenced. If you have only a single library per sample, or only one sequence file per library, you should still name the files according to the SAMPLE_LIBRARY_REP.fastq.gz format, but you could simply use the same identifier for all 'REP's, or for each 'LIBRARY', like this:

$ ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
...
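
How the pipeline derives the three identifiers from a file name is simply a matter of splitting on underscores. The following lines only illustrate that convention (the variable names are hypothetical and not taken from the actual scripts):

# hypothetical example: recover SAMPLE, LIBRARY and REP from a standardized file name
file=/work/users/michaelm/aqua_genome/Working/analysis/data/L01Y007_L001_R1.fastq.gz
base=$(basename $file .fastq.gz)           # L01Y007_L001_R1
sample=$(echo $base | cut -d '_' -f 1)     # SAMPLE:  L01Y007
library=$(echo $base | cut -d '_' -f 2)    # LIBRARY: L001
rep=$(echo $base | cut -d '_' -f 3)        # REP:     R1
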

When the pipeline is run, all fastq.gz files in the specified directory will be analysed, so make sure to remove fastq.gz files from the directory if they should not be analysed. Of course, we also need a reference genome fasta file, which may sit somewhere else. The pipeline scripts themselves sit in /projects/cees/scripts/variant_calling/:

$ ls -l /projects/cees/scripts/variant_calling/
-rw-r--r-- 1 michaelm seq454  4048 Aug  2 19:28 freebayes1.sh
-rwxr-xr-x 1 michaelm seq454   960 Aug  1 17:28 freebayes1.start
-rw-r--r-- 1 michaelm seq454  4513 Aug  2 23:44 freebayes2.sh
-rwxr-xr-x 1 michaelm seq454  1344 Aug  2 20:51 freebayes2.start
-rw-r--r-- 1 michaelm seq454  2478 Aug  2 23:57 freebayes3.sh
-rwxr-xr-x 1 michaelm seq454   730 Aug  1 19:18 freebayes3.start
-rw-r--r-- 1 michaelm seq454  2681 Aug 26 18:57 gatk1.sh
-rwxr-xr-x 1 michaelm seq454  1141 Aug  3 00:10 gatk1.start
-rw-r--r-- 1 michaelm seq454  3077 Aug 26 18:58 gatk2.sh
-rwxr-xr-x 1 michaelm seq454   937 Aug  1 19:38 gatk2.start
-rw-r--r-- 1 michaelm seq454 10594 Aug 25 15:51 map.sh
-rwxr-xr-x 1 michaelm seq454  1792 Aug  3 00:10 map.start
-rw-r--r-- 1 michaelm seq454  9275 Aug 26 11:33 merge.sh
-rwxr-xr-x 1 michaelm seq454  2311 Aug  3 00:10 merge.start
...

The easiest way to use these scripts is by loading the variant_calling module:

$ module load variant_calling

Reference genome indexing: prepare.sh

The reference genome needs to be indexed for mapping with bwa, for picard-tools, and for GATK. All of this is done automatically by the script prepare.sh and its wrapper script prepare.start. If you've loaded the variant_calling module as described above, you can start prepare.sh by typing

$ prepare.start NOTUR_ACCOUNT REFERENCE

where NOTUR_ACCOUNT should be replaced by the Notur account number to which Abel should bill the needed CPU hours. If you're not sure which Notur account you're on, just type 'projects'. REFERENCE should be replaced by the reference genome file in fasta format (something like ../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta). Index files will be written to the same directory as the reference genome fasta file, and script output will be recorded in the file prepare.out, which is written to the directory from which you started the script. The output of the programs started by prepare.sh will be written to the file prepare.log, in the same directory as prepare.out.
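
For orientation, the indexing that prepare.sh performs corresponds roughly to the following standard commands (a sketch only; the picard invocation and exact options on Abel may differ):

# rough sketch of the indexing steps performed by prepare.sh (illustration, not the actual script)
REF=reference.fasta                          # placeholder for the reference genome fasta file
bwa index $REF                               # bwa mapping index
samtools faidx $REF                          # fasta index (.fai), required by GATK
java -jar picard.jar CreateSequenceDictionary R=$REF O=${REF%.fasta}.dict   # sequence dictionary, required by GATK
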

Mapping with bwa: map.sh

Once the reference is fully indexed, the next step is to map the reads of each library against the reference genome. This is done by the script map.sh, which uses bwa mem for mapping and is started in parallel for each library with map.start. If mate pair data are available for some or all libraries, the two read files will be used jointly by bwa mem. You can then start parallel mapping by typing

$ map.start NOTUR_ACCOUNT REFERENCE DATA_DIR

Here, the first two arguments, NOTUR_ACCOUNT and REFERENCE, are used in the same way as for prepare.start (see above), but a third argument is needed: DATA_DIR should be replaced by the name of the directory that holds all fastq.gz files to be analysed. Depending on how many fastq.gz files you have in your data directory, this script may trigger a (very) large number of parallel runs of map.sh at the same time. If things get out of control, you can always monitor your current Abel runs by typing

$ squeue -u USERNAME

where USERNAME should be replaced with your user name. If you need to cancel current runs, you can do that with

$ scancel RUN_ID

Subsequent to mapping with bwa mem, this script will also perform coordinate sorting with samtools, duplicate marking with picard-tools, indel realignment with GATK, and read group addition based on the standardized input file names, once again using picard-tools. It will write files in the format SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam (plus the corresponding index files SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam.bai) to the specified data directory, and run report files will be saved as map.SAMPLE_LIBRARY.out and map.SAMPLE_LIBRARY.log in the directory where you started the run.
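
As a rough, hypothetical sketch of the per-library processing in map.sh (the tool versions, jar locations, and exact options used on Abel may differ):

# sketch of the per-library steps in map.sh (illustration only; ID stands for one SAMPLE_LIBRARY pair)
ID=SAMPLE_LIBRARY
bwa mem $REF ${ID}_R1.fastq.gz ${ID}_R2.fastq.gz > ${ID}.sam       # mapping
samtools view -bS ${ID}.sam | samtools sort - ${ID}.sorted         # conversion and coordinate sorting
java -jar picard.jar MarkDuplicates INPUT=${ID}.sorted.bam OUTPUT=${ID}.sorted.dedup.bam METRICS_FILE=${ID}.dedup_metrics.txt
# ...followed by GATK indel realignment (RealignerTargetCreator and IndelRealigner) and read group
# addition with picard-tools AddOrReplaceReadGroups, giving ${ID}.sorted.dedup.realn.rgadd.bam
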

Merging per-library bam files: merge.sh

This step is only strictly needed if you have data from more than one library per individual; however, you should run it even if you only have a single library. The script merge.sh combines multiple libraries per individual with picard-tools, and then performs a new round of samtools sorting, picard-tools deduplication, and GATK indel realignment on the merged file. Once again, merge.sh can be started in parallel for all samples, using the wrapper script merge.start. The output files will be named SAMPLE.merged.sorted.dedup.realn.bam and saved to the data directory (together with index files named SAMPLE.merged.sorted.dedup.realn.bam.bai). The reason why you should run merge.start even with a single library is that downstream scripts of this pipeline expect files named SAMPLE.merged.sorted.dedup.realn.bam in that directory. To handle this case, merge.start detects whether multiple libraries are present for each individual, and where that is not the case, it simply copies SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam to SAMPLE.merged.sorted.dedup.realn.bam (and SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam.bai to SAMPLE.merged.sorted.dedup.realn.bam.bai), without submitting any jobs to the Abel queue. Just like map.start, you start merge.start with

$ merge.start NOTUR_ACCOUNT REFERENCE DATA_DIR

Run report files will be named merge.SAMPLE.out and merge.SAMPLE.log and saved to the directory in which you started the script.
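
Conceptually, the per-sample logic of merge.sh and merge.start looks roughly like the sketch below (illustration only, using the library names from the example listing above; the actual scripts also re-index the merged files):

# sketch of the merge-or-copy logic for one sample (illustration only)
if [ $(ls SAMPLE_*.sorted.dedup.realn.rgadd.bam | wc -l) -gt 1 ]
then
    java -jar picard.jar MergeSamFiles INPUT=SAMPLE_L001.sorted.dedup.realn.rgadd.bam \
        INPUT=SAMPLE_L002.sorted.dedup.realn.rgadd.bam OUTPUT=SAMPLE.merged.bam
    # ...followed by samtools sorting, picard-tools deduplication, and GATK indel realignment,
    # producing SAMPLE.merged.sorted.dedup.realn.bam
else
    cp SAMPLE_L001.sorted.dedup.realn.rgadd.bam SAMPLE.merged.sorted.dedup.realn.bam
    cp SAMPLE_L001.sorted.dedup.realn.rgadd.bam.bai SAMPLE.merged.sorted.dedup.realn.bam.bai
fi
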

Variant calling with GATK: gatk1.sh and gatk2.sh

Since GATK version 3, variant calling with GATK's HaplotypeCaller is a two-step process, with single-sample calling in what they call gVCF mode, followed by a joint genotyping analysis. In order to exploit the parallel processing capacities of Abel as much as we can, GATK variant calling is therefore split into two scripts: gatk1.sh for individual calling, which can be run in parallel for all merged bam files, and gatk2.sh for the joint analysis, which cannot be parallelized. Again, gatk1.sh is started via the wrapper script gatk1.start, just like before:

$ gatk1.start NOTUR_ACCOUNT REFERENCE DATA_DIR

This script will produce a vcf file for each input bam file, named just like the input bam file but with the ending 'vcf' instead of 'bam': SAMPLE.merged.sorted.dedup.realn.vcf. These files will be written to the data directory, and run reports named gatk1.SAMPLE.out and gatk1.SAMPLE.log will be written to the directory where you started gatk1.start.
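
For reference, per-sample calling in gVCF mode with GATK 3's HaplotypeCaller looks roughly like this (a sketch; the exact options used by gatk1.sh may differ):

# sketch of the per-sample HaplotypeCaller call in gatk1.sh (illustration only)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R $REF -I SAMPLE.merged.sorted.dedup.realn.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o SAMPLE.merged.sorted.dedup.realn.vcf
# the two index options are needed in GATK 3 when the gVCF output name does not end in .g.vcf
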

The second step of GATK variant calling runs with gatk2.sh, which as usual is started via its wrapper script gatk2.start, with the same reference and data directory arguments. After all gatk1.sh runs have finished, type

$ gatk2.start NOTUR_ACCOUNT REFERENCE DATA_DIR

This produces a file called gatk.vcf.gz and its index file gatk.vcf.gz.csi, which will both be written to the data directory. This is the unfiltered variant call result, which can then be used for filtering and extraction of either SNPs or indels. Run report files named gatk2.out and gatk2.log will be written to the directory where you started the run.
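
The joint genotyping step corresponds roughly to GATK 3's GenotypeGVCFs run on all per-sample vcf files; compression and indexing of the result with bgzip and bcftools is an assumption here, shown only to explain where the .gz and .csi files come from:

# sketch of the joint genotyping step in gatk2.sh (illustration only)
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $REF \
    -V SAMPLE1.merged.sorted.dedup.realn.vcf \
    -V SAMPLE2.merged.sorted.dedup.realn.vcf \
    -o gatk.vcf
bgzip gatk.vcf              # produces gatk.vcf.gz
bcftools index gatk.vcf.gz  # produces gatk.vcf.gz.csi
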

Variant calling with FreeBayes: freebayes1.sh-freebayes3.sh

FreeBayes is a bit trickier to use than GATK, since the program does not include a per-individual step that can be parallelized on Abel, and a standard FreeBayes run can therefore exceed Abel's 1-week run time limit if you have lots of individuals. The solution is to first split the reference genome into different regions that can be analysed independently and in parallel by FreeBayes. This requires some logistics, though. First, we need a script that splits the reference genome into multiple regions and at the same time produces a header for the final variant call file; this is done by freebayes1.sh and freebayes1.start. Next, we can run FreeBayes in parallel for each region in order to save time; this is achieved by freebayes2.sh and freebayes2.start. Finally, the vcf result files for the individual regions are combined with the joint vcf header that was produced in the first step, to form the combined vcf file freebayes.vcf. This last step is done by freebayes3.sh and freebayes3.start.

So we go

$ freebayes1.start NOTUR_ACCOUNT REFERENCE DATA_DIR

to produce regions files that link to different parts of the reference genome, and to produce a file for the vcf header. By default, freebayes1.sh produces 24 regions files, meaning 24 individual runs of freebayes2.sh, and a reduction of run time by roughly a factor of 24. Regions files will appear in the data directory and will be named freebayes.region1.txt to freebayes.region24.txt. The vcf header file will appear in the same directory and be named freebayes.header.vcf. Run reports named freebayes1.out and freebayes1.log will be written to the data directory.
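
To illustrate the idea (this is not the actual implementation, and the format of the region files is an assumption), the reference could be split by distributing its scaffolds over 24 region files based on the fasta index; production of the header file is omitted here:

# sketch: distribute scaffolds over 24 region files using the fasta index (illustration only)
i=0
while read scaffold length rest
do
    echo "${scaffold}:0-${length}" >> freebayes.region$(( i % 24 + 1 )).txt   # whole-scaffold region
    i=$(( i + 1 ))
done < ${REF}.fai
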

Once all the region files are ready, we can start freebayes2.sh by

$ freebayes2.start NOTUR_ACCOUNT REFERENCE DATA_DIR

This starts parallel runs of freebayes2.sh for each individual region file. Behind the scenes, each individual run is multi-threaded for faster run times. Following variant calling with FreeBayes, the resulting vcf files need to be sorted, which is done with vcftools. Region-specific sorted vcf files will be written to the data directory, and run reports named freebayes2.REGION.txt and freebayes2.REGION.log will be written to where you started the run.
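
One per-region run corresponds roughly to a FreeBayes call restricted to that region, followed by sorting with vcftools' vcf-sort (a sketch, assuming for simplicity that each region file contains a single region string that can be passed to FreeBayes' --region option; the output file names are placeholders):

# sketch of one per-region FreeBayes run in freebayes2.sh (illustration only)
REGION=$(cat freebayes.region1.txt)
freebayes -f $REF --region $REGION *.merged.sorted.dedup.realn.bam > freebayes.unsorted.region1.vcf
vcf-sort freebayes.unsorted.region1.vcf > freebayes.region1.vcf    # sorting with vcftools
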

Finally, when all per-region vcf files have been written to the data directory, we use freebayes3.sh to combine and sort them. This time the reference does not need to be specified:

$ freebayes3.start NOTUR_ACCOUNT DATA_DIR

This will write the joint but unfiltered variant calling result file freebayes.vcf.gz and the corresponding index file freebayes.vcf.gz.csi, again to the data directory, and run reports named freebayes3.out and freebayes3.log will be written to the directory in which the script was started.
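
Conceptually, this last step just puts the header in front of the variant lines of all per-region files, then compresses and indexes the result (a sketch; the per-region file names, and the use of bgzip and bcftools for compression and indexing, are assumptions):

# sketch of the combination step in freebayes3.sh (illustration only)
cat freebayes.header.vcf > freebayes.vcf
for f in freebayes.region*.vcf
do
    grep -v "^#" $f >> freebayes.vcf     # append variant lines without the per-region headers
done
bgzip freebayes.vcf                      # produces freebayes.vcf.gz
bcftools index freebayes.vcf.gz          # produces freebayes.vcf.gz.csi
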

Variant calling with Samtools' mpileup: mpileup1.sh

As a third alternative, we're using Samtools' plain vanilla variant caller, mpileup, run by mpileup1.sh, which in turn is started with mpileup1.start:

$ mpileup1.start NOTUR_ACCOUNT REFERENCE DATA_DIR

Since mpileup has so far performed reasonably fast with the numbers of individuals tested, mpileup1.sh does not parallelize by splitting between individuals or regions, but this could be implemented if the need arises. It writes the files mpileup.vcf.gz and mpileup.vcf.gz.csi to the data directory and run reports mpileup1.out and mpileup1.log to where you started the run.
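
The mpileup step corresponds roughly to the classical samtools/bcftools calling pipe (a sketch; the options actually used by mpileup1.sh may differ):

# sketch of the joint calling step in mpileup1.sh (illustration only)
samtools mpileup -u -f $REF *.merged.sorted.dedup.realn.bam | \
    bcftools call -m -v -O z -o mpileup.vcf.gz    # multiallelic caller, variant sites only, compressed output
bcftools index mpileup.vcf.gz                     # produces mpileup.vcf.gz.csi
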

Filtering SNPs from the three vcf files: filter_snps.sh

After GATK, FreeBayes, and Samtools' mpileup have finished, we should have three compressed vcf files and the respective index files in the data directory: gatk.vcf.gz and gatk.vcf.gz.csi, freebayes.vcf.gz and freebayes.vcf.gz.csi, and mpileup.vcf.gz and mpileup.vcf.gz.csi. These are essentially unfiltered, but filtering should be done before any downstream analyses to reduce the number of erroneous variant calls. Depending on the intended analysis, different levels of filtering can be optimal, but the goal here is to provide a consensus way of filtering that works well for most datasets. Note that SNPs with low minor allele frequencies are not filtered, as this type of filtering can always be done at later stages and should be avoided for certain types of analyses (for example, it generates bias in coalescent-based analyses; McGill et al. 2013).

The filtering proceeds in three steps. First, we apply an initial set of filters that depends on whether the vcf file was produced with GATK, FreeBayes, or mpileup, and extract biallelic SNPs from each file. Second, we take the intersection of the SNPs found with the three different variant callers, assuming that these represent a set of particularly reliable SNPs. Third, we apply a hard filter on quality and read depth to the set of SNPs found with all three tools. The first two steps are implemented in the script filter_snps.sh, while the second and third steps are performed with the script intersect.sh (see below). The script filter_snps.sh uses bcftools to apply the following filters to the vcf files produced with the three variant callers (SNPs matching any of these criteria are excluded; a sketch of how such filters can be expressed with bcftools follows the list):

GATK: FS > 60.0, MQRankSum < -12.5, ReadPosRankSum < -8.0, QD < 2, RMSMappingQuality < 40 (this follows GATK's hard-filtering recommendations).
FreeBayes: No filters applied.
mpileup: RPB < 0.0001, MQB < 0.0001, BQB < 0.0001, MQSB < 0.0001.
In addition, all SNPs within 10 sites of an indel are removed from all three SNP sets, as these could be due to erroneous mapping.
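
As an illustration of how such filters can be expressed with bcftools (not necessarily the exact commands in filter_snps.sh; the output file name is a placeholder), the mpileup call set could be filtered like this:

# sketch: exclude SNPs near indels and sites failing the mpileup filters, keep biallelic SNPs (illustration only)
bcftools filter --SnpGap 10 -e 'RPB<0.0001 || MQB<0.0001 || BQB<0.0001 || MQSB<0.0001' mpileup.vcf.gz | \
    bcftools view -m2 -M2 -v snps -O z -o mpileup.filtered.vcf.gz -
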

The wrapper script filter_snps.start automatically starts filter_snps.sh three times, with the three different settings and the respective input files:

$ filter_snps.start NOTUR_ACCOUNT DATA_DIR


Summary

In theory, running this variant calling pipeline to obtain gatk.vcf.gz, freebayes.vcf.gz, and mpileup.vcf.gz should be as easy as this:

$ module load variant_calling
$ prepare.start NOTUR_ACCOUNT REFERENCE
# lean back and wait until prepare.sh has finished (minutes)

$ map.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# have long lunch and wait until map.sh has finished (hours)

$ merge.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# have another long lunch and wait until merge.sh has finished (hours)

$ gatk1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ freebayes1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ mpileup1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# go home and come back when at least gatk1.sh and freebayes1.sh have finished (hours/days)

$ gatk2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ freebayes2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# go on vacation until both gatk2.sh and freebayes2.sh have finished (days)

$ freebayes3.start NOTUR_ACCOUNT DATA_DIR
# should be done within minutes

$ filter_snps.start NOTUR_ACCOUNT DATA_DIR
# again, minutes

Even more theoretically, it should be possible to write a script that does all that for you:

$ variant_calling.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# go on long vacations...

but unfortunately I haven't figured this out yet (before going on vacation).

In practice, it might need some tweaking. Let me know if you need help!