Variant calling on Abel
- 1 Intro
- 2 Preparation
- 3 Reference genome indexing: prepare.sh
- 4 Mapping with bwa: map.sh
- 5 Merging per-library bam files: merge.sh
- 6 Variant calling with GATK: gatk1.sh and gatk2.sh
- 7 Variant calling with FreeBayes: freebayes1.sh-freebayes3.sh
- 8 Variant calling with Samtools' mpileup: mpileup1.sh
- 9 Filtering SNPs from the three vcf files: filter_snps.sh
- 10 Taking the intersection of the three vcf files: intersect_snps.sh
- 11 Summary
This site describes a number of shell scripts that we have developed to run a variant calling pipeline for population-level resequencing data, using GATK's HaplotypeCaller, FreeBayes, and Samtools' mpileup. The pipeline is tailored to Abel in the sense that it submits a large number of Slurm jobs in parallel in order to make efficient use of Abel's parallel computing power. The aim is to be able to start the entire pipeline with just a handful of commands, but this is still work in progress. Here's what we've got so far.
In order to allow the greatest possible degree of automation, the pipeline expects standardized names for input sequence files according to the format SAMPLE_LIBRARY_REP.fastq.gz, where 'SAMPLE' is an identifier for the sampled individual, 'LIBRARY' is an identifier for the DNA library extracted from that individual, and 'REP' is an identifier for the mate pair replicate. These three identifiers must be separated by underscores, and there must be no additional underscores in the file name (underscores in the path are fine). All fastq.gz files should sit in the same directory somewhere on /work/users/. For example, /work/users/michaelm/aqua_genome/Working/analysis/data/ looks like this:
$ ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1856424663 Jul 30 13:47 L01Y007_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1837673868 Jul 30 13:47 L01Y007_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1839282994 Jul 30 13:47 L01Y007_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1278239970 Jul 30 13:47 L01Y009_L001_R2.fastq.gz
-rwx------ 1 michaelm users 1266669523 Jul 30 13:47 L01Y009_L002_R1.fastq.gz
-rwx------ 1 michaelm users 1267661234 Jul 30 13:47 L01Y009_L002_R2.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397796506 Jul 30 13:47 L01Y013_L001_R2.fastq.gz
...
In this case, two libraries were prepared for each sample, and mate pairs were sequenced. If you have only a single library per sample, or only a single sequence file per library, you should still name your files according to the SAMPLE_LIBRARY_REP.fastq.gz format, but you can simply use the same identifier for all 'REP's, or for each 'LIBRARY', like this:
$ ls -l /work/users/michaelm/aqua_genome/Working/analysis/data/
-rwx------ 1 michaelm users 1855831335 Jul 30 13:47 L01Y007_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1277786926 Jul 30 13:47 L01Y009_L001_R1.fastq.gz
-rwx------ 1 michaelm users 1397131138 Jul 30 13:47 L01Y013_L001_R1.fastq.gz
...
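If you are unsure whether your file names conform to this convention, a quick check like the following can help (a hypothetical snippet, not part of the pipeline itself; it simply counts the underscores in each base name):

```shell
# Warn about any fastq.gz file whose name does not contain exactly the two
# underscores of the SAMPLE_LIBRARY_REP.fastq.gz format.
for f in /work/users/michaelm/aqua_genome/Working/analysis/data/*.fastq.gz; do
  base=$(basename "$f" .fastq.gz)
  n=$(printf '%s' "$base" | tr -cd '_' | wc -c)   # count underscores
  if [ "$n" -ne 2 ]; then
    echo "WARNING: $f does not match SAMPLE_LIBRARY_REP.fastq.gz"
  fi
done
```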
When the pipeline is run, all fastq.gz files in the specified directory will be analysed, so make sure to remove any fastq.gz files that should not be analysed from the directory. Of course, we also need a reference genome fasta file, which may sit somewhere else. The collection of pipeline scripts sits in /projects/cees/scripts/variant_calling/:
$ ls -l /projects/cees/scripts/variant_calling/
-rw-r--r-- 1 michaelm seq454  4048 Aug  2 19:28 freebayes1.sh
-rwxr-xr-x 1 michaelm seq454   960 Aug  1 17:28 freebayes1.start
-rw-r--r-- 1 michaelm seq454  4513 Aug  2 23:44 freebayes2.sh
-rwxr-xr-x 1 michaelm seq454  1344 Aug  2 20:51 freebayes2.start
-rw-r--r-- 1 michaelm seq454  2478 Aug  2 23:57 freebayes3.sh
-rwxr-xr-x 1 michaelm seq454   730 Aug  1 19:18 freebayes3.start
-rw-r--r-- 1 michaelm seq454  2681 Aug 26 18:57 gatk1.sh
-rwxr-xr-x 1 michaelm seq454  1141 Aug  3 00:10 gatk1.start
-rw-r--r-- 1 michaelm seq454  3077 Aug 26 18:58 gatk2.sh
-rwxr-xr-x 1 michaelm seq454   937 Aug  1 19:38 gatk2.start
-rw-r--r-- 1 michaelm seq454 10594 Aug 25 15:51 map.sh
-rwxr-xr-x 1 michaelm seq454  1792 Aug  3 00:10 map.start
-rw-r--r-- 1 michaelm seq454  9275 Aug 26 11:33 merge.sh
-rwxr-xr-x 1 michaelm seq454  2311 Aug  3 00:10 merge.start
...
The easiest way to use these scripts is by loading the variant_calling module:
$ module load variant_calling
Reference genome indexing: prepare.sh
The reference genome will need to be indexed for mapping with bwa, for picard-tools, and for GATK. All of this is done automatically by the script prepare.sh and its wrapper script prepare.start. If you've loaded the variant_calling module as described above, you can start prepare.sh by typing
$ prepare.start NOTUR_ACCOUNT REFERENCE
where NOTUR_ACCOUNT should be replaced by the Notur account number to which Abel should bill the needed CPU hours. If you're not sure which Notur account you're on, just type 'projects'. Also, REFERENCE should be replaced by the reference genome file in fasta format (something like ../../../Reference/GenomeScaffolds_with_remaining_large_contigs.fasta). Index files will be written to the same directory as the reference genome fasta file, and script output will be recorded in file prepare.out, which will be written to the directory where you started the script. The output of the programs started by prepare.sh will be written to file prepare.log, in the same directory as prepare.out.
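Internally, the indexing amounts to something like the following (a sketch, not the actual prepare.sh; module loading and the picard jar path on Abel will differ):

```shell
# Index for bwa mem mapping:
bwa index reference.fasta
# .fai index used by samtools and GATK:
samtools faidx reference.fasta
# Sequence dictionary (.dict) required by picard-tools and GATK:
java -jar picard.jar CreateSequenceDictionary \
  R=reference.fasta O=reference.dict
```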
Mapping with bwa: map.sh
Once the reference is fully indexed, the next step is to map the reads of each library against the reference genome. This is done by script map.sh, which uses bwa mem for mapping, and is started in parallel for each library with map.start. If mate pair data is available for some or all libraries, these will be used jointly by bwa mem. You can then start parallel mapping by typing
$ map.start NOTUR_ACCOUNT REFERENCE DATA_DIR
Here, the first two arguments, NOTUR_ACCOUNT and REFERENCE, are used in the same way as for prepare.start (see above), but a third argument is needed: DATA_DIR should be replaced by the name of the directory that holds all fastq.gz files to be analysed. Depending on how many fastq.gz files you have in your data directory, this script may trigger a (very) large number of parallel runs of map.sh at the same time. If things get out of control, you can always monitor your current Abel runs by typing
$ squeue -u USERNAME
where USERNAME should be replaced with your user name. If you need to cancel current runs, you can do that with
$ scancel RUN_ID
Subsequent to mapping with bwa mem, this script also performs coordinate sorting with samtools, duplicate marking with picard-tools, indel realignment with GATK, and read group addition based on the standardized input file names, once again using picard-tools. It writes files in the format SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam (plus the corresponding index files SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam.bai) to the specified data directory, and run report files are saved as map.SAMPLE_LIBRARY.out and map.SAMPLE_LIBRARY.log in the directory where you started the run.
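Because the file names are standardized, the read-group fields can be derived directly from them. A sketch of how this could be done with shell parameter expansion (illustrative only; map.sh may implement it differently):

```shell
file="L01Y007_L001_R1.fastq.gz"   # example input file name
base=${file%.fastq.gz}            # strip the extension -> L01Y007_L001_R1
sample=${base%%_*}                # SAMPLE: text before the first underscore
rest=${base#*_}                   # -> L001_R1
library=${rest%%_*}               # LIBRARY: the second field
echo "RGSM=${sample} RGLB=${library} RGID=${sample}_${library}"
# prints: RGSM=L01Y007 RGLB=L001 RGID=L01Y007_L001
```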
Merging per-library bam files: merge.sh
This step is only strictly needed if you have data for more than one library per individual; however, you should still run it even if you only have a single library. The script merge.sh combines multiple libraries per individual with picard-tools, and then does a new round of samtools sorting, picard-tools deduplication, and GATK indel realignment on the merged file. Once again, merge.sh can be started in parallel for all samples, using the wrapper script merge.start. The output files will be named SAMPLE.merged.sorted.dedup.realn.bam and saved to the data directory (together with index files named SAMPLE.merged.sorted.dedup.realn.bam.bai). The reason why you should still run merge.start even with only a single library is that downstream scripts within this pipeline expect files named SAMPLE.merged.sorted.dedup.realn.bam in that directory. To handle this, merge.start detects whether multiple libraries are present for each individual, and where that is not the case, it simply copies SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam to SAMPLE.merged.sorted.dedup.realn.bam (and SAMPLE_LIBRARY.sorted.dedup.realn.rgadd.bam.bai to SAMPLE.merged.sorted.dedup.realn.bam.bai), without submitting any jobs to the Abel queue. Just like map.start, you start merge.start with
$ merge.start NOTUR_ACCOUNT REFERENCE DATA_DIR
Run report files will be named merge.SAMPLE.out and merge.SAMPLE.log, and saved to the directory in which you started the script.
Variant calling with GATK: gatk1.sh and gatk2.sh
Since GATK version 3, variant calling with GATK's HaplotypeCaller is a two-step process: single-sample calling in so-called gVCF mode, followed by joint genotyping. In order to exploit the parallel processing capacity of Abel as much as we can, GATK variant calling is therefore split into two scripts: gatk1.sh for individual calling, which can be run in parallel for all merged bam files, and gatk2.sh for the joint analysis, which cannot be parallelized. Again, gatk1.sh is started via the wrapper script gatk1.start, just like before:
$ gatk1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
This script produces a vcf file for each input bam file, named like the input bam file but with the ending 'vcf' instead of 'bam': SAMPLE.merged.sorted.dedup.realn.vcf. These files will be written to the data directory, and run reports named gatk1.SAMPLE.out and gatk1.SAMPLE.log will be written to the directory where you started gatk1.start.
The second step of GATK variant calling runs with gatk2.sh, which as usual is started via its wrapper script gatk2.start. After all gatk1.sh runs have finished, type
$ gatk2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
This produces a file called gatk.vcf.gz and its index file gatk.vcf.gz.csi, which will both be written to the data directory. This is the unfiltered variant call result, which can then be used for filtering and extraction of either SNPs or indels. Run report files named gatk2.out and gatk2.log will be written to the directory where you started the run.
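The underlying GATK 3 calls look roughly like this (a sketch only; the actual scripts set additional Abel-specific options, paths, and memory settings):

```shell
# Step 1 (gatk1.sh), run once per sample and parallelizable across samples:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
  -R reference.fasta -I SAMPLE.merged.sorted.dedup.realn.bam \
  --emitRefConfidence GVCF -o SAMPLE.merged.sorted.dedup.realn.vcf
# Step 2 (gatk2.sh), run once for all samples jointly (one -V per sample):
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
  -R reference.fasta \
  -V SAMPLE1.merged.sorted.dedup.realn.vcf \
  -V SAMPLE2.merged.sorted.dedup.realn.vcf \
  -o gatk.vcf
```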
Variant calling with FreeBayes: freebayes1.sh-freebayes3.sh
FreeBayes is a bit trickier to use than GATK, since the program does not include a per-individual step that can be parallelized on Abel; run the standard way with many individuals, it can therefore exceed Abel's one-week run time limit. The solution here is to first split the reference genome into regions that can be analysed independently and in parallel by FreeBayes. This requires some logistics, though. First, freebayes1.sh and freebayes1.start produce a header for the final variant call file. Next, FreeBayes is run in parallel for each region in order to save time, which is done with freebayes2.sh and freebayes2.start. Finally, the vcf result files for the individual regions are combined with the joint vcf header produced in the first step, to form the combined vcf file freebayes.vcf. This last step is done by freebayes3.sh and freebayes3.start.
So we go
$ freebayes1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
to produce a file for the vcf header. It will be named freebayes.header.vcf and appear in the data directory. Run reports named freebayes1.out and freebayes1.log will also be written to the data directory.
Once all the region files are ready, we can start freebayes2.sh by
$ freebayes2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
This starts parallel runs of freebayes2.sh for each individual region file. Behind the scenes, each individual run is multi-threaded for faster run times. Following variant calling with FreeBayes, the resulting vcf files need to be sorted, which is done with vcftools. Region-specific sorted vcf files will be written to the data directory, and run reports named freebayes2.REGION.txt and freebayes2.REGION.log will be written to where you started the run.
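The core of each parallel job is roughly the following (a sketch; REGION stands for a region string in FreeBayes' 'chrom:start-end' format, and the actual script may use different options):

```shell
# Call variants for one region only, across all merged bam files:
freebayes --fasta-reference reference.fasta --region REGION \
  *.merged.sorted.dedup.realn.bam > freebayes.REGION.vcf
# Sort the per-region result with vcftools' vcf-sort:
vcf-sort freebayes.REGION.vcf > freebayes.REGION.sorted.vcf
```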
Finally, once all per-region vcf files have been written to the data directory, we use freebayes3.sh to combine and sort them. This time the reference does not need to be specified:
$ freebayes3.start NOTUR_ACCOUNT DATA_DIR
This will write the joint but unfiltered variant calling result file freebayes.vcf.gz and the respective index file freebayes.vcf.gz.csi, again to the data directory, and run reports named freebayes3.out and freebayes3.log will be written to the directory in which the script was started.
Variant calling with Samtools' mpileup: mpileup1.sh
As a third alternative, we use Samtools' plain-vanilla variant caller, mpileup, run by mpileup1.sh, which in turn is started with mpileup1.start:
$ mpileup1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
Since mpileup has so far performed reasonably fast with the numbers of individuals tested, mpileup1.sh does not parallelize by splitting between individuals or regions, but this could be implemented if the need arises. It writes files mpileup.vcf.gz and mpileup.vcf.gz.csi to the data directory, and run reports mpileup1.out and mpileup1.log to the directory where you started the run.
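The essence of mpileup1.sh is a pipe along the following lines (a sketch; the options actually used in the script may differ):

```shell
# Pile up reads against the reference and pass genotype likelihoods to
# bcftools, which calls variants and writes compressed vcf output:
samtools mpileup -u -g -f reference.fasta *.merged.sorted.dedup.realn.bam \
  | bcftools call -m -v -O z -o mpileup.vcf.gz
# Build the .csi index:
bcftools index mpileup.vcf.gz
```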
Filtering SNPs from the three vcf files: filter_snps.sh
After GATK, FreeBayes, and Samtools' mpileup have finished, we should have three compressed vcf files, and the respective index files, in the data directory: gatk.vcf.gz and gatk.vcf.gz.csi, freebayes.vcf.gz and freebayes.vcf.gz.csi, and mpileup.vcf.gz and mpileup.vcf.gz.csi. These are pretty much unfiltered, but filtering should be done before any downstream analyses to reduce the number of erroneous variant calls. Depending on the intended analysis, different levels of filtering can be optimal, but the goal here is to produce a consensus way of filtering that works well for most datasets. Note that SNPs with low minor allele frequencies are not filtered, as this type of filtering can always be done at later stages, and should be avoided for certain types of analyses (for example, it generates bias in coalescent-based analyses; McGill et al. 2013).

Here, we first apply an initial set of filters that depend on whether the vcf file was produced with GATK, FreeBayes, or mpileup, and extract biallelic SNPs from each file. Second, we take the intersection of SNPs found with the three different variant callers, assuming that these represent a set of particularly reliable SNPs. And third, we apply a hard filter on quality and read depth to the set of SNPs found with all three tools. The first step is implemented in script filter_snps.sh, while the second and third steps are performed with script intersect_snps.sh (see below). The script filter_snps.sh uses bcftools to apply the following filters to the vcf files produced with the three variant callers (SNPs matching any of these criteria are excluded):
GATK: FS > 60.0, MQRankSum < -12.5, ReadPosRankSum < -8.0, QD < 2, RMSMappingQuality < 40 (this follows GATK's hard-filtering recommendations).
FreeBayes: No filters applied.
mpileup: RPB < 0.0001, MQB < 0.0001, BQB < 0.0001, MQSB < 0.0001.
In addition, all SNPs within 10 sites of an indel are removed from all three SNP sets, as these could be due to erroneous mapping. Indels and multi-allelic SNPs are removed as part of the script with vcftools.
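For the GATK file, the combination of these steps amounts to something like the following (a sketch, not the actual filter_snps.sh; the intermediate file name gatk.tmp.vcf.gz is hypothetical, and MQ is the INFO field that GATK calls RMSMappingQuality). Analogous expressions apply to the mpileup file:

```shell
# Remove SNPs matching the exclusion criteria, plus any SNP within 10 bp
# of an indel (--SnpGap):
bcftools filter --SnpGap 10 \
  -e 'FS>60.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0 || QD<2 || MQ<40' \
  -O z -o gatk.tmp.vcf.gz gatk.vcf.gz
# Keep only biallelic SNPs (drop indels and multi-allelic sites) with
# vcftools, then recompress and index:
vcftools --gzvcf gatk.tmp.vcf.gz --remove-indels \
  --min-alleles 2 --max-alleles 2 --recode --stdout \
  | bgzip > gatk.snps.vcf.gz
bcftools index gatk.snps.vcf.gz
```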
When the wrapper script filter_snps.start is run, the script filter_snps.sh is automatically started three times, with the three different filter settings and the respective input files. Again, only the Notur account and the data directory need to be specified:
$ filter_snps.start NOTUR_ACCOUNT DATA_DIR
filter_snps.start writes three compressed and filtered vcf output files to the data directory: gatk.snps.vcf.gz, freebayes.snps.vcf.gz, and mpileup.snps.vcf.gz. In addition, index files for each of these files are written with the same name plus ending .csi. The output will be written to filter_snps.gatk.out, filter_snps.freebayes.out, and filter_snps.mpileup.out, and detailed reports can be found in filter_snps.gatk.log, filter_snps.freebayes.log, and filter_snps.mpileup.log.
Taking the intersection of the three vcf files: intersect_snps.sh
As the three SNP callers will each find partially different sets of SNPs, the most robust way of SNP identification is to use only the overlap: those SNPs that are found by all three variant callers. We do this with bcftools' isec command, implemented in script intersect_snps.sh and started via its wrapper intersect_snps.start. In addition, this script includes a filtering step in which all sites with a quality score lower than 20 and a read depth lower than 3 or higher than 100 (the latter being characteristic of repetitive regions) are replaced by missing data (for the respective individual only). This allows further filtering based on the proportion of missing data at a later stage if necessary.
$ intersect_snps.start NOTUR_ACCOUNT DATA_DIR
This will write intersect.snps.vcf.gz (and its index intersect.snps.vcf.gz.csi), which includes only those SNPs that are found in each of the three files gatk.snps.vcf.gz, freebayes.snps.vcf.gz, and mpileup.snps.vcf.gz. The SNP calls and SNP annotation in this file will be identical to those in gatk.snps.vcf.gz. Run reports will be written to intersect_snps.out and intersect_snps.log.
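The intersection step can be sketched as follows (illustrative only; -n=3 keeps sites present in all three files, and -w1 writes the records, and thus the annotation, from the first file, gatk.snps.vcf.gz):

```shell
bcftools isec -n=3 -w1 -O z -o intersect.snps.vcf.gz \
  gatk.snps.vcf.gz freebayes.snps.vcf.gz mpileup.snps.vcf.gz
bcftools index intersect.snps.vcf.gz
# The per-individual masking of low-quality genotypes described above could
# be done with an expression along the lines of (hypothetical thresholds as
# stated in the text):
#   bcftools filter -e 'QUAL<20 | FMT/DP<3 | FMT/DP>100' -S . ...
```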
In theory, running this variant calling pipeline should be as easy as this:
$ module load variant_calling
$ prepare.start NOTUR_ACCOUNT REFERENCE
# lean back and wait until prepare.sh has finished (minutes)
$ map.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# have long lunch and wait until map.sh has finished (hours)
$ merge.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# have another long lunch and wait until merge.sh has finished (hours)
$ gatk1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ freebayes1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ mpileup1.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# go home and come back when at least gatk1.sh and freebayes1.sh have finished (hours/days)
$ gatk2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
$ freebayes2.start NOTUR_ACCOUNT REFERENCE DATA_DIR
# go on vacation until both gatk2.sh and freebayes2.sh have finished (days)
$ freebayes3.start NOTUR_ACCOUNT DATA_DIR
# should be done within minutes
$ filter_snps.start NOTUR_ACCOUNT DATA_DIR
# minutes to hours
$ intersect_snps.start NOTUR_ACCOUNT DATA_DIR
# few hours
In practice, it might need some tweaking. Let me know if you need help!