Sliding window analyses with vcf files

From mn/bio/cees-bioinf
Revision as of 13:36, 16 June 2015 by Michamat@uio.no (talk | contribs)

Jump to: navigation, search

Intro

After SNP calling, the next step will often be an analysis of genetic variation and population differentiation across chromosomes/linkage groups. We have a script developed that is supposed to allow this directly with vcf files, and the use of this script (called run_sliding_windows_over_vcf.py) is described below.

Preparation

The script is written in python3, and it requires a few python packages to be installed in order to run. First of all, load python3 with

$ module load python3

Next, the python packages PyVCF, Biopython may need to be installed on user level, if you didn't do so previously. After you have logged in to Abel or one of the cod nodes, run

$ pip3 install --user PyVCF
$ pip3 install --user Biopython

and these packages should automatically be installed (might take some time though).

Running the script

run_sliding_windows_over_vcf.py is written so that it runs over each chromosome/linkage group separately, and the name of the chromosome/linkage group has to be specified on the command line with option '-c'. In addition, exactly two population identifiers must be specified with option '-p' so that all sample names in the vcf file are matched by either the first or the second of the specified population identifiers, but none are matched by both. For example, if samples are named 'L01S123', 'L01S134', 'L01S246', 'L02S034', and 'L02S087', and the 'L01' and 'L02' parts indicate the different population assignments, then population identifiers would be specified with '-p L01 L02'. Error messages will be generated if the script is unable to uniquely assign samples the two population identifiers. Finally, a size of non-overlapping windows used to calculate genetic variation and population differentiation can be specified with option '-l', which has a default value of 1000 bp. Thus, in order to analyze variation and divergence in windows of 100000 bp on chromosome 1, with populations 'L01' and 'L02', one would run the script with

python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -l 100000 -p L01 L02

where 'infile.vcf' is the name of the vcf input file name.

Output

This will produce a table which lists for each sliding window the following parameters:

  • The central position of the sliding window (window_center).
  • The number of SNPs found within this sliding window (num_snps).
  • The GC content, calculated only for SNP positions (gc_content).
  • The proportion of sites within the window that is variable in the first population (prop_var1).
  • The proportion of sites within the window that is variable in the second population (prop_var2).
  • The proportion of sites within the window that is variable overall, this should be equal to the number of SNPs divided by the window size (prop_var1+2).
  • The proportion of heterozyguous alleles found at SNP positions, for population 1 (prop_het1).
  • The proportion of heterozyguous alleles found at SNP positions, for population 2 (prop_het2).
  • The proportion of heterozyguous alleles found at SNP positions, for both populations combined (prop_het1+2).
  • The nucleotide variation measured as π (pi). For each individual SNP, π is equivalent to the probability that two randomly chosen alleles from the sample are different at this site. See Ruegg et al. (2014) for details. The value reported here represents the mean of π across all SNPs within the window.
  • The Fst value calculated according to Weir & Cockerham (1984) (f_st).
  • The between population sequence divergence (d_xy). Calculation follows Ruegg et al. (2014) and Cruickshank & Hahn (2014) explain why this will often be a more informative measure of between-population divergence than the Fst.
  • The proportion of fixed differences between the two populations (d_f). Again, see Ruegg et al. (2014) for details.

This output can then be used to plot variation and divergence over the chromosome/linkage group, for example with R. To do this, it may be more convenient to pipe the output into a separate file, e.g. with

python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -l 100000 -p L01 L02 > output.txt