Difference between revisions of "Sliding window analyses with vcf files"

From mn/bio/cees-bioinf
Jump to: navigation, search
(Output)
 
(6 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
The script is written in python3, and it requires a few python packages to be installed in order to run. First of all, load python3 with
 
The script is written in python3, and it requires a few python packages to be installed in order to run. First of all, load python3 with
 
<pre>$ module load python3</pre>
 
<pre>$ module load python3</pre>
 
 
Next, the python packages PyVCF, Biopython may need to be installed on user level, if you didn't do so previously. After you have logged in to Abel or one of the cod nodes, run
 
Next, the python packages PyVCF, Biopython may need to be installed on user level, if you didn't do so previously. After you have logged in to Abel or one of the cod nodes, run
<pre>
+
<pre>$ pip3 install --user PyVCF
$ pip3 install --user PyVCF
 
 
$ pip3 install --user Biopython
 
$ pip3 install --user Biopython
 
</pre>
 
</pre>
and these packages should automatically be installed (might take some time though).
+
and these packages should automatically be installed (might take 10 minutes or so).
  
 
== Running the script ==
 
== Running the script ==
  
run_sliding_windows_over_vcf.py is written so that it runs over each chromosome/linkage group separately, and the name of the chromosome/linkage group has to be specified on the command line with option '-c'. In addition, exactly two population identifiers must be specified with option '-p' so that all sample names in the vcf file are matched by either the first or the second of the specified population identifiers, but none are matched by both. For example, if samples are named 'L01S123', 'L01S134', 'L01S246', 'L02S034', and 'L02S087', and the 'L01' and 'L02' parts indicate the different population assignments, then population identifiers would be specified with '-p L01 L02'. Error messages will be generated if the script is unable to uniquely assign samples the two population identifiers. Finally, a size of non-overlapping windows used to calculate genetic variation and population differentiation can be specified with option '-l', which has a default value of 1000 bp. Thus, in order to analyze variation and divergence in windows of 100000 bp on chromosome 1, with populations 'L01' and 'L02', one would run the script with
+
run_sliding_windows_over_vcf.py is written so that it runs over each chromosome/linkage group separately, and the name of the chromosome/linkage group has to be specified on the command line with option '-c'. In addition, exactly two population identifiers must be specified with option '-p' so that all sample names in the vcf file are matched by either the first or the second of the specified population identifiers, but none are matched by both. For example, if samples are named 'L01S123', 'L01S134', 'L01S246', 'L02S034', and 'L02S087', and the 'L01' and 'L02' parts indicate the different population assignments, then population identifiers would be specified with '-p L01 L02'. Error messages will be generated if the script is unable to uniquely assign samples the two population identifiers. Finally, a size of non-overlapping windows used to calculate genetic variation and population differentiation can be specified with option '-w', which has a default value of 1000 bp. Thus, in order to analyze variation and divergence in windows of 100000 bp on chromosome 1, with populations 'L01' and 'L02', one would run the script with
<pre>python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -l 100000 -p L01 L02</pre>
+
<pre>python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -w 100000 -p L01 L02</pre>
 
where 'infile.vcf' is the name of the vcf input file name.
 
where 'infile.vcf' is the name of the vcf input file name.
  
Line 35: Line 33:
 
*The proportion of heterozyguous alleles found at SNP positions, for both populations combined (prop_het1+2).
 
*The proportion of heterozyguous alleles found at SNP positions, for both populations combined (prop_het1+2).
 
*The nucleotide variation measured as π (pi). For each individual SNP, π is equivalent to the probability that two randomly chosen alleles from the sample are different at this site. See [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] for details. The value reported here represents the mean of π across all SNPs within the window.
 
*The nucleotide variation measured as π (pi). For each individual SNP, π is equivalent to the probability that two randomly chosen alleles from the sample are different at this site. See [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] for details. The value reported here represents the mean of π across all SNPs within the window.
*The Fst value calculated according to [http://www.jstor.org/stable/2408641?seq=1#page_scan_tab_contents Weir & Cockerham (1984)] (f_st).
+
*The F<sub>st</sub> value calculated according to [http://www.jstor.org/stable/2408641?seq=1#page_scan_tab_contents Weir & Cockerham (1984)] (f_st).
*The between population sequence divergence (d_xy). Calculation follows [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] and [http://onlinelibrary.wiley.com/doi/10.1111/mec.12796/abstract Cruickshank & Hahn (2014)] explain why this will often be a more informative measure of between-population divergence than the Fst.
+
*The between population sequence divergence (d<sub>xy</sub>). Calculation follows [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] and [http://onlinelibrary.wiley.com/doi/10.1111/mec.12796/abstract Cruickshank & Hahn (2014)] explain why the d<sub>xy</sub> will often be a more informative measure of between-population divergence than the F<sub>st</sub>. Two different versions of d<sub>xy</sub> are reported, d<sub>xy</sub>* and d<sub>xy</sub>**. Of these, d<sub>xy</sub>* uses the window size as 'contig length' Lk (see [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. 2014]). Thus d<sub>xy</sub>* takes into account the number of invariant sites, and will vary depending on filtering options. In contrast, d<sub>xy</sub>** ignores all positions that are not bi-allelic SNPs. It should be more robust to vcf filtering and is equivalent to the mean contribution of a SNP to d<sub>xy</sub>* measure (above).
 
*The proportion of fixed differences between the two populations (d_f). Again, see [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] for details.
 
*The proportion of fixed differences between the two populations (d_f). Again, see [http://onlinelibrary.wiley.com/doi/10.1111/mec.12842/abstract Ruegg et al. (2014)] for details.
  
 
This output can then be used to plot variation and divergence over the chromosome/linkage group, for example with R. To do this, it may be more convenient to pipe the output into a separate file, e.g. with
 
This output can then be used to plot variation and divergence over the chromosome/linkage group, for example with R. To do this, it may be more convenient to pipe the output into a separate file, e.g. with
<pre>python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -l 100000 -p L01 L02 > output.txt</pre>
+
<pre>python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -w 100000 -p L01 L02 > output.txt</pre>

Latest revision as of 11:52, 17 December 2015

Intro

After SNP calling, the next step will often be an analysis of genetic variation and population differentiation across chromosomes/linkage groups. We have a script developed that is supposed to allow this directly with vcf files, and the use of this script (called run_sliding_windows_over_vcf.py) is described below.

Preparation

The script is written in python3, and it requires a few python packages to be installed in order to run. First of all, load python3 with

$ module load python3

Next, the python packages PyVCF, Biopython may need to be installed on user level, if you didn't do so previously. After you have logged in to Abel or one of the cod nodes, run

$ pip3 install --user PyVCF
$ pip3 install --user Biopython

and these packages should automatically be installed (might take 10 minutes or so).

Running the script

run_sliding_windows_over_vcf.py is written so that it runs over each chromosome/linkage group separately, and the name of the chromosome/linkage group has to be specified on the command line with option '-c'. In addition, exactly two population identifiers must be specified with option '-p' so that all sample names in the vcf file are matched by either the first or the second of the specified population identifiers, but none are matched by both. For example, if samples are named 'L01S123', 'L01S134', 'L01S246', 'L02S034', and 'L02S087', and the 'L01' and 'L02' parts indicate the different population assignments, then population identifiers would be specified with '-p L01 L02'. Error messages will be generated if the script is unable to uniquely assign samples the two population identifiers. Finally, a size of non-overlapping windows used to calculate genetic variation and population differentiation can be specified with option '-w', which has a default value of 1000 bp. Thus, in order to analyze variation and divergence in windows of 100000 bp on chromosome 1, with populations 'L01' and 'L02', one would run the script with

python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -w 100000 -p L01 L02

where 'infile.vcf' is the name of the vcf input file name.

Output

This will produce a table which lists for each sliding window the following parameters:

  • The central position of the sliding window (window_center).
  • The number of SNPs found within this sliding window (num_snps).
  • The GC content, calculated only for SNP positions (gc_content).
  • The proportion of sites within the window that is variable in the first population (prop_var1).
  • The proportion of sites within the window that is variable in the second population (prop_var2).
  • The proportion of sites within the window that is variable overall, this should be equal to the number of SNPs divided by the window size (prop_var1+2).
  • The proportion of heterozyguous alleles found at SNP positions, for population 1 (prop_het1).
  • The proportion of heterozyguous alleles found at SNP positions, for population 2 (prop_het2).
  • The proportion of heterozyguous alleles found at SNP positions, for both populations combined (prop_het1+2).
  • The nucleotide variation measured as π (pi). For each individual SNP, π is equivalent to the probability that two randomly chosen alleles from the sample are different at this site. See Ruegg et al. (2014) for details. The value reported here represents the mean of π across all SNPs within the window.
  • The Fst value calculated according to Weir & Cockerham (1984) (f_st).
  • The between population sequence divergence (dxy). Calculation follows Ruegg et al. (2014) and Cruickshank & Hahn (2014) explain why the dxy will often be a more informative measure of between-population divergence than the Fst. Two different versions of dxy are reported, dxy* and dxy**. Of these, dxy* uses the window size as 'contig length' Lk (see Ruegg et al. 2014). Thus dxy* takes into account the number of invariant sites, and will vary depending on filtering options. In contrast, dxy** ignores all positions that are not bi-allelic SNPs. It should be more robust to vcf filtering and is equivalent to the mean contribution of a SNP to dxy* measure (above).
  • The proportion of fixed differences between the two populations (d_f). Again, see Ruegg et al. (2014) for details.

This output can then be used to plot variation and divergence over the chromosome/linkage group, for example with R. To do this, it may be more convenient to pipe the output into a separate file, e.g. with

python3 /projects/cees/scripts/run_sliding_windows_over_vcf.py infile.vcf -c 1 -w 100000 -p L01 L02 > output.txt