Check insert size distribution new sequencing library
How to quickly check the insert size distribution of a new sequencing library
After you get the reads, be they paired end or mate pairs, from the sequencing centre, you sometimes want to chek thr distribution of the insert sizes. There are probably several ways to do this, but here is what we tend to use.
Needed:
- either the enitre read dataset, or a subset of it (see below)
- a reference genome, or assembly, with at least a certain fraction of contigs/scaffolds large than the expected insert size
All programs are avaliable from the common bin folder unless otherwise indicated. We'll be using
- seqtk (optional)
- bwa
- samtools
- your favorite plotting program
Step 1: - optional: creating subsets of the data
For example, to create files with 1 million randomly selected reads - still paired up
seqtk sample ../read1.fastq.gz 1000000 >read1_1Mreads.fastq seqtk sample ../read2.fastq.gz 1000000 >read2_1Mreads.fastq
To take 10% of the reads instead:
seqtk sample ../read1.fastq.gz 0.1 >read1_10%reads.fastq seqtk sample ../read2.fastq.gz 0.1 >read2_10%reads.fastq
NOTE seqtk will happily accept gzipped fastq files
Step 2: indexing your reference
I prefer to make a new folder close to where the reference file is
mkdir bwa_index
Create a symlink (or 'soft link', see http://kb.iu.edu/data/abbe.html) to the reference (optional, but this makes the next commands more readable)
ln -s path/to/reference.fasta . # note the period '.' at the end!
Create the index:
bwa-0.6.2 index -a bwtsw reference.fasta >bwa-index.out 2>&1
(all messages from the program will go to the file bwa_index.out)
Step 3: aligning the reads
Make a folder for the results and cd into it
Make symlinks to the read files and reference index (optional, but this makes the next commands more readable)
ln -s path/to/read1_1Mreads.fastq . ln -s path/to/read2_1Mreads.fastq . ln -s path/to/bwa_index .
Align reads
bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read1_1Mreads.fastq >read1_1Mreads.sai 2> read1_1Mreads.bwa.err bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read2_1Mreads.fastq >read2_1Mreads.sai 2> read2_1Mreads.bwa.err
NOTE older Illumina data may require you to add the '-I' flag, check the documentation
Generate sam file:
bwa-0.6.2 sampe bwa_index/reference.fasta \ read1_1Mreads.sai \ read2_1Mreads.sai \ read1_1Mreads.fastq \ read2_1Mreads.fastq \ 1>read1+2_1Mreads.sam 2> read1+2_1Mreads.bwa.err
Some tools need a sorted, compressed bam file. You can generate this using
module load samtools cat read1+2_1Mreads.sam | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted
The resulting file will be called read1+2_1Mreads.sorted.bam.
Alternatively, you can do it like this:
bwa-0.6.2 sampe bwa_index/reference.fasta \ read1_1Mreads.sai \ read2_1Mreads.sai \ read1_1Mreads.fastq \ read2_1Mreads.fastq \ 2> read1+2_1Mreads.bwa.err | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted
Now you could for example get some basic metrics on the mapping by running
samtools flagstat 130301_R1+2_001_10Mreads.sorted.bam
Step 4: plotting
To get the insert size distribution, extract the following from the sam file:
- the lines describing the mapped reads (these start with the name of the read)
- for these lines, the value of column 9, the insert size as determined by bwa, if it is positive and larger than 0