Check insert size distribution new sequencing library

From mn/bio/cees-bioinf
Jump to: navigation, search

 How to quickly check the insert size distribution of a new sequencing library

After you get the reads, be they paired end or mate pairs, from the sequencing centre, you sometimes want to chek thr distribution of the insert sizes. There are probably several ways to do this, but here is what we tend to use.

- either the enitre read dataset, or a subset of it (see below)
- a reference genome, or assembly, with at least a certain fraction of contigs/scaffolds large than the expected insert size

All programs are avaliable from the common bin folder unless otherwise indicated. We'll be using
- seqtk (optional)
- bwa
- samtools
- your favorite plotting program

Step 1: - optional: creating subsets of the data
For example, to create files with 1 million randomly selected reads - still paired up

seqtk sample ../read1.fastq.gz 1000000 >read1_1Mreads.fastq
seqtk sample ../read2.fastq.gz 1000000 >read2_1Mreads.fastq

To take 10% of the reads instead:

seqtk sample ../read1.fastq.gz 0.1 >read1_10%reads.fastq
seqtk sample ../read2.fastq.gz 0.1 >read2_10%reads.fastq

NOTE seqtk will happily accept gzipped fastq files

Step 2: indexing your reference
I prefer to make a new folder close to where the reference file is

mkdir bwa_index

Create a symlink (or 'soft link', see to the reference (optional, but this makes the next commands more readable)

ln -s path/to/reference.fasta . # note the period '.' at the end!

Create the index:

bwa-0.6.2 index -a bwtsw reference.fasta >bwa-index.out 2>&1

(all messages from the program will go to the file bwa_index.out)

Step 3: aligning the reads
Make a folder for the results and cd into it
Make symlinks to the read files and reference index (optional, but this makes the next commands more readable)

ln -s path/to/read1_1Mreads.fastq .
ln -s path/to/read2_1Mreads.fastq .
ln -s path/to/bwa_index .

Align reads

bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read1_1Mreads.fastq >read1_1Mreads.sai 2> read1_1Mreads.bwa.err
bwa-0.6.2 aln -t 24 bwa_index/reference.fasta read2_1Mreads.fastq >read2_1Mreads.sai 2> read2_1Mreads.bwa.err

NOTE older Illumina data may require you to add the '-I' flag, check the documentation

Generate sam file:

bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
1>read1+2_1Mreads.sam 2> read1+2_1Mreads.bwa.err

Some tools need a sorted, compressed bam file. You can generate this using

module load samtools
cat read1+2_1Mreads.sam | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted

The resulting file will be called read1+2_1Mreads.sorted.bam.

Alternatively, you can do it like this:

bwa-0.6.2 sampe bwa_index/reference.fasta \
read1_1Mreads.sai \
read2_1Mreads.sai \
read1_1Mreads.fastq \
read2_1Mreads.fastq \
2> read1+2_1Mreads.bwa.err | samtools view -uS - | samtools sort - read1+2_1Mreads.sorted

Now you could for example get some basic metrics on the mapping by running

samtools flagstat 130301_R1+2_001_10Mreads.sorted.bam

Step 4: plotting
To get the insert size distribution, extract the following from the sam file:
- the lines describing the mapped reads (these start with the name of the read)
- for these lines, the value of column 9, the insert size as determined by bwa, if it is positive and larger than 0