BLAST for dummies

From mn/ibv/bioinfwiki
Revision as of 11:27, 23 February 2015

Sequence similarity searches: queries and hits

The BLAST algorithm is more or less the standard way of performing sequence similarity searches. By ‘sequences’ we mean biological (nucleotide or amino acid) sequences. There are many reasons to perform such searches. Typically, the user has one (or many) unknown sequences and wants to understand what these sequences are or what they do. In the terminology used by BLAST, these are the query sequences. A sequence search will (hopefully) identify sequences that are similar (or even identical) to the queries. The identified sequences are often called the hit sequences (or just hits). Typically, much more is known about the hits than about the query. For instance, we may know that a specific hit is an enzyme. If the match between the query and the hit is sufficiently good, we may conclude that the query sequence is also an enzyme (but not necessarily with exactly the same specificity!). Sometimes we also perform BLAST searches with queries whose identity is already known. In keeping with the previous example, we may use the sequence of a well-known enzyme as a query sequence. In that case the hit sequences do not help us identify the nature of the query sequence, but they may tell us something about the distribution of this particular protein in other organisms (provided this information is included in the hit descriptions).

The BLAST database

From the above it is clear that, in order to provide information about a given query, BLAST needs a collection of sequences that the query is compared to. Such a collection of sequences is known as the database. When executing, BLAST will compare the query sequence to every single sequence in the database. If a similarity is detected, BLAST will output this sequence as a hit. Both the query and the database sequences must be supplied in FASTA format, i.e. each sequence has a header line starting with the “>” character, followed by the actual sequence on the following lines. The database will often consist of one FASTA file containing very many separate sequences (command-line BLAST additionally requires this file to be indexed, for example with the makeblastdb program in BLAST+). Such BLAST databases can be created by the user, but often previously created databases are used.
Sometimes, the user wants to find hit sequences that are 100% identical to the query. Such a search is obviously easy to accomplish. Finding matches that are similar (but not identical) is a much more difficult task. BLAST is (within certain limits) able to do this. But this also implies that not all hits for a given query are equal; some hits will be better than others. In fact, some hits may display so little similarity with the query that we should disregard them altogether.
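
The FASTA layout described above (a “>” header line followed by one or more sequence lines) is simple enough to read with a few lines of Python. The following is only a minimal sketch: the function name read_fasta and the two example records are invented for illustration, and real projects would normally use an established parser instead.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:]  # drop the ">" character
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Two made-up records, used only to show the format.
example = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query2 another example
ACGTACGT
""".splitlines()

records = read_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

Each header becomes a key and the sequence lines belonging to it are joined into one string, which is the form in which a query or a database entry is compared.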

Aligning query and hit sequences

How does BLAST identify similarity between sequences? BLAST tries to create an alignment between the query and a given database sequence. To start with, a short match (a “word”) must be found between the query and the database sequence; for nucleotide searches this is a stretch of 100% identical characters. If such a match is found, the alignment is extended in both directions. Matching characters are awarded points and mismatching characters cost points; as long as the sum of these points keeps increasing, the extension continues. If the sum drops too far below the best value reached so far, the extension is stopped, and the alignment is reported as a hit if its score is good enough.
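
The seed-and-extend idea described above can be sketched in Python as follows. This is a strongly simplified, ungapped illustration and not the actual BLAST implementation: the function names, the word size of 4, the match/mismatch points (+1/-2) and the drop-off limit x_drop are all assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in direction `step`; return (best gain, length kept)."""
    score = best = 0
    length = best_length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break  # fell too far below the best score seen: stop extending
        i += step
        j += step
    return best, best_length


def extend_seed(query, subject, i, j, word_size, match=1, mismatch=-2, x_drop=3):
    """Extend an exact seed in both directions without gaps.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    right, right_len = extend_one_way(
        query, subject, i + word_size, j + word_size, 1, match, mismatch, x_drop)
    left, left_len = extend_one_way(
        query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = word_size * match + left + right
    return score, i - left_len, i + word_size + right_len, j - left_len


# Made-up sequences for illustration.
query = "AAGGCTTACG"
subject = "CCGGCTTATT"
seeds = find_seeds(query, subject, word_size=4)
score, q_start, q_end, s_start = extend_seed(query, subject, *seeds[0], word_size=4)
print(seeds[0], score, query[q_start:q_end])
```

The extension keeps only the part of the walk up to the best score reached, so trailing mismatches are trimmed from the reported alignment.
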
It is not necessary to understand the exact mechanism behind this algorithm. But it is clear that BLAST needs to be instructed about the precise manner of scoring an alignment (i.e. awarding points for matching characters). For nucleotide sequences, this is accomplished in a very simple manner: only matching characters increase the score, and mismatches are penalized. For amino acid sequences, this becomes a bit more complicated. Some amino acids are so dissimilar that pairing them is awarded no points (or indeed a negative score). But some amino acids are quite similar to each other, such as leucine and isoleucine. Such pairs get a positive score, although a lower one than identical amino acids. The precise scores for every possible amino acid pair are defined in so-called matrix files. The standard matrix for protein searches is called BLOSUM62. Along with the query and the database, the user can specify which matrix to use when running a protein BLAST search; if none is given, BLOSUM62 is used.
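
To make the difference between nucleotide and amino acid scoring concrete, here is a small Python sketch. The table holds only six entries copied from BLOSUM62, not the full matrix, and the nucleotide points (+1 for a match, -2 for a mismatch) are illustrative assumptions; the function names are invented for this example.

```python
# A handful of BLOSUM62 entries, reproduced for illustration only.
# Keys are alphabetically sorted pairs so that (a, b) and (b, a) agree.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("I", "L"): 2,   # similar amino acids: positive score
    ("D", "D"): 6,
    ("D", "L"): -4,  # dissimilar amino acids: negative score
    ("W", "W"): 11,
}


def protein_pair_score(a, b):
    """Look up the score of one aligned amino acid pair in the excerpt."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def nucleotide_pair_score(a, b, match=1, mismatch=-2):
    """Score one aligned nucleotide pair: identical or not, nothing else."""
    return match if a == b else mismatch


def ungapped_score(seq1, seq2, pair_score):
    """Sum the pair scores of two equally long, already aligned sequences."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("LID", "LID", protein_pair_score))    # identical
print(ungapped_score("LID", "ILD", protein_pair_score))    # similar residues
print(ungapped_score("LLD", "DLD", protein_pair_score))    # one dissimilar pair
print(ungapped_score("ACGT", "ACGA", nucleotide_pair_score))
```

Swapping leucine and isoleucine lowers the score but keeps it clearly positive, whereas pairing leucine with aspartate subtracts points.
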
It is important to understand that this way of creating alignments is a heuristic, not a perfect algorithm. It is used in BLAST because it is very fast, but it will miss or under-report certain types of similarities. (The interested reader may look up “dynamic programming” to find algorithms, such as Smith-Waterman, that are guaranteed to produce the best-scoring alignment under a given scoring scheme.) The great advantage of BLAST is not its exactness, but its speed.
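
For comparison, the dynamic programming approach mentioned above can be sketched in a few lines. The function below computes the score of the best local alignment in the style of the Smith-Waterman algorithm. It is a minimal illustration under stated assumptions: simple match/mismatch points and a fixed penalty per gap position, all chosen for the example. It fills in a table with one cell per pair of positions, which is why it is far slower than BLAST on large databases.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(
                0,                      # start a new local alignment here
                diagonal,               # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("ACGTTGCA", "ACGTGCA"))  # best alignment needs one gap
print(smith_waterman_score("AAAA", "TTTT"))
```

Because every cell is evaluated, the result is the best score possible under the chosen scoring scheme, at a cost proportional to the product of the two sequence lengths.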

Understanding the BLAST output

It should be clear from the above that the output of BLAST consists of a list of hits for a given query sequence. The hits are ordered according to their similarity with the query. The most basic measurement of similarity is the “bit score” (or just “score”), which reflects the points awarded to the BLAST-generated alignment, normalized so that scores from different searches can be compared. From the score, BLAST calculates the “E-value”: the number of hits with a score at least this good that would be expected just by chance in a database of the same size. The lower the E-value, the less likely it is that the hit is a chance match.
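
The relation between score and E-value is commonly written as S' = (lambda * S - ln K) / ln 2 for the bit score S' and E = m * n * 2^(-S') for the E-value, where S is the raw alignment score, m the query length and n the total database length. The Python sketch below simply evaluates these two formulas. The values used for lambda and K, as well as the lengths, are placeholders chosen for illustration, and the sketch ignores the length corrections that BLAST applies in practice.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits scoring at least `bits`."""
    return query_length * database_length * 2 ** (-bits)


# Placeholder statistical parameters and sizes, for illustration only.
LAMBDA = 0.3
K = 0.1
QUERY_LENGTH = 250
DATABASE_LENGTH = 50_000_000

for raw in (40, 80, 160):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1), e_value(bits, QUERY_LENGTH, DATABASE_LENGTH))
```

Two properties are worth noting: every additional bit halves the E-value, and the same alignment gets a proportionally larger E-value when the database grows.
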
It is possible to run BLAST with multiple query sequences. In that case, BLAST simply processes one query at a time and appends the results to the same output file, with each block starting with a definition of the query used. With many queries in one BLAST run, the output can quickly become overwhelming, and it is then useful to visualize the BLAST output with a dedicated tool. One such tool has been developed at the University of Oslo:

BLASTGrabber: a bioinformatic tool for visualization, analysis and sequence selection of massive BLAST data (doi:10.1186/1471-2105-15-128)

If you are interested in other options, you can read the following paper:

BLAST output visualization in the new sequencing era (doi:10.1093/bib/bbt009)
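
For a first overview of a multi-query run without a dedicated tool, a short script can group the hits per query. The sketch below assumes the output was written in the tab-separated format of BLAST+ (-outfmt 6) with its twelve default columns. The function name best_hits, the E-value cutoff of 1e-5 and the four example lines are invented for illustration.

```python
from collections import defaultdict

# Default column order of the BLAST+ tabular format (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(lines, max_evalue=1e-5):
    """Group hits by query, drop weak ones, and sort by bit score."""
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits[row["qseqid"]].append(
                (row["sseqid"], evalue, float(row["bitscore"])))
    return {query: sorted(found, key=lambda hit: -hit[2])
            for query, found in hits.items()}


# Made-up output lines for two queries.
example = [
    "q1\thitA\t98.5\t200\t3\t0\t1\t200\t5\t204\t2e-100\t365",
    "q1\thitB\t45.0\t180\t90\t4\t10\t189\t20\t199\t3.1\t30.2",
    "q2\thitC\t70.2\t150\t40\t2\t1\t150\t1\t150\t1e-40\t160",
    "q1\thitD\t80.0\t190\t35\t1\t5\t194\t8\t197\t5e-60\t230",
]

summary = best_hits(example)
for query, found in summary.items():
    print(query, [name for name, _, _ in found])
```

Hits with an E-value above the cutoff are dropped, which reflects the earlier point that some hits are too weak to be taken into account.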