Improving BLAST hits

From mn/ibv/bioinfwiki

Sometimes a BLAST search does not return the significant hits that were expected. It may be tempting to “improve” upon these results in order to rescue the significance of the hits. For instance, people may inspect a BLAST alignment and conclude that it looks valid, even though the associated e-value is poor. Such conclusions must be viewed with much scepticism. There is little reason to believe that a person glancing at an alignment is capable of judging its quality and simultaneously relating it correctly to the database and query sizes. However, there are certain things that can be done to increase confidence in a mediocre BLAST hit. Importantly, all of these have in common that they depend on data or algorithms not accessible to BLAST.

Adjusting BLAST search parameters

Ideally, optimal parameters would be used in every BLAST search, but it can be difficult to know exactly what these are. For blastn searches in particular, it may prove worthwhile to change the word size parameter (see below). Other important parameters are the scoring matrix, codon usage, and possibly the gap-opening and gap-extension costs.
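As a rough illustration of varying the word size, the following sketch builds blastn command lines for several word sizes so that the resulting hit lists can be compared. It assumes the NCBI BLAST+ command-line tools; the file names `query.fasta` and `mydb` and the helper `blastn_command` are hypothetical placeholders, and the commands are only printed, not executed.

```python
def blastn_command(query, db, word_size, out):
    """Return a BLAST+ blastn command line (as an argument list).

    '-task blastn' selects the classic blastn mode rather than megablast,
    which allows shorter word sizes and is more sensitive for divergent hits.
    """
    return [
        "blastn",
        "-task", "blastn",
        "-query", query,
        "-db", db,
        "-word_size", str(word_size),
        "-outfmt", "6",          # tabular output, easy to compare between runs
        "-out", out,
    ]


# Hypothetical file names; smaller word sizes are slower but more sensitive.
commands = [
    blastn_command("query.fasta", "mydb", w, f"hits_w{w}.tsv")
    for w in (7, 11, 15)
]

for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be run with, for example, `subprocess.run(cmd, check=True)`, and the tabular outputs compared to see whether the hit of interest becomes more convincing at a smaller word size.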

Reducing the database size

Searching a smaller database improves (lowers) the e-values, because the e-value of a hit scales with the size of the search space. For the results to be valid, however, the reduced database must be a reasonable, independently justified subset of the data. For instance, if looking for a protein in the proteome of an organism X, it makes sense to search only the proteome of this organism, rather than the “nr” database (which contains protein sequences from a vast number of organisms). An invalid example would be to restrict the search to proteins whose headers are written in upper-case letters (because this was previously observed for a desired, but insignificant, hit).
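The effect of database size can be sketched with the Karlin-Altschul relation between bit score and e-value, E = m · n · 2^(−S'), where m is the query length, n the total database length and S' the bit score. The sketch below is a simplification: it ignores the effective-length (edge-effect) corrections that BLAST applies, and the lengths used are made-up illustrative numbers.

```python
def approx_evalue(bit_score, query_len, db_len):
    """Approximate e-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


bit_score = 40.0
query_len = 300                    # residues in the query (illustrative)
large_db = 5_000_000_000           # e.g. a very large multi-organism database
small_db = 5_000_000               # e.g. a single proteome

e_large = approx_evalue(bit_score, query_len, large_db)
e_small = approx_evalue(bit_score, query_len, small_db)

print(f"large database: E = {e_large:.3g}")
print(f"small database: E = {e_small:.3g}")
print(f"ratio: {e_large / e_small:.0f}")
```

The same alignment, with the same bit score, receives an e-value that is lower by exactly the factor by which the database shrinks. This is why the choice of the smaller database has to be justified independently of the hit one hopes to rescue.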

Increasing the quality of the alignment

The BLAST algorithm for creating alignments is designed for speed, not for alignment quality. It is therefore possible that a much better alignment between query and hit is “hiding” under the BLAST result. Dynamic programming algorithms such as Smith-Waterman are much better pairwise-alignment algorithms; they may be used to see whether it is possible to demonstrate a more extensive similarity between query and hit than was possible with BLAST.
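To make the idea concrete, here is a minimal sketch of Smith-Waterman local alignment. It is a teaching version under simplifying assumptions: a fixed match/mismatch score instead of a substitution matrix, and a linear gap penalty instead of separate gap-opening and gap-extension costs. The scoring values are arbitrary illustrative choices; real analyses would normally use an established implementation.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aln_a, aln_b = smith_waterman("TTACGTAA", "GGACGTCC")
print(score)
print(aln_a)
print(aln_b)
```

Because every cell of the matrix is evaluated, the result is guaranteed to be the optimal local alignment under the chosen scoring scheme, at the cost of time proportional to the product of the two sequence lengths.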

Using additional positional information

A mediocre alignment may still be judged significant if it falls in the expected region of the sequence, and contains vital positions. For instance, the active sites of enzymes contain conserved residues that are not expected to change. An otherwise mediocre alignment that is restricted to such an active site, and correctly identifies the conserved residues, may be accepted.
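A check of this kind can be sketched as follows. The function tests whether an alignment covers a set of known conserved positions in the query and whether the hit carries the same residue at each of them. The function name, the example alignment strings and the site positions are hypothetical and chosen only for illustration.

```python
def conserved_sites_preserved(query_aln, subject_aln, query_start, sites):
    """Check conserved query positions within a pairwise alignment.

    query_aln, subject_aln: aligned strings of equal length, '-' for gaps.
    query_start: 1-based query coordinate of the first aligned query residue.
    sites: dict mapping 1-based query position -> expected residue.
    Returns True only if every site is covered and identical in both sequences.
    """
    pos = query_start - 1
    status = {}
    for q, s in zip(query_aln, subject_aln):
        if q == "-":
            continue                 # gap in the query: no query position consumed
        pos += 1
        if pos in sites:
            status[pos] = (q == sites[pos] and s == q)
    # Sites outside the aligned region were never seen and count as failures.
    return all(status.get(p, False) for p in sites)


# Hypothetical example: a conserved serine at query position 102.
sites = {102: "S"}
print(conserved_sites_preserved("GDSGGP", "GDSGAP", 100, sites))
print(conserved_sites_preserved("GDSGGP", "GDAGGP", 100, sites))
```

A hit that fails such a check is not necessarily unrelated, but a hit that passes it has support from information that BLAST itself did not use.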

Adding query sequences

When BLASTing across great evolutionary distances, it is worthwhile to use multiple homologs of the sequence of interest as queries. This increases the chances of finding a closer match in the database. Note, however, that if a great number of homologs is used, the BLAST e-values should be corrected to reflect the multiple testing done.
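One simple way to apply such a correction is a Bonferroni-style adjustment: since an e-value is an expected number of chance hits for a single query, searching with N queries multiplies the expected number of chance hits by N. The sketch below assumes this simple correction; the hit names, e-values and threshold are made-up illustrative values.

```python
def correct_evalues(hits, n_queries):
    """Multiply each e-value by the number of query sequences used.

    hits: list of (hit_name, evalue) tuples pooled over all queries.
    Returns a new list of (hit_name, corrected_evalue) tuples.
    """
    return [(name, evalue * n_queries) for name, evalue in hits]


# Hypothetical pooled results from searches with 50 homologous queries.
hits = [("hitA", 1e-8), ("hitB", 4e-4), ("hitC", 0.02)]
threshold = 0.01

corrected = correct_evalues(hits, n_queries=50)
significant = [name for name, e in corrected if e <= threshold]

for name, e in corrected:
    print(f"{name}: corrected E = {e:.3g}")
print("still significant:", significant)
```

In this made-up example, hitB would have passed the threshold for a single query but no longer does once the 50 searches are taken into account.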