Blastn vs. blastp

From mn/ibv/bioinfwiki

If you are looking for a certain protein-coding gene of interest (GOI) in the genome of some organism, you often have the choice of running either the blastn or the blastp program. In the former case, you would use the GOI mRNA sequence as a query and the genome of the organism as a sequence database. If using blastp instead, you would use the protein sequence of the GOI as a query against a database comprising the proteome of the organism. You may think the results of these two searches should be comparable. After all, if the organism in question has a homolog of the GOI, both its nucleotide sequence and its amino acid sequence should be identified.

In practice, the two searches can give very different results. For example, blastp found a homolog of the human AGO2 sequence with an e-value of 3e-38 in the proteome of Sphaeroforma arctica (a single-celled relative of animals). When this search was repeated using the AGO2 mRNA sequence as a query and the genome of S. arctica as a database, no hits were reported at all. What is happening here?

Blastn is in fact a rather poor tool for finding protein-coding sequences. This is in part due to the wobble position, the third nucleotide of most codons. Most amino acids can be encoded by multiple codons differing in the third position. Thus the exact same amino acid sequence can be encoded by two nucleotide sequences that differ in every third position (since mutations in the third position often do not affect the resulting protein, such mutations typically accumulate quite rapidly). The amino acid sequences being identical, blastp would have no problem retrieving one sequence using the other as a query. Blastn, however, starts each alignment from an exact word match, with a default word size of 11 nucleotides (in the BLAST+ command-line tools this applies to -task blastn; the default megablast task uses a word size of 28). This means the two sequences must share a perfect match of at least 11 consecutive nucleotides for blastn to be able to report any hit at all. In the above example, when setting the word size to 6, the best hit had an e-value of 0.031. In this case, a perfect match of 6 nucleotides was found between the query and database sequences, but blastn was not able to extend this alignment very far, which explains the poor e-value (a hit like this would often not be considered significant).
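The effect of the wobble position on word matching can be illustrated with a small Python sketch. It is not how blastn is implemented; it only mimics the seeding step by collecting the exact words shared by two sequences. The peptide, the two toy sequences and the helper names (translate, shared_words, longest_shared_word) are made up for this illustration, and the example is deliberately extreme: only amino acids with four-fold degenerate codons are used, so every third position can differ.

```python
from itertools import product

# Standard genetic code, codons enumerated in the usual T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(nt):
    """Translate a nucleotide sequence in reading frame 1."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))

def shared_words(seq_a, seq_b, word_size):
    """Exact words of the given size found in both sequences (at any position)."""
    words_a = {seq_a[i:i + word_size] for i in range(len(seq_a) - word_size + 1)}
    words_b = {seq_b[i:i + word_size] for i in range(len(seq_b) - word_size + 1)}
    return words_a & words_b

def longest_shared_word(seq_a, seq_b):
    """Length of the longest exact match between the two sequences."""
    longest = 0
    for size in range(1, min(len(seq_a), len(seq_b)) + 1):
        if not shared_words(seq_a, seq_b, size):
            break
        longest = size
    return longest

# Toy peptide built from amino acids whose codons are four-fold degenerate:
# the first two bases fix the amino acid, the third base is free.
PREFIXES = {"A": "GC", "R": "CG", "G": "GG", "T": "AC", "P": "CC"}
peptide = "ARGTPAGRTPGA"
seq_a = "".join(PREFIXES[aa] + "T" for aa in peptide)  # third position always T
seq_b = "".join(PREFIXES[aa] + "A" for aa in peptide)  # third position always A

identity = sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

print(translate(seq_a) == translate(seq_b))   # same protein
print(f"nucleotide identity: {identity:.0%}")
print("longest exact match:", longest_shared_word(seq_a, seq_b))
print("shared 11-mers:", shared_words(seq_a, seq_b, 11))
```

Both toy sequences encode the same peptide and are identical at two out of every three positions, yet they share no exact word longer than 2 nucleotides, so neither a word size of 11 nor one of 6 would give blastn a seed to extend. Real homologs are rarely this extreme, but the same mechanism applies.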

The obvious solution to this problem is to use blastp rather than blastn whenever possible. If only nucleotide sequences are available and we are searching for protein-coding sequences, blastx, tblastn or tblastx should be used. These programs translate the query (blastx), the database (tblastn) or both (tblastx) into protein sequences in all six reading frames and then compare them at the protein level, as blastp would, thus avoiding the blastn-related problems. Even so, introns in genomic data interrupt the coding sequence and make it difficult to produce long alignments using these methods.
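What "translating a nucleotide sequence into protein sequences" means here can be sketched in a few lines of Python. This is only an illustration of six-frame translation with the standard genetic code, not the code used by the BLAST programs; the function names and the short example sequence are invented for this page.

```python
from itertools import product

# Standard genetic code, codons enumerated in the usual T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(nt):
    """Return the reverse complement of a DNA sequence."""
    return nt.translate(COMPLEMENT)[::-1]

def translate(nt):
    """Translate in frame 1; codons with unknown bases become X, stops become *."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))

def six_frame_translation(nt):
    """Translate both strands in all three reading frames."""
    nt = nt.upper()
    frames = {}
    for strand, seq in (("+", nt), ("-", reverse_complement(nt))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Hypothetical 12-nucleotide example sequence.
for frame, protein in six_frame_translation("ATGGCCAAGTAA").items():
    print(frame, protein)
```

Each of the six translations is then searched at the protein level. An exon shows up as a stretch of matching amino acids in one frame, while the next exon may continue in a different frame after the intron, which is why alignments against genomic DNA tend to be fragmented.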

If looking for nucleotide sequences that do not code for proteins, blastn is still a rather poor tool. The reason is analogous to codon degeneracy in protein-coding sequences: the nucleotide sequence itself is not what is conserved. Such non-coding sequences often give rise to functional RNA molecules rather than proteins. These RNA molecules adopt a specific secondary structure, held together by base-pairing. Compensating mutations may change the RNA sequence without changing the RNA secondary structure. For instance, an A-U base pair may change to a G-C base pair, retaining the structure but changing the sequence. The resulting RNA may retain its function, which is determined by this structure, while the underlying sequence changes enough to become unrecognizable to the blastn algorithm. If looking for non-coding RNA sequences, specialized tools may be used instead of blastn. For instance, tRNAscan-SE is a specialized tool made to recognize tRNA sequences, and Infernal (together with the Rfam database) is a general tool for finding RNA families. If it is necessary to use blastn, make sure to experiment with different word sizes before concluding that a sequence is not present in a database.
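The idea of compensating mutations can be made concrete with a small Python sketch. The two hairpin sequences and the dot-bracket structure below are invented toy examples, and the helper names are arbitrary; the check only tests whether paired positions can form Watson-Crick or G-U pairs, it does not predict a structure.

```python
# Base pairs accepted in RNA stems: Watson-Crick pairs plus the G-U wobble pair.
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_positions(structure):
    """Return (i, j) index pairs from a dot-bracket string without pseudoknots."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs

def fits_structure(rna, structure):
    """True if every paired position in the structure can form a base pair."""
    return all((rna[i], rna[j]) in CANONICAL_PAIRS for i, j in paired_positions(structure))

def identity(seq_a, seq_b):
    """Fraction of identical positions between two equally long sequences."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

# A six-base-pair stem closed by a four-nucleotide loop (toy example).
structure = "((((((....))))))"
rna_a = "GGACUCGAAAGAGUCC"
rna_b = "CCUGAGGAAACUCAGG"  # every base pair swapped by compensating mutations

print(fits_structure(rna_a, structure), fits_structure(rna_b, structure))
print(f"sequence identity: {identity(rna_a, rna_b):.0%}")
```

Both sequences are compatible with the same hairpin, but only the four loop nucleotides are identical between them. A purely sequence-based search has very little to hold on to in such a case, whereas tools that model the structure, such as Infernal, score the pairing itself.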