Similarity, homology and orthology

From mn/ibv/bioinfwiki

Often, the primary goal of a BLAST search is to identify homologs of a given query sequence. We use the similarity reported by BLAST to infer such homology. But it is important to realize that similarity as reported by BLAST does not necessarily imply homology. For instance, short repeats such as “ATATATATA” can be caused by slippage during DNA replication. Thus, two sequences may be similar without being related by a common ancestor. For this reason, it is useful to mask repeats before running BLAST searches, for instance using programs such as RepeatMasker (http://repeatmasker.org/). BLAST itself can also mask low-complexity regions such as simple repeats, using the “-dust” option for nucleotide searches.
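To make the idea of masking concrete, here is a minimal Python sketch that soft-masks (lowercases) short tandem repeats before a sequence is used in a search. It is only a toy illustration: the function name, the parameters and the thresholds are assumptions made for this example, and it does not reproduce the algorithms used by RepeatMasker or by BLAST's own filter.

```python
import re


def soft_mask_simple_repeats(seq, max_unit=3, min_copies=4):
    """Lowercase tandem repeats of short units (1..max_unit bases).

    Hypothetical helper for illustration only. The input is uppercased
    first, so any soft-masking already present in it is discarded.
    """
    # A unit of 1..max_unit bases, followed by at least
    # (min_copies - 1) further copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    return pattern.sub(lambda m: m.group(0).lower(), seq.upper())


# Five copies of "AT" flanked by non-repetitive sequence.
print(soft_mask_simple_repeats("GGC" + "AT" * 5 + "CCG"))
# prints GGCatatatatatCCG
```

Soft-masking keeps the original bases in the sequence, so the masked stretch is still visible in any alignment that is reported.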
In addition, we are often interested in identifying orthologs rather than mere homologs. An ordinary BLAST search cannot distinguish orthologs from other homologs such as paralogs. A reciprocal BLAST search can help here. For instance, suppose we have the proteome of organism X and use a protein “X1” from it as a query against the proteome of an organism Y (which we suspect contains an ortholog of “X1”). Our BLAST search identifies a protein “Y1” as exhibiting good similarity to “X1”. But we still do not know whether the proteome of X contains other proteins that are more similar to “Y1” than “X1” is. To resolve this, we now use “Y1” as a query against the whole proteome of X. If this produces “X1” as the top hit, “X1” and “Y1” are reciprocal best hits, and we can be (reasonably) certain that they are true orthologs. If “X1” is not recovered as the top hit, the two proteins are not reciprocal best hits, and this test gives no support for “X1” and “Y1” being orthologs.
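The reciprocal search can be sketched in a few lines of Python. The sketch assumes that both searches were saved in BLAST's tabular format (“-outfmt 6”) with the default twelve columns, where the query ID is column 1, the subject ID column 2 and the bit score column 12. The function names and the example rows are made up for illustration, and ranking hits by bit score is one possible choice among several.

```python
def best_hits(tabular_lines):
    """Map each query ID to its best subject ID, ranked by bit score.

    Expects rows in BLAST tabular format with the default 12 columns.
    On ties, the hit that appears first in the input is kept.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(x_vs_y, y_vs_x):
    """Return sorted (x, y) pairs that are each other's best hit."""
    forward = best_hits(x_vs_y)
    reverse = best_hits(y_vs_x)
    return sorted((x, y) for x, y in forward.items() if reverse.get(y) == x)


# Hypothetical example rows (not real search results).
x_vs_y = [
    "X1\tY1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-100\t350",
    "X1\tY2\t40.0\t180\t108\t2\t5\t184\t3\t182\t1e-20\t90",
    "X2\tY1\t60.0\t190\t76\t1\t1\t190\t4\t193\t1e-50\t180",
]
y_vs_x = [
    "Y1\tX1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-100\t350",
    "Y1\tX2\t60.0\t190\t76\t1\t4\t193\t1\t190\t1e-50\t180",
    "Y2\tX3\t55.0\t170\t77\t1\t1\t170\t1\t170\t1e-30\t120",
]

print(reciprocal_best_hits(x_vs_y, y_vs_x))
# prints [('X1', 'Y1')]
```

In this made-up data, “X2” also has “Y1” as its best hit, but “Y1” points back to “X1”, so only the pair (“X1”, “Y1”) passes the reciprocal test.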
Various dedicated ortholog identification algorithms exist, and these may perform better than the scheme above. Some of them, such as OrthoMCL, themselves use BLAST searches as part of their algorithm.