Difference between revisions of "Measuring BLAST hits"


Revision as of 11:37, 23 February 2015

Measuring BLAST hits

Similarity: bitscores

BLAST offers several ways of evaluating the quality of a hit. The bitscore (or just score) is a single number representing the quality of the BLAST-generated query/hit alignment. Higher bitscores imply better alignments, but note that an alignment can become “better” in two ways: by becoming longer, or by improving the match between the aligned characters (e.g. identical amino acids rather than merely similar ones). Provided that the scoring matrix and the gap-opening and gap-extension costs are the same, bitscores can to some degree be compared between different BLAST searches. Thus you may conclude that search 1 gave a better query/hit match than search 2 if it produced a higher bitscore. Still, the higher bitscore in search 1 may come from a relatively short but perfect alignment, while search 2 produced a much longer but imperfect alignment that scored lower. In this scenario it is not certain that the search 1 result is “better” than the search 2 result; in the end, the user has to judge this.

Significance: E-values

In addition to the bitscore, an e-value is reported for each BLAST hit. This value indicates whether the hit may be due to chance rather than to a real similarity between the query and hit sequences. The e-value is based on the bitscore, but is scaled by the sizes of the query and the database. As a consequence, “good” e-values are very small positive numbers (in theory the value can never equal zero, but BLAST may still report an e-value of zero because of floating-point rounding). “Good” in this context means that it is very unlikely that the hit is caused by chance alone, so a true similarity between query and hit can be assumed. In other words, the BLAST hit is statistically significant.

The calculation assumes a random database and a random query sequence. Given this model and a particular query/hit alignment, the e-value is the number of hits expected by chance that are as good as or better than our hit. Note that the e-value is NOT a p-value. In this setting, the p-value is the probability of finding at least one such hit by chance, whereas the e-value is an expected count and can therefore exceed 1. Thus a hit with an e-value of 2 means that two hits equal to or better than this hit can be expected in a random scenario. In other words, this hit is quite likely to have been caused by chance, and is not significant. Reporting e-values instead of p-values is nothing more than a convention; one can be transformed into the other: P = 1 - e^(-E) (P = p-value, E = e-value). Furthermore, for small values (E < 0.01) the two numbers are nearly identical.
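
The conversion P = 1 - e^(-E) can be tried out with a few lines of Python. This is only a small sketch; the function name evalue_to_pvalue is made up for this illustration and is not part of BLAST.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """Convert an e-value to a p-value using P = 1 - exp(-E).

    -expm1(-E) is mathematically the same expression, but it keeps
    precision for very small E, where 1 - exp(-E) would round to 0.
    """
    return -math.expm1(-e_value)


# A few example e-values, from clearly significant to clearly not.
for e in (1e-50, 0.01, 1.0, 2.0, 10.0):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6g}")
```

For E = 0.01 the p-value is about 0.00995, i.e. almost the same number, while for E = 2 it is about 0.86 and can never exceed 1.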

As stated above, the e-value depends on the size of the database (and, to a lesser degree, on the size of the query). Increasing the size of the database makes it harder to achieve low (i.e. significant) e-values. The reason is simple: in the random model of our database, more sequences (or longer sequences) give more opportunities for finding a random hit. Imagine for instance throwing a die 60 times. Around 10 sixes can be expected. Increasing the number of throws increases the expected number of sixes proportionally (6000 throws will produce around 1000 sixes).

This also means that two hits with identical alignments will have different e-values if they come from searches against databases of different sizes. In fact, of two identical hits (i.e. identical queries, hit sequences and resulting alignments), one may be highly significant while the other is insignificant, purely because of the difference in database size.
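
The effect of database size can be made concrete with the commonly cited Karlin-Altschul approximation E = m * n * 2^(-S'), where S' is the bitscore, m the query length and n the total database length. This formula is not given on this page and is brought in here as an assumption; real BLAST additionally uses "effective" lengths with edge corrections, which are ignored below. The function name and all numbers are invented for the illustration.

```python
def expected_hits(bitscore: float, query_len: int, db_len: int) -> float:
    """Approximate e-value: E = m * n * 2^(-S').

    Simplified sketch: uses raw lengths, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bitscore)


# The same alignment (same bitscore, same query) searched against
# a small and a large database. All values are hypothetical.
bitscore = 40.0
query_len = 300
small_db = 1_000_000          # total residues in the small database
large_db = 10_000_000_000     # total residues in the large database

e_small = expected_hits(bitscore, query_len, small_db)
e_large = expected_hits(bitscore, query_len, large_db)

print(f"small database: E = {e_small:.3g}")
print(f"large database: E = {e_large:.3g}")
```

Under these assumptions the identical hit gets an e-value well below 0.001 in the small database, but an e-value above 1 in the large one.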

Similarity vs. significance 

Often, e-values are used as an indication of the quality of a BLAST hit, i.e. the goodness of the underlying alignment. From the above, it should be clear that this is, at best, an imprecise measure of alignment quality. Much of the confusion arises when people forget that the e-value quantifies the expected number of chance hits, and nothing more. Accepting this, it is clear that the interpretation of an e-value does not change with database size; an e-value of 1e-50 is clearly significant (not caused by chance) no matter what database produced the hit. On the other hand, the bitscores (i.e. the qualities of the underlying alignments) behind two hits that both have an e-value of 1e-50 may differ quite a bit. If the two hits come from databases of different sizes, the hit from the larger database must have the higher bitscore, because a larger search space requires a higher score to reach the same e-value.
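
To see how much the bitscores behind identical e-values can differ, the approximation E = m * n * 2^(-S') can be solved for the bitscore: S' = log2(m) + log2(n) - log2(E). As before, this formula is an assumption not stated on this page, it ignores BLAST's effective-length corrections, and the function name and numbers are hypothetical.

```python
import math


def bitscore_for_evalue(e_value: float, query_len: int, db_len: int) -> float:
    """Bitscore needed to reach a given e-value: S' = log2(m * n / E).

    Computed as a sum of logarithms so that very small e-values
    do not cause overflow in the intermediate result.
    """
    return math.log2(query_len) + math.log2(db_len) - math.log2(e_value)


# Same query and same target e-value, two hypothetical database sizes.
target_e = 1e-50
query_len = 300
s_small = bitscore_for_evalue(target_e, query_len, 1_000_000)
s_large = bitscore_for_evalue(target_e, query_len, 10_000_000_000)

print(f"small database: bitscore = {s_small:.1f}")
print(f"large database: bitscore = {s_large:.1f}")
print(f"difference:     {s_large - s_small:.1f} bits")
```

In this sketch, each doubling of the database adds one bit to the score needed for the same e-value, so a database 10,000 times larger requires roughly 13 extra bits.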