Description of the genome query for
Triplet-controlled four-taxon tree analysis
Constructing four-taxon trees:
There are three possible phylogenetic topologies, (AB)-(CD), (AC)-(BD) and (AD)-(BC),
for an orthologous gene in four taxa, where orthology is defined as the reciprocal best match
in all pairwise comparisons between the four taxa.
A valid topology is one where the interlink branch is of positive length, and an invalid topology is one where
this branch is of negative length.
We estimate the length of this interlink branch (positive or negative) as follows,
where each two-letter term is a pairwise distance: for example, AB is the distance based on the mean normalized
BLASTP bit score between a locus in genome A and its reciprocally best matching locus in genome B, and 2AB is twice that distance:
for topology 1, e1 = (AC + AD + BC + BD - 2AB - 2CD) / 4,
for topology 2, e2 = (AB + AD + BC + CD - 2AC - 2BD) / 4,
and for topology 3, e3 = (AB + AC + BD + CD - 2AD - 2BC) / 4.
We exclude all cases in which the positive interlink branch length is less than the user-selected value,
or in which two topologies give positive values for the interlink branch length.
We recognize that the assumption of additivity for normalized BLAST scores on which this calculation is based
is untested and likely only approximately true.
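
The calculation and the two exclusion rules above can be sketched in Python as follows. This is a minimal illustration, not the query's actual implementation: the function names, the dictionary of distances keyed by taxon pair, and the min_length parameter (standing in for the user-selected value) are all assumptions made for the example, and the six distances are taken as already computed.

```python
from typing import Dict, Optional, Tuple

# Topology numbering follows the text:
# 1 = (AB)-(CD), 2 = (AC)-(BD), 3 = (AD)-(BC).

def interlink_lengths(d: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (e1, e2, e3) from six pairwise distances keyed 'AB', 'AC', ..."""
    ab, ac, ad = d["AB"], d["AC"], d["AD"]
    bc, bd, cd = d["BC"], d["BD"], d["CD"]
    e1 = (ac + ad + bc + bd - 2 * ab - 2 * cd) / 4
    e2 = (ab + ad + bc + cd - 2 * ac - 2 * bd) / 4
    e3 = (ab + ac + bd + cd - 2 * ad - 2 * bc) / 4
    return e1, e2, e3

def choose_topology(d: Dict[str, float], min_length: float) -> Optional[int]:
    """Return the supported topology (1, 2 or 3), or None if the case is excluded.

    min_length is a hypothetical name for the user-selected value.
    """
    lengths = interlink_lengths(d)
    positive = [i for i, e in enumerate(lengths, start=1) if e > 0]
    # Exclude cases with two positive topologies (or none at all).
    if len(positive) != 1:
        return None
    topology = positive[0]
    # Exclude cases whose positive branch is shorter than the threshold.
    if lengths[topology - 1] < min_length:
        return None
    return topology

# Illustrative distances only: A and B are close, C and D are close.
example = {"AB": 2, "CD": 2, "AC": 6, "AD": 6, "BC": 6, "BD": 6}
print(interlink_lengths(example))     # (4.0, -2.0, -2.0)
print(choose_topology(example, 1.0))  # 1
```

The three estimates always sum to zero, because each pairwise distance enters once with weight -2 and twice with weight +1. At most two of them can therefore be positive, which is why the second exclusion rule only needs to consider the two-positive case.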
Detecting events of lateral genetic transfer by triplet-controlled four-taxon tree analysis:
The global genomic phylogeny is assumed to place C and D as outgroups to A and B, so a four-taxon tree for a given shared ortholog will either agree with this genomic topology, (AB)-(CD), or disagree with it. If a gene robustly yields the tree (AC)-(BD), then either A exchanged this gene with C, or B exchanged this gene with D. We can decide whether the exchange was between A and C or between B and D by varying the choice of taxa in the C or the D positions. For each triplet of genomes A, B, and C, we therefore count how many times A pairs with B, A pairs with C, and B pairs with C. If more than the user-selected proportion of the AC exchanges are supported over all choices of D (with more than one D tested), the exchange is deemed to be between A and C. Similarly, if more than the user-selected proportion of the BC exchanges are supported over all choices of D (with more than one D tested), the exchange is deemed to be between B and C.
However, several kinds of false positives may be detected instead, arising from reciprocally lost paralogs, unequal rates of evolution, or, when sequences are very closely related, measurement error. We suggest following this analysis with full, multiple-alignment-based trees of the candidates.
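
The triplet counting described above can be sketched in Python as follows. This is one possible reading rather than the query's actual implementation. It assumes that each choice of D has already been reduced to a topology number (1 = (AB)-(CD), 2 = (AC)-(BD), 3 = (AD)-(BC)) or to None when the case was excluded, and that the proportion is taken over the choices of D that gave a usable topology. The function name, the min_proportion parameter, and the returned labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

def classify_triplet(topologies: Iterable[Optional[int]],
                     min_proportion: float) -> str:
    """Classify one gene for a fixed triplet (A, B, C).

    topologies holds one entry per choice of D:
    1 = A pairs with B, 2 = A pairs with C, 3 = B pairs with C,
    None = excluded four-taxon tree.
    """
    resolved = [t for t in topologies if t is not None]
    # The text requires more than one D; here only usable D are counted
    # (an assumption of this sketch).
    if len(resolved) < 2:
        return "undetermined"

    counts = Counter(resolved)
    total = len(resolved)
    ac_supported = counts[2] / total > min_proportion
    bc_supported = counts[3] / total > min_proportion

    # Both can exceed the threshold only if it is set below one half.
    if ac_supported and bc_supported:
        return "ambiguous"
    if ac_supported:
        return "A-C exchange"
    if bc_supported:
        return "B-C exchange"
    if counts[1] / total > min_proportion:
        return "consistent with genomic phylogeny"
    return "undetermined"

# Four choices of D, three of which pair A with C.
print(classify_triplet([2, 2, 2, 1], 0.7))  # A-C exchange
```

Candidates labeled as exchanges by such a count would still need the follow-up with full alignment-based trees, since the count by itself cannot distinguish transfer from the false positives listed above.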
Copyright ©1998-2005 NeuroGadgets Inc. ©2006 University of Queensland
