Description of the genome query for Gene families

For each ORF in the chosen genome, and for each of that ORF's matches to the database that are better than or equal to the specified BLASTP cutoff e-value, a check is made to see whether the hit is to a sequence other than the ORF itself, but from the same genome.
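The filtering step can be sketched as follows. This is only an illustration: the record layout (query, subject, subject genome, e-value), the function name intragenomic_hits, and the ORF names are assumptions, not the database's actual schema or code.

```python
def intragenomic_hits(blast_hits, genome, max_evalue):
    """Tabulate, per query ORF, its non-self hits within the same genome.

    blast_hits: iterable of (query, subject, subject_genome, evalue) tuples
                (a hypothetical layout; the real tables are not described).
    A lower e-value is a better match, so "better than or equal to the
    cutoff" means evalue <= max_evalue.
    """
    hits = {}
    for query, subject, subject_genome, evalue in blast_hits:
        if evalue > max_evalue:
            continue  # match is worse than the cutoff
        if subject == query:
            continue  # hit to self
        if subject_genome != genome:
            continue  # hit to a different genome
        hits.setdefault(query, []).append(subject)
    return hits


# Made-up records, for illustration only.
example_hits = [
    ("orfA", "orfA", "K12", 0.0),      # self hit: dropped
    ("orfA", "orfB", "K12", 1e-40),    # kept
    ("orfA", "orfZ", "other", 1e-50),  # other genome: dropped
    ("orfA", "orfC", "K12", 1e-3),     # worse than cutoff: dropped
    ("orfB", "orfA", "K12", 1e-38),    # kept
]
tabulated = intragenomic_hits(example_hits, genome="K12", max_evalue=1e-15)
```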

Once the list of intragenomic matches for each ORF has been tabulated, the families are extended transitively: any ORF matched by a member of a family is itself added to that family.

For example, with a dataset such as:

(A,B), (A,C), (A,F), (A,H),
(B,A), (B,C), (B,H), (B,P),
(C,A), (C,B), (C,F), (C,H), (C,P),
(D,W), (D,X),
(W,D), (W,X),
(X,D), (X,W),

the families reported would be (A,B,C,F,H,P) and (D,W,X), even though there is no direct match between A and P, or between B and F.
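The exact extension procedure is not spelled out here, so the following is only a sketch of one procedure that reproduces the families listed above. It assumes that each ORF not yet placed in a family seeds a new one, and that the match lists of its members are then followed until nothing new is found. The function name extend_families is hypothetical.

```python
from collections import deque


def extend_families(hits):
    """Group ORFs into families by following match lists transitively.

    hits: dict mapping each query ORF to the list of its intragenomic
          matches, in tabulation order.
    Assumption: an ORF already placed in a family is not used as a new
    seed, but it can still be pulled into a later family as a match.
    """
    assigned = set()
    families = []
    for seed in hits:
        if seed in assigned:
            continue
        family = [seed]
        seen = {seed}
        queue = deque([seed])
        while queue:
            orf = queue.popleft()
            # ORFs with no match list of their own contribute nothing.
            for match in hits.get(orf, []):
                if match not in seen:
                    seen.add(match)
                    family.append(match)
                    queue.append(match)
        assigned.update(seen)
        families.append(family)
    return families


dataset = {
    "A": ["B", "C", "F", "H"],
    "B": ["A", "C", "H", "P"],
    "C": ["A", "B", "F", "H", "P"],
    "D": ["W", "X"],
    "W": ["D", "X"],
    "X": ["D", "W"],
}
families = extend_families(dataset)
```

Because the sketch follows matches in one direction only (from query to hit), its result depends on which matches passed the cutoff in each direction. A symmetric clustering such as connected components would give the same families for this dataset.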

Finally, the mean family score is computed based on each pairwise score within the extended family.
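A minimal sketch of the averaging, under stated assumptions: the scores are held in a dictionary keyed by (query, match) pairs, both directions of a pair are counted when both were recorded, and pairs with no recorded match are left out rather than counted as zero. The text does not say how such pairs are actually treated, and the scores below are made up.

```python
from statistics import mean


def mean_family_score(family, scores):
    """Average the recorded pairwise scores among members of one family.

    family: iterable of ORF names.
    scores: dict mapping (query, match) to a BLASTP score (assumed layout).
    Returns None when no pair within the family has a recorded score.
    """
    members = set(family)
    pair_scores = [
        score
        for (query, match), score in scores.items()
        if query in members and match in members and query != match
    ]
    return mean(pair_scores) if pair_scores else None


# Illustrative scores only; note that (D,W) and (W,D) differ slightly.
example_scores = {
    ("D", "W"): 200.0, ("W", "D"): 198.0,
    ("D", "X"): 150.0, ("X", "D"): 152.0,
    ("W", "X"): 100.0, ("X", "W"): 100.0,
}
family_mean = mean_family_score(["D", "W", "X"], example_scores)
```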

Note that BLASTP scores (and e-values) are not strictly commutative (symmetric), i.e., the score between A and B is similar but not identical to the score between B and A. This property has two implications for this query: i) borderline matches may be missed or included, depending on the order of operations in the query's tabulation process; and ii) an ORF may appear in more than one family. The latter anomaly would occur in the following situation, where the dataset is, for example:

(A,B), (A,F), (A,J),
(B,A), (B,F),
(F,A), (F,B),
(R,W),
(W,J), (W,R).

The families reported would be (A,B,F,J) and (R,W,J). ORF J is in two families because (A,J) and (W,J) are better than the threshold, whereas (J,A) and (J,W) are not; J therefore contributes no matches of its own that would join the two families.
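One way to spot the matches behind this anomaly is to list every pair whose reverse is absent from the tabulated dataset. The sketch below does this for the dataset above. The function name non_reciprocal_hits is hypothetical and is not part of the query itself.

```python
def non_reciprocal_hits(pairs):
    """Return the (query, match) pairs whose reverse pair was not recorded."""
    recorded = set(pairs)
    return [(q, m) for (q, m) in pairs if (m, q) not in recorded]


pairs = [
    ("A", "B"), ("A", "F"), ("A", "J"),
    ("B", "A"), ("B", "F"),
    ("F", "A"), ("F", "B"),
    ("R", "W"),
    ("W", "J"), ("W", "R"),
]
one_way = non_reciprocal_hits(pairs)
```

Here the one-way matches are (A,J) and (W,J), the two matches that place J in both families.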

Problems:

  1. Multidomain genes link the families of their constituent domains.
  2. Some non-homologous ORFs may group together based on a shared motif, if the chosen BLASTP cutoff e-value is too permissive (i.e., too high).

Discussion

Gene families are important to characterize if one is interested in gene evolution, genome evolution, or phylogenetics. In gene evolution, the redundancy that follows a duplication, or an acquisition from another cell, permits gene product function to specialize and provides an opportunity to invent novel functions. Redundancy also buffers the cell against faults in metabolism, and otherwise permits a more versatile regulatory strategy. In genome evolution, gene families may (when the genes are still very similar in sequence) permit rearrangement of the genome by homologous recombination. Finally, the distinction between orthology and paralogy is of central importance in phylogenetic analysis. Characterizing gene families helps to provide reassurance that one's phylogenies compare orthologs rather than paralogs.

Examples:

  1. If one selects a reasonably stringent BLASTP cutoff e-value, one can be reasonably assured of finding true homologs. For example, selecting 1.0e-15 for gene families in Escherichia coli K12 yields 470 families, which include 1659 of E. coli's 4289 ORFs. Some of these families contain genes that were similarly annotated (e.g., family #1 contains three aspartokinases). Others contain both annotated and unannotated members (e.g., family #25 has 77 ATP-binding transporter domains and two unknowns, which suggests the hypothesis that the unknowns are ABC transporter domains). Several families contain solely unannotated members. Because annotation is consistent within gene families, it is reasonable to hypothesize a general function for many of E. coli's unknown proteins; at the least, such a hint makes it simpler to identify their function in the laboratory.

Copyright ©1998-2005 NeuroGadgets Inc. ©2006 University of Queensland
