Genome Comparison Overview


Project Status and Findings: Information about this project is provided on the web pages below and by the project scientists on the Genome Comparison website. For the latest status report, please go to the Genome Comparison status report. To comment or ask questions about this project, please submit a post in the Genome Comparison Forum.

Improving protein functional annotation in databases

For years, scientists have compared gene sequences from different organisms to determine whether they are similar. Gene sequences that are similar may also have similar functions. Thus, a scientist studying a gene sequence of unknown function may get important clues about the role it plays in an organism by comparing it to a similar gene sequence of known function in another organism.

The challenge is that as scientists discover new pieces of information, they enter them into one of a number of databases that hold information on gene sequences. Over the years, a large body of secondary information (structural and functional data, similarities to other entries and a variety of cross-references) has been attached to protein database entries. Once such information is entered, it rarely gets updated or corrected. Thus, annotation of predicted protein function is often incomplete, uses non-standardized nomenclature, or is incorrect because it was inferred from earlier, incorrectly annotated sequences. Additionally, many proteins are composed of several structural and/or functional domains (modules comprising distinct evolutionary, functional and structural units), which automated annotation procedures can overlook.

The main objective of the Genome Comparison project is to perform, for the first time, a complete pairwise comparison of all predicted protein sequences. The resulting similarity indices, together with the standardized Gene Ontology (www.geneontology.org/), will serve as a reference repository for the annotator community and as a valuable data source for biologists. The sequence similarity comparison program used in the Genome Comparison project is called SSEARCH (W.R. Pearson [1991] Genomics 11:635-650), a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences. As scientists sequence new genomes from additional organisms, they can add those sequences to this database and compute the comparisons, contributing new information for other scientists.
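To make the idea of a best local alignment score and an all-against-all comparison concrete, the following is a minimal sketch in Python. It is not the SSEARCH program and does not reproduce its scoring. It assumes a simplified scheme (a fixed match score, a fixed mismatch penalty and a linear gap penalty), whereas real protein comparisons typically use a substitution matrix and affine gap penalties. The function names smith_waterman_score and all_pairs_scores and the default score values are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumption: simplified scoring with fixed match/mismatch values and a
    linear gap penalty, not a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_pairs_scores(sequences):
    """Score every unordered pair of named sequences.

    For n sequences this performs n * (n - 1) / 2 comparisons.
    """
    items = sorted(sequences.items())
    return {
        (name_a, name_b): smith_waterman_score(seq_a, seq_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(items, 2)
    }
```

For example, all_pairs_scores({"p1": "ACGT", "p2": "ACGT", "p3": "TTTT"}) compares three pairs. Because the number of pairs grows roughly with the square of the number of sequences, a complete comparison of all predicted proteins is a very large computation.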

As a result, it will be possible to annotate proteins precisely, correct inconsistencies, and assign possible functions to hypothetical proteins of unknown function. Moreover, proteins with multiple domains and functional elements can be correctly identified, and even distant relationships can be detected. This will improve the quality and interpretation of biological data and our understanding of biological systems, host-pathogen interactions and environmental interactions.