Genome Comparison



What is Genome Comparison?

Genome Comparison is a project of the Bioinformatics Team at the Department of Biochemistry and Molecular Biology of Fiocruz. It used the computing power of World Community Grid to calculate the level of sequence similarity among all the proteins encoded in the completely sequenced genomes of hundreds of organisms, including humans and several other species of medical, commercial, industrial, or research importance. The calculated similarity indices will be used, together with the standardized Gene Ontology, as a reference repository for the annotator community, providing a valuable data source for biologists.



Why did the Genome Comparison project compare protein sequences?

Only a fraction of the predicted proteins encoded in completely sequenced genomes have had their biological function and expression confirmed through laboratory analysis. The assignment of predicted biological functions and structural features to raw sequence data is called annotation. It is accomplished mostly by comparing the new sequences to predicted proteins or protein-coding genes whose information is stored in public domain databases around the world. However, annotation is often incomplete, uses non-standardized nomenclature, or can be incorrect when it is inferred from sequences that were themselves annotated incorrectly. Thus, a controlled all-against-all comparative database would be of great use as a reference.



How were proteins compared in the Genome Comparison project?

Biological sequences (DNA, RNA, and proteins) are mostly compared in pairs through a process called pairwise sequence alignment, which consists of putting two sequences side by side in such a way that the number of identical positions between them is maximized. The sequences can be aligned globally (over their whole length) or locally (over parts of the sequences), depending on the context and the purpose. The sequence similarity comparison program used in the Genome Comparison project is called SSEARCH (W.R. Pearson [1991] Genomics 11:635-650). It is a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences. An algorithm is an organized procedure for performing a given type of calculation or solving a given type of problem.
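To make the idea of a best local alignment more concrete, the following is a minimal sketch of Smith-Waterman scoring in Python. It is not the SSEARCH code. The function name and the scoring values (a fixed match/mismatch score and a linear gap penalty) are assumptions chosen for illustration; real protein comparison tools typically use an amino acid substitution matrix and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: fixed match/mismatch values and a linear
    gap penalty (assumed here, not taken from the project).
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0                    # highest cell value seen so far
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared segment "ACGT" is found even though the flanks differ.
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The zero in the `max` call is what makes the alignment local: a poorly matching region can never drag the score below zero, so the best-scoring matching segment is found wherever it lies in the two sequences.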



What are the potential benefits of the Genome Comparison Project?

  • The resulting all-against-all comparative database will be of great use as a reference for many research projects on functional aspects, biochemical pathways, and evolutionary aspects, and will be an invaluable source for correct annotation of previously sequenced and newly obtained genome sequences
  • Precise annotation, assignment of possible functions to hypothetical proteins of unknown function, and the description of evolutionary relationships between proteins will be a major step forward towards our understanding of genome composition, genome evolution and cellular function
  • The contribution to the understanding of host-pathogen relationships, and the means to develop new drugs and vaccines, will be of utmost benefit to the scientific community at large
  • Research on biodiversity and new organisms will greatly benefit from reliable comparative data
  • Future new sequence releases will build upon the growing cross-referenced database



What is the status of the Genome Comparison project?

The Genome Comparison project was completed in July 2007. You may read about the Genome Comparison project here. Findings from the Genome Comparison research scientists will be posted here.



How did the Genome Comparison software work?

The software automatically downloaded small pieces of data (predicted protein sequences) and performed sequence comparisons to accurately calculate the level of similarity among them. After the information was processed by members' computers, the results were sent by World Community Grid to Fiocruz, where they are being analyzed by the Bioinformatics Team at the Department of Biochemistry and Molecular Biology. Large-scale comparative analysis with the Smith-Waterman algorithm is computationally intensive and demanded an exceptionally large amount of computing power, which is why it was a perfect project for World Community Grid.
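The way an all-against-all comparison can be cut into small, independent pieces of work can be sketched as follows. This is only an illustration under stated assumptions: the function names, the chunk size, and the idea of pairing chunks of protein identifiers are hypothetical, and the source does not describe how World Community Grid actually packaged its work units.

```python
def make_work_units(protein_ids, chunk_size):
    """Split an all-against-all comparison into independent work units.

    Hypothetical scheme: the protein list is cut into chunks, and each
    work unit pairs chunk i with chunk j (i <= j), so that every pair of
    proteins is covered by exactly one unit.
    """
    chunks = [protein_ids[k:k + chunk_size]
              for k in range(0, len(protein_ids), chunk_size)]
    units = []
    for i in range(len(chunks)):
        for j in range(i, len(chunks)):
            units.append((chunks[i], chunks[j], i == j))
    return units


def pairs_in_unit(unit):
    """List the protein pairs one volunteer computer would compare."""
    left, right, same_chunk = unit
    if same_chunk:
        # Within a single chunk, visit each unordered pair once
        # (self-comparisons included).
        return [(left[a], left[b])
                for a in range(len(left))
                for b in range(a, len(left))]
    return [(x, y) for x in left for y in right]


if __name__ == "__main__":
    ids = ["p1", "p2", "p3", "p4", "p5"]   # placeholder identifiers
    for unit in make_work_units(ids, chunk_size=2):
        print(pairs_in_unit(unit))
```

Because each unit depends only on its own two chunks, units can be processed in any order and on any machine. That independence is the property that makes this kind of workload a natural fit for a volunteer computing grid.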