Fiocruz Genome Comparison

Genome Comparison Project: A Layperson's Explanation

Genes, genomes and genomic data

An organism's genome is its complete set of genetic information. The genome is made up of genes, the hereditary units in all living organisms. Genes are responsible for the physical development, the metabolism and (to some extent) the behavior of those organisms. Most genes encode the proteins that largely dictate the biochemical reactions carried out by cells. Other genes produce very important RNA molecules or do not encode any molecule at all, but are important from a structural or regulatory point of view.

Computer analysis has predicted which regions of the genome encode proteins (from several hundred or a few thousand proteins in bacteria to about 30,000 proteins and their variants in humans). However, the prediction of the cellular functions of those derived proteins (structural, enzymatic, transport and signaling functions, etc.) is mostly hypothetical. The vast majority of probable functions have been attributed by in silico (computer) analysis, using sequence comparison with proteins in databases. Thus far, only a small fraction of predicted proteins have had their functions confirmed by laboratory experiments.

Since the 1990s, international efforts have led to the determination of the complete genetic code of more than 400 organisms (www.genomesonline.org), including bacteria, yeasts, protozoan parasites, plants, invertebrates and vertebrates, among them Homo sapiens. More than 1,500 genome investigations are currently ongoing, representing medical, commercial, environmental and industrial interests or important research models. These investigations have already identified parts of the genome sequences concerned. New genome sequences are becoming available at an ever faster pace, adding to the fragmentary data available from thousands of organisms.

Protein coding genes and their annotation

Release 19 of RefSeq (September 2006), a reference sequence collection (www.ncbi.nlm.nih.gov/RefSeq), registers more than 2.8 million predicted protein coding genes from 3,774 organisms, including viruses. Most putative protein encoding genes and their associated protein sequences have been identified, and functionally annotated, using bioinformatics tools and database comparisons. (Functional annotation is the assignment of predicted biological functions and structural features to raw sequence data.) Such structural and functional annotation has been building up over the years, based on cross-referencing between the growing databases. Several efforts are under way to construct a carefully verified reference set of proteins whose attributed functions have been experimentally confirmed, using a reference nomenclature for gene, protein and cellular function (the Gene Ontology, GO [www.geneontology.org]) and standardized annotation rules. Such a database does not yet exist.

In addition, over the years, a large body of secondary information (structural, functional, similarities to other entries and a variety of cross-references) has been added to protein database entries. Once such information is entered, it rarely gets updated or corrected. Thus, annotation of predicted protein function is often incomplete, uses non-standardized nomenclature or can be incorrect when inferred from previous, incorrectly annotated sequences. Additionally, many proteins are composed of several structural and/or functional domains (modules comprising distinct evolutionary, functional and structural units), which can be overlooked by automated annotation procedures.

The Genome Comparison Project: Improving protein functional annotation in databases

The main objective of the Genome Comparison Project is to perform, for the first time, a complete pairwise comparison between all predicted protein sequences. The resulting similarity indices will be used, together with the standardized Gene Ontology (www.geneontology.org/), as a reference repository for the annotator community, providing an invaluable data source for biologists. The sequence similarity comparison program used in the Genome Comparison Project is SSEARCH (W.R. Pearson [1991] Genomics 11:635-650), a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences.
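
For readers curious about what a local alignment score is, the following is a minimal sketch of the Smith-Waterman scoring recurrence in Python. It is not the SSEARCH program and does not reproduce its scores: the function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration only, whereas real protein comparison tools use amino acid substitution matrices and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only (assumed values): identical letters score
    `match`, different letters score `mismatch`, and each gap position
    costs `gap`.
    """
    # prev holds the previous row of the scoring table; row 0 is all zeros.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (letters paired with each other).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or open/extend a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # The 0 lets an alignment restart anywhere: this is what makes
            # the alignment "local" rather than end-to-end.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The two toy sequences share the stretch "KVLA".
    print(smith_waterman_score("MKVLAT", "GGKVLAGG"))
```

With these assumed scores, the two toy sequences share the four-letter stretch KVLA, so the best local score is 4 x 2 = 8, regardless of the unrelated letters on either side. The table is filled cell by cell, which is why the cost of one comparison grows with the product of the two sequence lengths.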

As a result, precise annotation, correction of inconsistencies, and assignment of possible functions to hypothetical proteins of unknown function will be possible. Moreover, proteins with multiple domains and functional elements will be correctly spotted. Even distant relationships will be detected.

The biological systems within a cell are of great complexity, and our understanding of the whole protein content of a cell, protein interactions, biochemical pathways and their regulation is only very partial. A database reflecting all the primary sequence relationships between the corresponding proteins from all known organisms at the genomic level will be invaluable to improve our understanding of this complexity.

Additionally, the database will benefit many experimental approaches to the analysis of the biodiversity on our planet. Scientists investigating environmental samples or fragmentary data from new organisms will be able to use the results of the Genome Comparison analysis to investigate different aspects of the genetics and biochemistry of these organisms. Moreover, the description and analysis of evolutionary relationships between proteins (and microorganisms) based on such genome analysis will be a major step towards understanding the evolution of genome structure and the biochemical and structural organization of organisms. Large-scale initiatives such as the description of the Tree of Life and the cataloging of biodiversity will greatly benefit from the Genome Comparison database.

New drugs, vaccines and diagnostics

Scientific research and (bio)technological development based on genomics are making increasing progress towards new diagnostics, as well as the development of new drugs and vaccines. Comparative genomics and the knowledge of biochemical pathways and cellular processes are of utmost importance in this field. In addition, functional analysis and protein interaction studies are key to understanding how microorganisms, cells in a multi-cellular organism, and pathogens interact with their environment (and/or hosts). This opens the way to the design of new control strategies for infectious and parasitic diseases, as well as metabolic and chronic or degenerative diseases.

World Community Grid and genome functional annotation

Stringent pairwise sequence comparisons are computing-intensive operations, and an all-against-all comparison of predicted proteins from all genomes completely sequenced to date is a task almost impossible to achieve without support from World Community Grid's very large grid structure. The resulting information matrix will form an invaluable database that can be extended continuously as new genome sequences become available, and it will provide the basic material for many functional studies within the scientific community at large.
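
To give a rough sense of scale, the short Python sketch below counts how many distinct pairs exist among n sequences, using the figure of 2.8 million predicted proteins quoted earlier for RefSeq Release 19. Whether the project compared exactly that set, and whether it computed each pair only once, are assumptions made here for illustration. The function name is hypothetical.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Assumed input size: the 2.8 million predicted proteins cited for RefSeq.
    n_proteins = 2_800_000
    print(f"{pair_count(n_proteins):,} pairwise comparisons")
```

Under these assumptions the count is 3,919,998,600,000, close to four trillion comparisons. Because the number of pairs grows roughly with the square of the number of sequences, doubling the protein set roughly quadruples the work.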
