Big Data and Big Plans: Next Steps for Uncovering Genome Mysteries

World Community Grid’s role in the Uncovering Genome Mysteries project has ended, but the research team’s work continues as they analyze the results of the calculations and prepare to put the data to use in medicine, agriculture, and other real-world applications.

Background

The Uncovering Genome Mysteries project began on World Community Grid in November 2014, with the aim of analyzing protein sequences to help understand how organisms function and interact with each other and the environment. The project began with 120 million predicted protein sequences from close to 150,000 organisms. These protein sequences and organisms represent a wide variety of known and uncharacterised life forms in our biosphere. They came from organisms in samples taken from a range of environments, including water and soil, as well as on and inside plants and animals. In addition, 70 million sequences derived from a prospective analysis of genetic information from microbial marine ecosystems in Australia were added, with the aim of helping to identify the possible functions of these sequences. In July 2015, we added another 20 million newly predicted protein sequences.

Thanks to the enthusiastic contributions of more than 76,000 World Community Grid volunteers, all of these protein sequences were analyzed in approximately 24 months.

Uncovering Genome Mysteries has been a challenging and ambitious project. Analyzing all the predicted enzymes and other proteins encoded in the genetic information known thus far from all the organisms and life forms in our biosphere is a large task. Because new sequencing technologies allow the genetic code to be determined quickly and cheaply, additional basic information will become available at an accelerating rate, making it increasingly difficult to perform such a complete comparative analysis in the future.

Our daunting task of performing close to 100 quadrillion comparisons has now been completed. The resulting data is more than 30 terabytes of compressed information (more than 150 terabytes uncompressed), even though each comparison only resulted in a single line of numbers for only the very highest probability similarities between protein sequences.

Results to Date and Plans for the Future

So, what is next? The research team at Fiocruz has spent the last year designing and testing new algorithms to transform the output of the comparisons into distance calculations between the genomes of the organisms included. The scientific literature describes many different ways to do this, depending on the purpose of the analysis and the views on evolutionary biology.

The results of the Uncovering Genome Mysteries project can be summarized as follows:

More complete and precise information is now available on the structure and function of proteins encoded by living organisms in our biosphere. More proteins are being studied and experimented with each day in thousands of laboratories around the world, and by using results from the comparisons performed through the project, functional parallels can be drawn for proteins that show structural similarity between organisms. This is particularly valuable when predicted protein fragments from uncharacterised organisms are compared, for example in environmental and ecology studies such as those originating from the laboratory of co-investigator Dr. Torsten Thomas and his team from the Centre for Marine Bio-Innovation & the School of Biological, Earth and Environmental Sciences at the University of New South Wales, Sydney, Australia. The resulting database with these functional annotations will be made publicly available as the next version of our protein comparison database, ProteinWorldDB, in the coming months.

Through comparison, new protein functions are discovered that can have medical, agricultural, technological or industrial applications. These may take the form of new biopharmaceuticals, bioinsecticides, or enzymes for the biodegradation of waste or the production of chemicals. They are especially valuable when they form part of new biochemical pathways in cells that help laboratories develop new green chemistry or energy production, or the biosynthesis and transformation of new drugs. This also adds to the growing body of knowledge in biotechnology and synthetic biology.

The group at Fiocruz has developed new ways to compare genomes from different organisms. Traditionally, such analyses consider what is conserved between genomes, resulting in distance calculations that are used for phylogenetic studies and the estimation of evolutionary relationships between organisms. However, we feel that this is only part of the picture, so the Fiocruz team designed a new algorithm that also takes differences into account. This was coupled with a new visualization method for such comparisons, resulting in a markedly faster way to add new data to the picture. We hope that this method will enable us to keep track of data from new organisms as it becomes available, adding results to the growing ProteinWorldDB database.
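
The Fiocruz algorithm itself is not described in this article, so the following is only an illustrative sketch of the general idea that a genome distance can reflect what two organisms do not share as well as what they do. It uses the standard Jaccard distance as a stand-in, and it assumes each genome has already been reduced to a set of protein-family labels. The function names, organism names, and family labels are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(families_a, families_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of protein-family labels.

    Shared families (the intersection) pull the distance toward 0, while
    families found in only one genome (the rest of the union) push it
    toward 1, so both similarities and differences affect the result.
    """
    a, b = set(families_a), set(families_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty genomes count as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes):
    """Compute pairwise Jaccard distances for a dict of name -> family set."""
    return {
        (x, y): jaccard_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }


# Hypothetical toy data, not taken from the project.
genomes = {
    "organism_x": {"kinase", "protease", "transporter", "polymerase"},
    "organism_y": {"kinase", "protease", "polymerase", "lipase"},
    "organism_z": {"lipase", "chitinase"},
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(genomes).items():
        print(pair, round(dist, 2))
```

In this toy example, organism_x and organism_y share three of five families and end up close together, while organism_x and organism_z share none and sit at the maximum distance of 1. A real analysis would start from the similarity scores produced by the project's comparisons rather than from hand-written labels.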

Thank you to all World Community Grid volunteers who supported this project. We plan to keep in touch as we have further news about our ongoing research.
