About the Project

The Problem

It often seems as though humankind is in a state of conflict with the natural world. Pathogens are evolving resistance to many of today's important antibiotics. We are consuming many of Earth's valuable resources at an unsustainable rate, while pollution in the air and water threatens the health and livelihoods of many communities.

Fortunately, we are beginning to understand that nature may have already developed solutions to many of these problems, and they are hidden in plain sight: in forests, oceans and soils. For example, studies of exotic soil samples and plant extracts have revealed substances with the ability to kill particular kinds of disease-causing bacteria. We have found exotic tropical plants that show promise as efficient, sustainable fuel sources. Microorganisms have been used to clean water in sewage treatment plants and even help consume oil spills. Most of these discoveries were uncovered through time-consuming trial and error. If we could better understand the amazing range of natural powers, we might be able to speed up development of practical technologies and solutions.

One approach to identifying nature's hidden "superpowers" is to analyze the genetic makeup of different organisms to help us understand how they function. Traditionally, this has been a very expensive and time-consuming process, but in recent years scientists have developed more affordable and effective methods to decode DNA. The result is an explosion of genomic data from animals, plants and particularly microorganisms. After DNA has been decoded, scientists must conduct further studies to discover the function of each gene and its corresponding protein. Each gene specifies a sequence of amino acids, known as the protein sequence, which is assembled into a molecular chain and then folded into a protein molecule.
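As a minimal sketch of how a gene specifies a protein sequence, the Python snippet below reads a DNA coding sequence three letters (one codon) at a time and looks up the amino acid for each codon in the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions, not part of the project's software.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
# "*" marks a stop codon. Each amino acid is a one-letter code.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into a protein sequence.

    Reading starts at the first base and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example gene: ATG (M), GCC (A), AAA (K), TGA (stop).
print(translate("ATGGCCAAATGA"))  # prints MAK
```

Real genes are far longer, and finding where a gene starts and ends in raw DNA is a separate problem that this sketch does not address.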

Genes and their corresponding proteins play important roles in many life processes, and as a result, are often valuable in medicine and industrial applications. Some proteins are chemical factories, called enzymes, which can break down molecules into simpler components or help construct more complex molecules. Other proteins form the building blocks of all kinds of structures in plants and animals. Still other proteins play roles in controlling all kinds of activity in cells in response to various stimuli.

It is clear that there is a wealth of useful knowledge to be found by understanding what unknown genes and their corresponding proteins do. This knowledge might even help scientists solve many of the world's most pressing problems. However, there are two important challenges to this effort:

First, we are rapidly losing many valuable potential sources of DNA from diverse life forms. This is because many acres of unexplored pristine forests and water habitats are disappearing due to human development, climate change and other factors. We are losing the rich resources in nature that harbor valuable, yet hidden, solutions to the world's problems. We need more efficient and effective ways to discover what nature still has to tell us before it is too late.

Second, if we want to search for useful genes in unknown organisms, the scale of the task is staggering. Each organism may have thousands of genes, and there can be tens of thousands of organisms in even a small sample of water or soil. If we want to unlock nature's hidden powers, we need new methods to deal with the "big data" from the hundreds of millions of genes that are being decoded.

The Proposed Solution and Justification

Uncovering Genome Mysteries expects to examine close to 200 million genes from a wide variety of life forms, such as seaweeds from Australian coastlines and microbes found in Amazon River samples. Those genes are being compared against each other to assess their similarity. When two genes are similar, and the function of one gene is already known, scientists can make educated guesses about the function of the other gene. Comparing all of these genes against each other represents about 20 quadrillion (2 x 10^16) comparisons. The total computation time is projected to be the equivalent of one computer running continuously for 40,000 years. That is no small feat, but it is feasible thanks to the computational power of World Community Grid. While DNA sequences from all forms of life will be processed, microorganisms will receive a special focus.
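The figure of about 20 quadrillion comparisons is consistent with comparing every gene once against every other gene. The short Python check below works through that arithmetic. The all-against-all pairing and the derived per-second rate are assumptions made for illustration, and the two constants are the rounded figures quoted above.

```python
# Rounded figures quoted in the text above.
GENES = 200_000_000   # "close to 200 million genes"
CPU_YEARS = 40_000    # projected run time on a single computer

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Assumption: each gene is compared once with every other gene.
pairs = unordered_pairs(GENES)

# Average rate a single computer would need to sustain to finish
# all comparisons in the projected time.
rate = pairs / (CPU_YEARS * SECONDS_PER_YEAR)

print(f"pairwise comparisons: {pairs:.2e}")
print(f"implied comparisons per second on one computer: {rate:,.0f}")
```

Under these assumptions the pair count comes to roughly 2 x 10^16, and a single computer would need to average on the order of 16,000 comparisons per second for 40,000 years, which is why the work is spread across many volunteer machines.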

Modern DNA sequencing technologies can now rapidly determine millions of DNA sequences at reasonable costs. New technological breakthroughs are being developed to augment this capacity by several orders of magnitude. This will allow scientists to determine all DNA sequences hidden in the unseen microbial world. They have already been doing this for many medically and industrially important unicellular and multicellular organisms, animals, plants and human individuals over the last few years. Since the 1990s, genome analyses have concentrated on studying three kinds of organisms: "model organisms" in biology, because they had been studied for decades in laboratories (such as E. coli, yeast, helminths and mice); important human, animal and plant pathogens (like the bacteria that cause tuberculosis and leprosy, or crop pathogens); and finally, representative organisms in the "Tree of Life". More recently, many scientists around the world have started sequencing and analyzing metagenome data from many additional biomes, enriching our knowledge about biodiversity in air, land and sea, from the Arctic to tropical forests. From this work, a very complex picture of the diversity of living organisms on our planet is emerging.

Interpreting this huge and exponentially growing amount of DNA sequence data ("big data") is a daunting task. DNA sequence information is only meaningful and useful if it can be decoded and interpreted by comparing it to other gene sequences of known or unknown function, while mapping variations. This process is called "genome annotation." Decoding and annotation require vast amounts of computational power, and they are currently a major bottleneck in making sense of genomes that have already been sequenced.
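The idea of annotating a sequence by comparing it with sequences of known function can be sketched in a few lines of Python. The example below is only an illustration: it scores similarity as the overlap between the sets of short subsequences (k-mers) in two protein sequences, a simple stand-in chosen for brevity and not the project's comparison method. The sequences, function labels, and the 0.5 threshold are all made up.

```python
def kmers(sequence: str, k: int = 3) -> set:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def annotate(query: str, known: dict, threshold: float = 0.5):
    """Suggest a function for `query` from its most similar known sequence.

    `known` maps a function label to a protein sequence. Returns
    (label, score), with label set to None when no known sequence
    reaches the (hypothetical) similarity threshold.
    """
    best_label, best_score = None, 0.0
    for label, sequence in known.items():
        score = similarity(query, sequence)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Made-up reference sequences with hypothetical function labels.
known_proteins = {
    "hypothetical enzyme A": "MKTAYIAKQRQISFVKSHFSRQ",
    "hypothetical transporter B": "MLLPGWWAVVLSTTNDCE",
}

# A query that differs from enzyme A by a single amino acid.
label, score = annotate("MKTAYIAKQRQISFVKSHFSRE", known_proteins)
print(label, round(score, 2))
```

A suggestion produced this way is an educated guess rather than a confirmed function, and the cost of scoring every sequence against every other one grows with the square of the number of sequences.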

The Uncovering Genome Mysteries project aims to harness the computational power of World Community Grid to give biological meaning to gene-sequencing data available for microorganisms and other life forms. This will be done by comparing individual microbial genomes as well as the genetic information of entire microbial communities in the environment (metagenomes). Decoding genomes and metagenomes will provide new information on the diversity of microorganisms and the functional roles they play in the environment. Comparison of this information with known functional data from other organisms already studied in greater detail will be crucial for the interpretation and annotation of the DNA codes.

Project Goals

The specific goals of the Uncovering Genome Mysteries project are:

  • To create a database of protein sequence comparison information, based on DNA from diverse sources, for all scientists to reference.
  • To discover new gene functions, augmenting our knowledge about biochemical processes in general.
  • To find how organisms interact with each other and with their environment.
  • To document the current baseline microbial diversity, allowing us to understand how microorganisms change under environmental stresses, such as climate change.
  • To better understand and model complex microbial systems.

While the immediate computational results of this project are only an early step in achieving the above goals, they will ultimately be useful in many ways. For example, the resulting knowledge should help identify, design and produce new antibiotics and drugs against chronic diseases, as well as new enzymes for industrial applications, such as food processing, chemical synthesis or the production of green plastics or biofuels. In the long term, this knowledge should help us manage the important functions that diverse organisms perform in the world's ecosystems, in all environments, in industrial settings, and in human, animal and plant interactions.