Analysis Underway on 30 Terabytes of Data from the Uncovering Genome Mysteries Project


The Uncovering Genome Mysteries data (all 30 terabytes) was transferred to the research teams in Brazil and Australia this year. The researchers are now analyzing this vast dataset and looking for ways to make it easy for other scientists and the public to understand.



Background

Last year, World Community Grid volunteers completed the calculations for the Uncovering Genome Mysteries project, which examined approximately 200 million genes from a wide variety of life forms to help discover new protein functions. The project’s main goals include:

  • Discovering new protein functions and augmenting knowledge about biochemical processes in general
  • Identifying how organisms interact with each other and the environment
  • Documenting the current baseline microbial diversity, allowing a better understanding of how microorganisms change under environmental stresses, such as climate change
  • Understanding and modeling complex microbial systems

Transferring 30 Terabytes of Data

The data generated by World Community Grid volunteers has been consolidated on the new bioinformatics server at the Oswaldo Cruz Foundation (Fiocruz), under the direction of Dr. Wim Degrave. Additionally, a full copy of all data has been sent to co-investigator Dr. Torsten Thomas and his team from the Centre for Marine Bio-Innovation & the School of Biological, Earth and Environmental Sciences at the University of New South Wales in Sydney, Australia. There, the results from the protein comparisons will help interpret analyses of marine bacterial ecosystems, where microorganisms, coral reefs, sponges, and many other intriguing creatures interact and form living communities. The dataset, more than 30 terabytes in highly compressed form, took a few months to transfer from Brazil to Australia.
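To give a sense of scale, the short Python sketch below estimates the average sustained rate such a transfer would need. It is a back-of-the-envelope calculation under stated assumptions, not a description of how the data was actually moved: the article does not say whether the transfer ran over a network link, "a few months" is taken here as 60, 90, or 120 days, and terabytes are treated as decimal (10^12 bytes). The function name is hypothetical.

```python
# Rough estimate only. The durations below are assumptions, since the
# article says just "a few months" and gives no transfer method.
SECONDS_PER_DAY = 86_400


def average_throughput_mbit_per_s(size_terabytes: float, days: float) -> float:
    """Average rate in megabits per second needed to move
    size_terabytes (decimal, 10**12 bytes each) in the given number of days."""
    total_bits = size_terabytes * 10**12 * 8
    return total_bits / (days * SECONDS_PER_DAY) / 10**6


if __name__ == "__main__":
    for days in (60, 90, 120):  # assumed readings of "a few months"
        rate = average_throughput_mbit_per_s(30, days)
        print(f"{days} days: about {rate:.1f} Mbit/s on average")
```

Under these assumptions the required average rate is in the tens of megabits per second, which helps explain why moving the full dataset took months rather than days.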

Data Processing and Analysis at Fiocruz

The Fiocruz team has been busy with further processing of the project's primary output. In the workflow, raw data are expanded and deciphered, associated with the correct inter-genome comparisons, checked for errors, tabulated, and linked to many different data objects to turn them into meaningful information.

The team is dealing with the rapidly growing size of the database, and has purchased and installed new hardware (600 TB) to help accommodate all the data. They also wish to build a database interface that appeals to members of the general public who are interested in biodiversity, and not only to scientists who specialize in functional analysis of the proteins encoded in the genomes of particular life forms.

Some of the data are currently being used in projects such as vaccine and drug design against arboviruses, including the Zika, dengue, and yellow fever viruses. The data are also being used to understand how bacteria interact with their environment and how this is reflected in their metabolic pathways, by comparing free-living bacteria with close relatives that are human pathogens, for example environmental mycobacteria versus Mycobacterium tuberculosis.

Searching for Partnerships

Fiocruz is looking for partnerships that would add extra data analytics and artificial intelligence capabilities to the project. The researchers would like to include visualizations of functional connections between organisms, as well as the particularities of a wide variety of organisms. These include deep-sea thermal vent archaea; bacteria and protists (any one-celled organism that is not an animal, plant or fungus) that live in soil, water, land, and sea, or that are important for human, animal, or plant health; and highly complex plant, animal, and human genomes.

We thank everyone who participated in the World Community Grid portion of this project, and look forward to sharing more updates as we continue to analyze the data.

