Research update from the Microbiome Immunity Project team (May 2023)

The Microbiome Immunity Project team has published a new article about the structure and function of human microbiome proteins.


While bacteria can be harmful to humans, causing diseases such as pneumonia, meningitis, strep throat, and food poisoning (Escherichia coli and Salmonella), bacteria in the human gut also have protective functions. For many years, scientists have studied the various types of bacteria in the human body and characterized those that could be harmful or useful, but the vast majority remain poorly characterized. With an estimated 3 million unique bacterial genes known to exist, studying all of the resulting proteins and determining their functions is a complex task.

The Microbiome Immunity Project (MIP) started in August 2017 with the goal of speeding up protein research through the power of grid computing. By the time the computation finished in December 2021, World Community Grid (WCG) volunteers had donated nearly 146,000 CPU years to the MIP. This effort enabled the team to predict the structure of almost 200,000 proteins, discover over 150 new protein shapes (folds), describe several previously unknown functions of protein structures, and nearly double the number of annotated proteins in the human gut microbiome.

Protein Universe Paper

The MIP team has recently published a paper in Nature Communications titled Sequence-structure-function relationships in the microbial protein universe[1]. The article explores the notion that proteins with similar sequences do not necessarily adopt similar structures that perform the same functions, contrary to a long-held belief among scientists. Using the MIP database that WCG volunteers helped create, the paper explores examples where proteins that are similar in sequence perform different functions.

The MIP team analyzed 2 million protein sequences that had no known structure in any other database. They predicted the protein structures with the Rosetta and DMPfold methods, then filtered out low-quality predictions in a three-step process. First, they determined a threshold fraction of coil residues (residues that form neither helices nor strands) above which the structures were not reliable. Second, they used confidence and quality assessment scores to estimate the quality of each prediction[2]. Third, priority was given to models on which the two methods agreed. In the end, about 200,000 models were identified and characterized.
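The three filtering steps can be pictured with a minimal Python sketch. It is an illustration only, not the team's actual pipeline: the Model class, the select_model function, and all three cutoff values are hypothetical, and the "agreement" value stands in for whatever structural similarity score is computed between the Rosetta and DMPfold models of the same protein.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """One predicted structure (hypothetical record, for illustration)."""
    protein: str
    method: str           # "rosetta" or "dmpfold"
    coil_fraction: float  # fraction of residues in coil, 0..1
    quality: float        # predicted quality score, 0..1


# Hypothetical cutoffs; the published thresholds are not given in this update.
MAX_COIL_FRACTION = 0.6
MIN_QUALITY = 0.4
MIN_AGREEMENT = 0.5


def passes_filters(model: Model) -> bool:
    """Steps 1 and 2: reject coil-heavy or low-scoring models."""
    return (model.coil_fraction <= MAX_COIL_FRACTION
            and model.quality >= MIN_QUALITY)


def select_model(rosetta: Model, dmpfold: Model, agreement: float):
    """Step 3: keep the best surviving model and flag method agreement.

    Returns (chosen model or None, prioritized flag). A protein is
    prioritized only when both models survive and they agree.
    """
    candidates = [m for m in (rosetta, dmpfold) if passes_filters(m)]
    if not candidates:
        return None, False
    best = max(candidates, key=lambda m: m.quality)
    prioritized = len(candidates) == 2 and agreement >= MIN_AGREEMENT
    return best, prioritized
```

Calling select_model with two models of the same protein and their agreement score returns the model to keep, plus a flag showing whether both methods support it.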

Figure 1. Flowchart of the process to arrive at ~200,000 de novo protein models covering a diverse sequence space[1]. Reused from Koehler Leman et al., Nature Communications, 2023, with permission, under the Creative Commons license CC BY 3.0.

This new database provides a unique view into the gut microbiome, as it differs in coverage and scope from previous efforts. The researchers compared the set of 200,000 models to the PDB90 database from the Protein Data Bank and to other protein structure databases to determine that the predicted structures were novel. One factor behind the difference between the databases is the presence in the MIP database of proteins from Archaea and Bacteria, organisms that are poorly represented in other databases. Moreover, the proteins analyzed are shorter than those in other databases (the predicted structures ranged from 40 to 200 residues in size), and the set is less biased towards proteins of interest than PDB90, which contains proteins that are more amenable to structure determination or are possible pharmaceutical targets, leading to multiple close variants of the same structures. This makes the MIP database complementary to other databases.

To test whether these differences affected the sequence-structure-function relationship, the researchers analyzed the structural and functional similarities of 5,000 structures from the MIP database and 1,000 baseline structures from the PDB. When correlating structural and functional similarity, the majority of pairs showed the expected behavior (i.e., different structure, different function or same structure, same function), but a notable number of pairs behaved contrary to expectations. Analyzing the discordant pairs, the authors found that more generic functions can be performed by multiple types of structures, while very specific mechanisms are carried out only by unique structures.
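The idea of concordant and discordant pairs can be made concrete with a short Python sketch. It is a simplified illustration, not the analysis used in the paper: the function names, the 0.5 cutoffs, and the assumption that both similarities are already scaled to the range 0 to 1 are all hypothetical.

```python
from collections import Counter


def classify_pair(structural_sim: float, functional_sim: float,
                  struct_cut: float = 0.5, func_cut: float = 0.5) -> str:
    """Label one protein pair by its structural and functional similarity.

    Both cutoffs are hypothetical; scores at or above a cutoff count as "same".
    """
    same_structure = structural_sim >= struct_cut
    same_function = functional_sim >= func_cut
    if same_structure and same_function:
        return "same structure, same function"            # expected
    if not same_structure and not same_function:
        return "different structure, different function"  # expected
    if same_structure:
        return "same structure, different function"       # discordant
    return "different structure, same function"           # discordant


def summarize(pairs) -> Counter:
    """Count how many (structural_sim, functional_sim) pairs fall in each class."""
    return Counter(classify_pair(s, f) for s, f in pairs)
```

Passing a list of (structural similarity, functional similarity) tuples to summarize gives a count per category, which makes it easy to see how many pairs fall into the two discordant classes.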

Figure 2. Heatmaps that show the functional similarity between pairs of protein clusters. Clusters 158 and 153 (top left and right) cover proteins with similar structures (bottom left and right) and diverging functions[1]. Reused from Koehler Leman et al., Nature Communications, 2023, with permission, under the Creative Commons license CC BY 3.0.

This study paves the way for the development of tools to explore and predict site-specific protein functions in other organisms as well, in order to better understand how specific structures and functions relate to biological processes.

“Until now, we have been talking about the microbiome in the same language one would use to describe the biodiversity of a rainforest,” Dr. Tomasz Kosciolek, a member of the MIP research team, says. “We hope to start talking about it in more mechanistic terms, like what molecules might amplify, inhibit or change certain biological processes.”

Thank you to the Microbiome Immunity Project team for providing this update. If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your continued support that accelerates large-scale scientific research for the betterment of humanity.

WCG team


  1. Koehler Leman, J., Szczerbiak, P., Renfrew, P.D. et al. Sequence-structure-function relationships in the microbial protein universe. Nature Communications 14, 2351 (2023).
  2. Zhang, Y., Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57(4), 702-710 (2004). doi: 10.1002/prot.20264. Erratum in: Proteins 68(4), 1020 (2007). PMID: 15476259.