Human Proteome Folding - Phase 1 Thesis

This thesis covers the author's efforts to develop a method that uses ab initio protein structure prediction to detect distant homologs and uses those homologs to annotate proteins from the genome of Saccharomyces cerevisiae.

Paper Title:

"Genome-wide structural and functional protein characterization by ab initio protein structure prediction"

Abstract:

Very little is known about a considerable fraction of all proteins, and it is time-consuming and expensive to study each protein individually to determine its function, structure and cellular role. Proteins retain structural, functional and sequence characteristics from ancestral proteins, and hence two proteins that share a common ancestor, i.e. are homologs, will to some extent have similar sequence, structure and function. One way to learn something about a protein is to identify its homologs and use information from those homologs to annotate the protein of interest. Close homologs can be detected using sequence alone, but more distant homologs cannot. Structure is more conserved than sequence, so it enables detection of a common ancestor between more distantly related proteins and thereby the transfer of information to a larger fraction of the uncharacterized proteins. This thesis covers my efforts to develop a method that uses ab initio protein structure prediction to detect distant homologs and uses those homologs to annotate proteins from the genome of Saccharomyces cerevisiae.

The ab initio protein structure prediction software used in this thesis, Rosetta, can predict a protein’s tertiary structure using the amino acid sequence alone. Rosetta reduces the search space by approximating the local conformation with conformations from the Protein Data Bank (PDB), and judges the overall fitness of the simulated protein structure through a statistically derived energy function. The program has been successful in the last three Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments, and the results from the most recent CASP are reported in Paper I. Distant homologs can be detected by comparing the structures generated by Rosetta with structures from the PDB. In general, however, such a comparison is noisy, that is, it gives many answers, of which only a few are correct. The noise can be filtered out by exploiting the strong relationship between protein function and protein structure, either by using functional information from a database or by inferring functional information from one or more experimental high-throughput technologies. This idea was tested in Paper II, where 100 proteins were investigated using protein structure prediction, yeast two-hybrid, fluorescence microscopy and mass spectrometry. The data from all four technologies were integrated, and 77% of the proteins were assigned a function.
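
The noise-filtering idea can be illustrated with a minimal Python sketch. It is not the method used in the thesis: the class StructureMatch, the function filter_matches, the similarity scores and the function labels are all hypothetical, chosen only to show how independent functional evidence can prune a noisy list of structural matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureMatch:
    """One candidate hit from comparing a predicted model with known structures.

    All fields are hypothetical and exist only for this illustration.
    """
    template_id: str        # made-up identifier of a known structure
    similarity: float       # made-up structural similarity score, higher is better
    functions: frozenset    # function labels annotated on the known structure


def filter_matches(matches, evidence, min_similarity=0.0):
    """Keep matches whose annotated functions overlap the independent evidence.

    `evidence` is a set of function labels obtained from another source,
    for example a database or an experiment. Surviving matches are returned
    sorted by similarity, best first.
    """
    kept = [
        m for m in matches
        if m.similarity >= min_similarity and m.functions & evidence
    ]
    return sorted(kept, key=lambda m: m.similarity, reverse=True)


# Made-up example data: three structural matches, one line of functional evidence.
candidates = [
    StructureMatch("template_A", 0.62, frozenset({"kinase"})),
    StructureMatch("template_B", 0.71, frozenset({"transport"})),
    StructureMatch("template_C", 0.55, frozenset({"kinase", "signaling"})),
]
evidence = frozenset({"kinase"})

for match in filter_matches(candidates, evidence):
    print(match.template_id, match.similarity)
```

In this toy example the highest-scoring structural match (template_B) is discarded because nothing in the independent evidence supports its function, while the two lower-scoring matches that agree with the evidence are kept.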

Data integration is very labor-intensive when done by hand, and the amount of information generated for each protein investigated is substantial. To apply this technology on a genome-wide scale, everything needs to be automated and all data have to be stored and managed efficiently. Paper III and Paper IV cover information management, that is, how the data used and produced in the project are organized and stored. Paper V reports both how we automated the integration process using the software described in Papers I and II and the application of the technology to the genome of Saccharomyces cerevisiae.

Human Proteome Folding - Phase 1 Thesis - Lund University