This paper describes the innovations and performance improvements used to process the results data from the Nutritious Rice for the World project. It was published online on March 17, 2014 in the journal Bioinformatics.
"fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data"
Lay Person Abstract:
The Nutritious Rice for the World project used World Community Grid to predict the structures of proteins in a number of genomes of rice. This is the first step in understanding the functions of those proteins and how they play a role in better nutrition and resistance to environmental stresses such as drought, pests and disease. The project has been processing the results computed using World Community Grid. To obtain more accurate structure predictions, various methods were used to combine information from groups of similar protein models to identify the most accurate models.
The Nutritious Rice for the World project generated sets of protein models that were too large to be clustered into subsets by existing techniques, which were too slow and used too much memory. New algorithms were devised that are orders of magnitude faster. The new software also exploits GPUs, multi-core CPUs, and on-chip co-processors to make the structural comparisons faster. The new methods use less memory and result in significantly more accurate structure predictions. The software and source code have been made available for use by the protein folding community.
Using the new software, the entire Nutritious Rice for the World dataset was analyzed in 6 weeks.
Technical Abstract:
Motivation: fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project.
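To give a sense of why memory matters at this scale, the following is a rough back-of-the-envelope sketch, not a description of how fast_protein_cluster stores its data. It assumes a naive approach in which every pairwise distance within the largest set (550 000 models) is kept once as a 4-byte value in a triangular matrix. The function names `pair_count` and `triangular_matrix_bytes` are hypothetical and used only for illustration.

```python
# Back-of-the-envelope memory estimate for a naive all-against-all comparison.
# Assumption (not from the paper): each pairwise distance is stored exactly
# once as a 4-byte float in a triangular matrix.


def pair_count(n_models: int) -> int:
    """Number of unordered pairs among n_models models."""
    return n_models * (n_models - 1) // 2


def triangular_matrix_bytes(n_models: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one value per unordered pair."""
    return pair_count(n_models) * bytes_per_value


if __name__ == "__main__":
    largest_set = 550_000  # upper bound on models per set quoted in the abstract
    pairs = pair_count(largest_set)
    size_gb = triangular_matrix_bytes(largest_set) / 1e9  # decimal gigabytes
    print(f"{pairs:,} pairs, about {size_gb:.0f} GB")
```

Under these assumptions the largest set alone would need about 151 billion pairwise values, or roughly 605 GB, which is one way to see why a memory-efficient approach is needed.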
Results: fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco.
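As an illustration of the two metrics named above, here is a minimal sketch that evaluates RMSD and a TM-score-style sum from per-residue distances. It assumes the two models have already been superposed and that the distances between matched residues are supplied as a plain list; finding the optimal superposition, which is the expensive step, is deliberately left out. The TM-score proper is the maximum of this sum over superpositions, so the function below scores one given superposition only. The distance scale uses the standard formula d0 = 1.24 (L - 15)^(1/3) - 1.8, and the lower clamp of 0.5 for short targets is a convention assumed here, not something taken from the paper. All function names are hypothetical.

```python
import math
from typing import Sequence


def rmsd(distances: Sequence[float]) -> float:
    """Root mean square of per-residue distances (superposition already applied)."""
    if not distances:
        raise ValueError("at least one distance is required")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale used by TM-score.

    Assumption: values below 0.5 (very short targets) are clamped to 0.5.
    """
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score-style sum for ONE fixed superposition.

    distances holds one entry per aligned residue pair; residues of the
    target that are not aligned simply contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    example = [0.5, 1.0, 2.0, 4.0] * 25  # 100 hypothetical distances
    print(f"RMSD     = {rmsd(example):.3f}")
    print(f"TM-score = {tm_score(example, target_length=100):.3f}")
```

The contrast between the two functions shows why both metrics are useful: RMSD weights large deviations heavily, while each term of the TM-score sum is bounded between 0 and 1, so a few badly placed residues cannot dominate the result.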
Availability and implementation: fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license.