OP-11 Predicting protein stability changes from sequence with Support Vector Machines
Emidio Capriotti (1), Piero Fariselli (1), Remo Calabrese (1), Rita Casadio (1)
1) Laboratory of Biocomputing, CIRB/Department of Biology - University of Bologna
Motivation: Predicting the change in protein stability upon mutation is a key problem for understanding protein folding and misfolding. At present, methods for predicting stability changes are available only when the atomic structure of the protein is known. Methods that address the same task starting from the protein sequence are nevertheless necessary to complete genome annotation, especially in relation to single nucleotide polymorphisms (SNPs) and related diseases.
Results: We develop a method based on support vector machines (SVM) that, starting from the protein sequence, predicts the sign and the value of the free energy stability change (ΔΔG) upon single point mutation. We show that the accuracy of our predictor is as high as 77% in the specific task of predicting the ΔΔG sign related to the corresponding protein stability. When predicting the ΔΔG values, a satisfactory correlation with the experimental data is also found. As a final blind benchmark, the predictor is applied to proteins with a set of disease-related SNPs for which thermodynamic data are also known. We find that our predictions corroborate the view that disease-related mutations correspond to a decrease in protein stability.
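As an illustration of how a single point mutation might be turned into a fixed-length input for an SVM when only the sequence is known, the sketch below uses one possible encoding. The encoding itself (a 20-element mutation vector plus residue counts in a sequence window), the function name `encode_mutation` and the `window` parameter are assumptions made for this example; the abstract does not specify the input features.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def encode_mutation(sequence, position, mutant, window=9):
    """Encode a point mutation as a 40-element vector (assumed encoding).

    First 20 elements: -1 at the wild-type residue, +1 at the mutant residue.
    Last 20 elements: counts of each residue type in a sequence window
    centred on the mutated position (the mutated position itself excluded).
    """
    wild = sequence[position]
    if wild == mutant:
        raise ValueError("mutant residue must differ from the wild type")

    mutation = [0.0] * 20
    mutation[AMINO_ACIDS.index(wild)] = -1.0
    mutation[AMINO_ACIDS.index(mutant)] = 1.0

    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    context = [0.0] * 20
    for k in range(start, end):
        if k != position:
            context[AMINO_ACIDS.index(sequence[k])] += 1.0

    return mutation + context

# Hypothetical sequence: mutate Y at index 4 to A, using a 5-residue window.
features = encode_mutation("MKTAYIAKQR", position=4, mutant="A", window=5)
```

A vector of this kind could be passed to any SVM implementation, with a classifier for the ΔΔG sign and a regressor for the ΔΔG value.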
OP-12 Fast Protein Classification with Multiple Networks
Koji Tsuda (1), HyunJung Shin (1), Bernhard Schoelkopf (1)
1) Max Planck Institute for Biological Cybernetics
Support vector machines (SVM) have been successfully used to classify proteins into functional categories. Recently, to integrate multiple data sources, a semidefinite programming (SDP) based SVM method was introduced [Lanckriet et al., 2004]. In SDP/SVM, the kernel matrices corresponding to the individual data sources are combined with weights obtained by solving an SDP. However, when SDP/SVM is applied to large problems, the computational cost can become prohibitive, since both converting the data to a kernel matrix for the SVM and solving the SDP are demanding in time and memory. Another, application-specific drawback arises when some of the data sources are protein networks. A common method of converting a network to a kernel matrix is the diffusion kernel method, which has time complexity O(n^3) and produces a dense matrix of size n x n.
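To make the cost of the diffusion kernel concrete, the following minimal sketch computes K = exp(-beta * L) for a small graph, where L is the graph Laplacian. It uses a truncated Taylor series with dense matrix products, each of which is cubic in the number of nodes; the value of `beta`, the number of series terms and the toy graph are assumptions chosen for illustration, and practical implementations typically use an eigendecomposition instead.

```python
def graph_laplacian(n, edges):
    """Dense Laplacian L = D - A of an undirected, unweighted graph."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        lap[i][i] += 1.0
        lap[j][j] += 1.0
    return lap

def matmul(a, b):
    """Dense n x n matrix product, O(n^3) operations."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(laplacian, beta=0.5, terms=30):
    """Approximate K = exp(-beta * L) with a truncated Taylor series."""
    n = len(laplacian)
    step = [[-beta * x for x in row] for row in laplacian]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # current series term, starts at identity
    for k in range(1, terms + 1):
        # term_k = term_{k-1} * (-beta * L) / k
        term = [[x / k for x in row] for row in matmul(term, step)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += term[i][j]
    return kernel

# Toy path graph 0 - 1 - 2 - 3: only 3 edges, yet all 16 kernel entries are non-zero.
K = diffusion_kernel(graph_laplacian(4, [(0, 1), (1, 2), (2, 3)]))
```

Even though the toy network is sparse, every entry of K is positive, which is the density problem noted above.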
We propose an efficient method of protein classification using multiple protein networks. Available protein networks, such as a physical interaction network or a metabolic network, can be directly incorporated. Vectorial data can also be incorporated after conversion into a network by means of neighbor point connection. Similarly to the SDP/SVM method, the combination weights are obtained by convex optimization. Due to the sparsity of network edges, the computation time is nearly linear in the number of edges of the combined network. Additionally, the combination weights provide information useful for discarding noisy or irrelevant networks. Experiments on function prediction of 3588 yeast proteins show promising results: the computation time is enormously reduced, while the accuracy is still comparable to the SDP/SVM method.
OP-13 Less is More: Towards an Optimal Universal Description of the Universe of Protein Folds
Joseph Szustakowski (1), Simon Kasif (1), Zhiping Weng (1)
1) Boston University, Department of Biomedical Engineering, 44 Cummington Street, Boston, MA 02215
ABSTRACT: Motivation: Identification and characterization of protein structure regularities can reveal the mechanisms governing protein structure, function and evolution. Here we focus on an intermediate level of regularity. We developed automated methods to systematically construct a dictionary of supersecondary structures that can be used as "protein parts" to describe fold-sized structures.
Results: The dictionary was constructed by aligning representative structures of all known folds, clustering similar sub-structures, and selecting the most descriptive sub-structures in a Minimum Description Length fashion. We show that the dictionary is compact and descriptive, capable of describing a substantial fraction of all known protein folds. We performed simulations using independent sets of training and testing folds. Dictionaries generated using the training set had high coverage over the folds in the testing set, suggesting that dictionary entries reflect general features of protein structures and should be capable of describing novel protein folds.
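The selection step can be illustrated with a toy greedy procedure in the spirit of Minimum Description Length. Each candidate sub-structure pays a fixed cost for entering the dictionary and saves a cost for every structural element it newly describes; a candidate is added only while the total description length keeps decreasing. The cost values, the candidate names and the function `select_dictionary` are hypothetical, and the actual description-length criterion used by the authors is not given in the abstract.

```python
def select_dictionary(candidates, entry_cost=1.5, miss_cost=1.0):
    """Greedy MDL-style selection (illustrative only).

    candidates: {name: set of structural elements the sub-structure describes}
    Description length = entry_cost * (dictionary size)
                       + miss_cost * (number of elements left undescribed).
    """
    universe = set().union(*candidates.values())
    uncovered = set(universe)
    selected = []
    while True:
        best_name, best_gain = None, 0.0
        for name in sorted(candidates):          # sorted for deterministic ties
            if name in selected:
                continue
            # Reduction in description length if this entry is added.
            gain = miss_cost * len(candidates[name] & uncovered) - entry_cost
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:                    # no entry shortens the description
            break
        selected.append(best_name)
        uncovered -= candidates[best_name]
    coverage = 1.0 - len(uncovered) / len(universe) if universe else 0.0
    return selected, coverage

# Hypothetical candidates and the elements (f1..f6) each one can describe.
candidates = {
    "bab": {"f1", "f2", "f3"},
    "hairpin": {"f3", "f4", "f6"},
    "rare": {"f5"},
}
dictionary, coverage = select_dictionary(candidates)
```

In this toy run the entry that describes a single element is rejected because storing it costs more than it saves, which mirrors the trade-off between compactness and coverage.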
OP-14 Evaluating the usefulness of protein structure models for molecular replacement
Alejandro Giorgetti (1), Domenico Raimondo (1), Adriana Erica Miele (1), Anna Tramontano (1)
1) Dept. Biochemical Sciences, University of Rome
ABSTRACT: Motivation: We investigate the relationship between the quality of models of protein structure and their usefulness as search models in molecular replacement, a widely used method to experimentally determine protein structures by X-ray crystallography.
Results: We used the available models submitted to the Critical Assessment of Techniques for Protein Structure Prediction (CASP) to determine in which cases they can be used automatically as search templates for molecular replacement. Our results show that the quality of the models correlates with their suitability for molecular replacement, but that the traditional practice of relying on sequence identity between the model and the structure to be solved is not diagnostic of the success of the procedure.
Availability: Additional data are available at http://cassandra.bio.uniroma1.it/mr-results-casp.html
CONTACT: firstname.lastname@example.org / email@example.com