## Biology by other means

### Papers

Annotation of Metagenome Short Reads Using Proxygenes
Victor Markowitz, Daniel Dalevi, Konstantinos Mavrommatis, Natalia Ivanova, Ernest Szeto, Philip Hugenholtz and Nikos Kyrpides
Abstract:
Motivation:
A typical metagenome dataset generated using a 454 pyrosequencing platform consists of short reads sampled from the collective genome of a microbial community. The amount of sequence in such datasets is usually insufficient for assembly; therefore, traditional gene prediction cannot be applied to them. As a result, analysis of such metagenome datasets usually involves comparisons in terms of the relative abundances of various protein families. The latter requires assignment of individual reads to protein families, which is hindered by the fact that short reads contain only a fragment, usually small, of a protein.
Results:
We have considered the assignment of pyrosequencing reads to protein families directly using RPS-BLAST against COG and Pfam databases and indirectly via proxygenes that are identified using BLASTx searches against protein sequence databases. Using simulated metagenome datasets as benchmarks, we show that the proxygene method is more accurate than the direct assignment. We introduce a clustering method which significantly reduces the size of a metagenome dataset while maintaining a faithful representation of its functional and taxonomic content.
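The indirect assignment scheme reads as a simple two-step lookup: each read's best database hit serves as its proxygene, and the read inherits the proxygene's protein-family annotation. A minimal sketch (the data structures and names are hypothetical; the paper derives hits from BLASTx searches and families from COG/Pfam):

```python
def assign_families(read_hits, proxy_family):
    """Proxygene-style indirect assignment: walk each read's ranked hit
    list and inherit the protein family of the first hit (the proxygene)
    that carries a family annotation."""
    assignment = {}
    for read, hits in read_hits.items():
        for gene in hits:  # hits are assumed ordered best-first
            if gene in proxy_family:
                assignment[read] = proxy_family[gene]
                break
    return assignment
```

Reads whose hits all lack a family annotation are left unassigned, so only reads with an annotated proxygene contribute to family abundance counts.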

PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database
Chia-Ying Yang, Chao-Hui Chang, Ya-Ling Yu, Tsu-Chun Emma Lin, Sheng-An Lee, Chueh-Chuan Yen, Jinn-Moon Yang, Jin-Mei Lai, Yi-Ren Hong, Tzu-Ling Tseng, Kun-Mao Chao and Chi-Ying F Huang
Abstract:
Motivation:
To fully understand how a protein kinase regulates biological processes, it is imperative to first identify its substrate(s) and interacting protein(s). However, of the 518 known human serine/threonine/tyrosine kinases, only 35% have known substrates and only 14% have identified substrate recognition motifs. In contrast, 85% of the kinases have protein-protein interaction (PPI) datasets, raising the possibility that we might reveal potential kinase-substrate pairs from these PPIs.
Results:
PhosphoPOINT, a comprehensive human kinase interactome and phospho-protein database, is a collection of 4,195 phospho-proteins with a total of 15,738 phosphorylation sites. PhosphoPOINT annotates the interactions among kinases, their down-stream substrates and their interacting (phospho)-proteins to modulate the kinase-substrate pairs. PhosphoPOINT implements various gene expression profiles and Gene Ontology cellular component information to evaluate each kinase and its interacting (phospho)-proteins/substrates. Integration of cSNPs that cause amino acid changes with the phospho-protein dataset reveals that 64 phosphorylation sites result in disease phenotypes when changed; the linked phenotypes include schizophrenia and hypertension. PhosphoPOINT also provides a search function for all phospho-peptides using about 300 known kinase/phosphatase substrate/binding motifs. Altogether, PhosphoPOINT provides robust annotation for kinases, their down-stream substrates and their interacting (phospho)-proteins, which should accelerate the functional characterization of kinome-mediated signaling.
Availability:
PhosphoPOINT can be freely accessed at http://kinase.bioinformatics.tw/

Functional Coherence in Domain Interaction Networks
Jayesh Pandey, Mehmet Koyuturk, Shankar Subramaniam and Ananth Grama
Abstract:
Motivation:
Extracting functional information from protein-protein interactions (PPIs) poses significant challenges arising from the noisy, incomplete, generic and static nature of data obtained from high-throughput screening. Typical proteins are composed of multiple domains, often regarded as their primary functional and structural units. Motivated by these considerations, domain-domain interactions (DDIs) have received significant recent attention as a basis for network-based analyses. This paper performs a formal comparative investigation of the relationship between functional coherence and topological proximity in PPI and DDI networks. This investigation provides a necessary basis for continued and focused investigation of DDIs as abstractions for functional characterization and modularization of networks.
Results:
We investigate the problem of assessing the functional coherence of two biomolecules (or segments thereof) in a formal framework. We establish basic intuitive characteristics of admissible measures of functional coherence, and demonstrate that existing, well-accepted measures are ill-suited to comparative analyses involving different entities (i.e., domains vs. proteins). We propose a statistically motivated functional similarity measure that takes into account functional specificity as well as the distribution of functional attributes across entity groups to assess functional similarity in a statistically meaningful and biologically interpretable manner.
Results on diverse data that includes high-throughput and computationally predicted PPIs, as well as structural and computationally inferred DDIs for diverse organisms show that (i) the relationship between functional similarity and network proximity is captured in a much more (biologically) intuitive manner by our measure, compared to existing measures, and (ii) network proximity and functional similarity are significantly more correlated in DDI networks than in PPI networks, and that structurally determined DDIs provide better functional relevance as compared to computationally inferred DDIs.
Contact:
Jayesh Pandey, jpandey@cs.purdue.edu

Organization and integration of biomedical knowledge with concept maps for key peroxisomal pathways
Marcel Willemsen, Gerbert Jansen, Jasper Komen, Sander van Hooff, Hans Waterham, Pedro Brites, Ronald Wanders and Antoine van Kampen
Abstract:
Motivation:
One important area of clinical genomics research involves the elucidation of the molecular mechanisms underlying (complex) disorders, which eventually may lead to new diagnostic or drug targets. To further advance this area of clinical genomics, one of the main challenges is the acquisition and integration of data, information and expert knowledge for specific biomedical domains and diseases. Currently, the required information is not well organized but scattered across biological and biomedical databases, basic textbooks, scientific literature and experts' minds, and may be highly specific, heterogeneous, complex and voluminous.
Results:
We present a new framework to construct knowledge bases that uses concept maps for the presentation of information and the Web Ontology Language (OWL) for its representation. We demonstrate this framework through the construction of a peroxisomal knowledge base, which focuses on four key peroxisomal pathways and several related genetic disorders. All 155 concept maps in our knowledge base are linked to at least one other concept map, which allows the visualization of one large network of related pieces of information.
Availability:
The peroxisome knowledge base is available from www.bioinformaticslaboratory.nl (Support => Web applications).
Contact:
a.h.vankampen@amc.uva.nl
Supplementary information:
Supplementary information is available from www.bioinformaticslaboratory.nl (Research => Output => Publications => KB_SuppInfo)

An Integrative Approach for Predicting Interactions of Protein Regions
Sven-Eric Schelhorn, Thomas Lengauer and Mario Albrecht
Abstract:
Motivation:
Protein-protein interactions are commonly mediated by the physical contact of distinct protein regions. Computational identification of interacting protein regions aids in the detailed understanding of protein networks and supports the prediction of novel protein interactions and the reconstruction of protein complexes.
Results:
We introduce an integrative approach for predicting protein region interactions using a probabilistic model fitted to an observed protein network. In particular, we consider globular domains, short linear motifs, and coiled-coil regions as potential protein binding regions. Possible cooperations between multiple regions within the same protein are taken into account. A fine-grained confidence system allows for varying the impact of specific protein interactions and region annotations on the modeling process. We apply our prediction approach to a large training set using a maximum likelihood method, compare different scoring functions for region interactions, and validate the predicted interactions against a collection of experimentally observed interactions. In addition, we analyze prediction performance with respect to the inclusion of different region types, the incorporation of confidence values for training data, and the utilization of predicted protein interactions.

Nonparametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry
Lukas Käll, John D Storey and William Stafford Noble
Abstract:
A mass spectrum produced via tandem mass spectrometry can be tentatively matched to a peptide sequence via database search. Here, we address the problem of assigning a posterior error probability (PEP) to a given peptide-spectrum match (PSM). This problem is considerably more difficult than the related problem of estimating the error rate associated with a large collection of PSMs. Existing methods for estimating PEPs rely on a parametric or semiparametric model of the underlying score distribution. We demonstrate how to apply nonparametric logistic regression to this problem. The method makes no explicit assumptions about the form of the underlying score distribution; instead, it relies upon decoy PSMs, produced by searching the spectra against a decoy sequence database, to provide a model of the null score distribution. We show that our nonparametric logistic regression method produces accurate PEP estimates for six different commonly used PSM score functions. In particular, the estimates produced by our method are comparable in accuracy to those of PeptideProphet, which uses a parametric or semiparametric model designed specifically to work with SEQUEST. The advantage of the nonparametric approach is its applicability and robustness to new score functions and new types of data. C++ code implementing the method as well as supplementary information is available at http://noble.gs.washington.edu/proj/qvality.
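The decoy-based idea can be conveyed with a crude binned estimator: in each score bin, the ratio of decoy PSMs to target PSMs approximates the local error probability. This is only a stand-in for the paper's nonparametric logistic regression (the bin count is an assumed parameter):

```python
import numpy as np

def binned_pep(target_scores, decoy_scores, n_bins=20):
    """Crude binned stand-in for nonparametric PEP estimation: in each
    score bin, PEP is approximated by the ratio of decoy PSMs (a model
    of incorrect matches) to target PSMs, clipped to [0, 1]."""
    scores = np.concatenate([target_scores, decoy_scores])
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    t_hist, _ = np.histogram(target_scores, bins=edges)
    d_hist, _ = np.histogram(decoy_scores, bins=edges)
    pep = np.where(t_hist > 0,
                   np.minimum(d_hist / np.maximum(t_hist, 1), 1.0),
                   1.0)  # no targets in a bin: report maximal uncertainty
    return edges, pep
```

With simulated scores where correct matches cluster at high values, the estimated PEP drops toward zero in high-score bins; a proper implementation would replace the histogram with a smooth fit such as the paper's logistic regression.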

De novo identification of metabolites by analyzing tandem mass spectra
Sebastian Boecker and Florian Rasche
Abstract:
Motivation:
Mass spectrometry is among the most widely used technologies in proteomics and metabolomics. Being a high-throughput method, it produces large amounts of data that necessitate automated analysis of the spectra. Clearly, database search methods for protein analysis can easily be adapted to analyze metabolite mass spectra. But for metabolites, de novo interpretation of spectra is even more important than for protein data, because metabolite databases cover only a small fraction of naturally occurring metabolites: even the model plant Arabidopsis thaliana has a large number of enzymes whose substrates and products remain unknown. The field of bio-prospecting searches biologically diverse areas for metabolites that might serve as pharmaceuticals. De novo identification of metabolite mass spectra requires new concepts and methods since, unlike proteins, metabolites possess a non-linear molecular structure.
Results:
In this work, we introduce a method for the fully automated de novo identification of metabolites from tandem mass spectra. Mass spectrometry data are usually assumed to be insufficient for the identification of molecular structures, so we aim to estimate the molecular formula of the unknown metabolite, a crucial step toward its identification. The method first calculates all molecular formulas that explain the parent peak mass. Then, a graph is built where vertices correspond to molecular formulas of all peaks in the fragmentation mass spectra, and edges correspond to hypothetical fragmentation steps. Our algorithm then calculates the maximum scoring subtree of this graph: each peak in the spectra must be scored at most once, so the subtree may contain only one explanation per peak. Unfortunately, finding this subtree is NP-hard. We suggest three exact algorithms (including one fixed-parameter tractable algorithm) as well as two heuristics to solve the problem. Tests on real mass spectra show that the FPT algorithm and the heuristics solve the problem suitably fast and provide excellent results: for all 32 test compounds the correct solution was among the top five suggestions, and for 26 compounds the first suggestion of the exact algorithm was correct.
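The first step, calculating all molecular formulas that explain the parent peak mass, can be sketched by brute force over element counts (here restricted to CHNO with standard monoisotopic masses; the tolerance value is an assumed parameter, and real metabolite alphabets are larger):

```python
# Monoisotopic masses in Da.
MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulas(parent_mass, tol=0.005):
    """Enumerate (C, H, N, O) counts whose total monoisotopic mass
    matches parent_mass within tol -- the set of molecular formulas
    that could explain the parent peak."""
    hits = []
    for c in range(int(parent_mass // MASS["C"]) + 1):
        for n in range(int(parent_mass // MASS["N"]) + 1):
            for o in range(int(parent_mass // MASS["O"]) + 1):
                rest = (parent_mass - c * MASS["C"]
                        - n * MASS["N"] - o * MASS["O"])
                if rest < -tol:
                    break  # adding more O only overshoots further
                h = round(rest / MASS["H"])
                if h >= 0 and abs(rest - h * MASS["H"]) <= tol:
                    hits.append((c, h, n, o))
    return hits
```

Each candidate formula then becomes a vertex hypothesis for the parent peak; the same enumeration applied to fragment peaks populates the graph whose maximum scoring subtree the paper's algorithms compute.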

An environmental perspective on large-scale genome clustering based on metabolic capabilities
Gabi Kastenmüller, Johann Gasteiger and Hans-Werner Mewes
Abstract:
Motivation:
In principle, an organism's ability to survive in a specific environment is an observable result of the organism's regulatory and metabolic capabilities. Nonetheless, our knowledge about the global relation between the metabolisms and the niches of organisms is still limited.
Results:
In order to further investigate this relation, we grouped organisms showing similar metabolic capabilities and systematically mapped their habitats onto these groups. For this purpose, we predicted the metabolic capabilities of 214 sequenced genomes. Based on these predictions, we grouped the genomes by hierarchical clustering. Finally, we mapped different environmental conditions and diseases related to the genomes onto the resulting clusters. This mapping uncovered several conditions and diseases that were unexpectedly enriched in clusters of metabolically similar organisms. As an example, Encephalitozoon cuniculi - a microsporidian causing a multisystemic disease accompanied by CNS problems in rabbits - occurred in the same metabolism-based cluster as bacteria causing similar symptoms in humans.

Fusing Time Series Expression Data through Hybrid Aggregation and Hierarchical Merge
Elena Tsiporkova and Veselka Boeva
Abstract:
A novel integration approach targeting the combination of multi-experiment time series expression data is proposed. A recursive hybrid aggregation algorithm is initially employed to extract a set of genes that are potentially of interest for the studied biological phenomenon. Next, a hierarchical merge procedure is specially developed for the purpose of fusing together the multi-experiment expression profiles of the selected genes. This procedure employs dynamic time warping alignment techniques in order to account adequately for possible phase shifts between the different experiments. We demonstrate subsequently that the resulting gene expression profiles consistently reflect the behavior of the original expression profiles in the different experiments.
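The dynamic time warping alignment underlying the merge procedure is the classic dynamic-programming recurrence; a minimal sketch (absolute-difference local cost is an assumption):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two expression profiles.
    D[i][j] holds the cheapest alignment cost of a[:i] against b[:j];
    warping absorbs phase shifts that plain pointwise comparison cannot."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match step
    return D[n][m]
```

Two profiles that differ only by a phase shift align at zero cost, which is exactly the property the merge procedure exploits when fusing experiments.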

Gaussian Process Modelling of Latent Chemical Species: Applications to Inferring Transcription Factor Activities
Pei Gao, Antti Honkela, Magnus Rattray and Neil Lawrence
Abstract:
Motivation:
Inference of latent chemical species in biochemical interaction networks is a key problem in estimation of the structure and parameters of the genetic, metabolic and protein interaction networks that underpin all biological processes. We present a framework for Bayesian marginalisation of these latent chemical species through Gaussian process priors.
Results:
We demonstrate our general approach on three different biological examples of single input motifs, including both activation and repression of transcription. We focus in particular on the problem of inferring transcription factor activity when the concentration of active protein cannot easily be measured. We show how the uncertainty in the inferred transcription factor activity can be integrated out in order to derive a likelihood function that can be used for the estimation of regulatory model parameters. An advantage of our approach is that we avoid the use of a coarse-grained discretization of continuous-time functions, which would lead to a large number of additional parameters to be estimated. We develop exact (for linear regulation) and approximate (for non-linear regulation) inference schemes, which are much more efficient than competing sampling-based schemes and therefore provide us with a practical toolkit for model-based inference.
Availability:
The software and data for recreating all the experiments in this paper are available in MATLAB from http://www.cs.man.ac.uk/~neill/gpsim
Contact:
neill@cs.man.ac.uk
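The Gaussian process prior at the heart of the framework can be illustrated with generic GP regression: given noisy observations of a latent function, the posterior mean reconstructs it at unobserved points. A minimal sketch with a squared-exponential kernel (the kernel, lengthscale and noise level are generic choices, not the paper's mechanistic single-input-motif model):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of GP regression: a smooth reconstruction of the
    latent function from noisy observations at x_train."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    return K_star @ np.linalg.solve(K, y_train)
```

In the paper the latent function (transcription factor activity) is additionally coupled to observed targets through a regulation model, and the marginalisation is carried out exactly or approximately rather than by plain regression.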

SIRENE: Supervised Inference of Regulatory Networks
Fantine Mordelet and Jean-Philippe Vert
Abstract:
Motivation:
Living cells are the product of gene expression programs that involve the regulated transcription of thousands of genes. The elucidation of transcriptional regulatory networks is thus needed to understand the cell's working mechanism, and can, for example, be useful for the discovery of novel therapeutic targets. Although several methods have been proposed to infer gene regulatory networks from gene expression data, a recent comparison on a large-scale benchmark experiment revealed that most current methods only predict a limited number of known regulations at a reasonable precision level.
Results:
We propose SIRENE, a new method for the inference of gene regulatory networks from a compendium of expression data. The method decomposes the problem of gene regulatory network inference into a large number of local binary classification problems that focus on separating target genes from non-targets for each TF. SIRENE is thus conceptually simple and computationally efficient. We test it on a benchmark experiment aimed at predicting regulations in E. coli, and show that it retrieves on the order of 6 times more known regulations than other state-of-the-art inference methods.
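The decomposition into one binary classification problem per transcription factor works with any classifier; the sketch below scores candidate targets by cosine similarity to the centroid of a TF's known targets (a nearest-centroid stand-in for the SVMs used in the paper; data structures are hypothetical):

```python
import numpy as np

def score_targets(expr, known_targets, candidates):
    """For one TF: score each candidate gene by cosine similarity of
    its expression profile to the centroid of the TF's known targets.
    Ranking candidates by this score mimics SIRENE's per-TF local
    classification step with a much simpler classifier."""
    centroid = np.mean([expr[g] for g in known_targets], axis=0)
    scores = {}
    for g in candidates:
        v = expr[g]
        scores[g] = float(v @ centroid /
                          (np.linalg.norm(v) * np.linalg.norm(centroid)))
    return scores
```

Running this for every TF and thresholding the scores yields a predicted regulatory network; the paper's contribution is showing that this local, supervised formulation outperforms global unsupervised inference.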

PhyloDetect: A likelihood-based strategy for detecting microorganisms with diagnostic microarrays
Hubert Rehrauer, Susan Schönmann, Leo Eberl and Ralph Schlapbach
Abstract:
Motivation:
Detection and identification of microbes using diagnostic arrays is still the subject of ongoing research. Existing significance-based algorithms consider an organism detected even if a significant number of the microarray probes that match the organism are called absent in a hybridization. Further, they generate redundant results if the target organisms show high sequence similarity and the microarray probes cannot discriminate all of them.
Results:
We propose a new analysis strategy that considers organism similarities and calls organisms present only if the probes that match the organism but are absent in a hybridization can be explained by random events. In our strategy, we first identify the groups of target organisms that are actually distinguishable by the array. Subsequently, these organism groups are placed in a hierarchical tree such that groups matching only less specific probes are closer to the tree root, and groups that are discriminated by only a few probes are close to each other. Finally, we compute for each group a likelihood score that is based on a hypothesis test with the null hypothesis that the group was actually present in the hybridized sample. We have validated our strategy using data sets from two different array types and implemented it as an easy-to-use web application.
Availability:
http://www.fgcz.ethz.ch/PhyloDetect
Contact:
Hubert.Rehrauer@fgcz.uzh.ch
Supplementary information:
Example data is available at http://www.fgcz.ethz.ch/PhyloDetect
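The likelihood score rests on a hypothesis test: under the null that the organism group is present, absent calls among its matching probes should be explainable as random probe failures. A binomial sketch of such a test (the per-probe failure rate is an assumed parameter, not a value from the paper):

```python
from math import comb

def presence_pvalue(n_probes, n_absent, p_fail=0.05):
    """Under the null hypothesis that the organism is present, each of
    its n_probes matching probes is independently called absent with
    probability p_fail. The p-value is P(at least n_absent absent calls);
    a tiny p-value argues against calling the organism present."""
    return sum(comb(n_probes, k) * p_fail**k * (1 - p_fail)**(n_probes - k)
               for k in range(n_absent, n_probes + 1))
```

A group with 20 matching probes and one absent call is easily explained by chance, whereas 10 absent calls out of 20 would reject the null and the group would not be called present.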

Clinically driven semi-supervised class discovery in gene expression data
Israel Steinfeld, Roy Navon, Diego Ardigò, Ivana Zavaroni and Zohar Yakhini
Abstract:
Motivation:
Unsupervised class discovery in gene expression data relies exclusively on the statistical signals in the data to drive the results. It is often the case, however, that one is interested in constraining the search space to respect certain biological prior knowledge while still allowing a flexible search within these boundaries.
Results:
We develop approaches to semi-supervised class discovery. One approach uses known biological annotation of genes to drive the search, seeking partitions that manifest strong differential expression in specific sets of genes. In a second approach we use clinical sample information to constrain the search space and guide the class discovery process to yield biologically relevant partitions. We develop efficient algorithmics for these tasks, implementing both approaches and combinations thereof. We show that our method is robust enough to detect known clinical parameters in accordance with expected clinical values. We also use our method to elucidate putative cardiovascular disease (CVD) risk factors.
Availability:
MonoClaD (Monotone Class Discovery): http://88.37.33.133/ClassDiscovery/index.html
Supplementary information:
See http://88.37.33.133/ClassDiscovery/index.html

Protein Structure Alignment considering Phenotypic Plasticity
Gergely Csaba, Fabian Birzele and Ralf Zimmer
Abstract:
Motivation:
Protein structure comparison exhibits differences and similarities of proteins and protein families and may help to elucidate protein sequence and structure evolution. Despite many methods to score protein structure similarity with and without flexibility and to align proteins accurately based on their structures, a meaningful evolutionary distance measure and alignment method which models the cost of mutations, insertions and deletions occurring in protein sequences on the structure level is still missing.
Results:
Here we introduce a new measure for protein structure similarity and propose a novel method called PPM (Phenotypic Plasticity Method), which explicitly tries to model the evolutionary distance of two proteins on the structure level by measuring the cost of "morphing" one structure into the other. PPM aligns protein structures taking variations naturally observed in groups of structures ("phenotypic plasticity") into account while preserving the overall topological arrangement of the structures. The performance of PPM in detecting similarities between protein structures is evaluated against well-known structure classification methods on two benchmark sets. The larger set consists of more than 3.6 million structure pairs from the SCOP database that are also consistently classified in CATH. In the current parameterization, PPM already performs comparably to or better than other methods such as TM-Align and Vorolign on those two sets according to different evaluation criteria, showing that the method is able to reliably classify known protein structures, to detect their similarities and to compute accurate alignments despite phenotypic plasticity.
Availability:
Executables are available upon request. Datasets and supplementary data can be accessed on http://www.bio.ifi.lmu.de/PPM
Contact:
ralf.zimmer@bio.ifi.lmu.de

Detection of 3D atomic similarities and their use in the discrimination of small-molecule protein binding sites
Rafael Najmanovich, Natalja Kurbatova and Janet M Thornton
Abstract:
Motivation:
Current computational methods for the prediction of function from structure are restricted to the detection of similarities and subsequent transfer of functional annotation. In a significant minority of cases, global sequence or structural (fold) similarities do not provide clues about protein function. In these cases, one alternative is to detect local binding site similarities. These may still reflect more distant evolutionary relationships as well as unique physico-chemical constraints necessary for binding similar ligands, thus helping pinpoint the function. In the present work we ask the following question: is it possible to discriminate within a dataset of non-homologous proteins those that bind similar ligands based on their binding site similarities?
Methods:
We implement a graph-matching based method for the detection of 3D atomic similarities, introducing some simplifications that allow us to extend its applicability to the analysis of large all-atom binding site models. This method, called IsoCleft, does not require atoms to be connected either in sequence or space. We apply the method to a cognate-ligand bound dataset of non-homologous proteins. We define a family of binding site models with decreasing knowledge about the identity of the ligand-interacting atoms to uncouple the questions of predicting the location of the binding site and detecting binding site similarities. Furthermore, we calculate the individual contributions of binding site size, chemical composition and geometry to prediction performance.
Results:
We find that it is possible to discriminate between different ligand binding sites. In other words, there is a certain uniqueness in the set of atoms that are in contact with specific ligand scaffolds. This uniqueness is restricted to the atoms in close proximity to the ligand, in which case size and chemical composition alone are sufficient to discriminate binding sites. Discrimination ability decreases with decreasing knowledge about the identity of the ligand-interacting binding site atoms. The decrease is quite abrupt when considering size and chemical composition alone but much slower when including geometry. We also observe that certain ligands are easier to discriminate. Interestingly, the subset of binding site atoms belonging to highly conserved residues is not sufficient to discriminate binding sites, implying that convergently evolved binding sites arrived at dissimilar solutions.
Availability:
IsoCleft can be obtained from the authors.
Contact:
rafael.najmanovich@ebi.ac.uk

RNA structure alignment by a unit-vector approach
Emidio Capriotti and Marc A. Marti-Renom
Abstract:
Motivation:
The recent discovery of tiny RNA molecules such as miRNAs and small interfering RNAs is transforming the view of RNA as a simple information transfer molecule. Similar to proteins, the native three-dimensional structure of RNA determines its biological activity. Therefore, classifying the current structural space is paramount for functionally annotating RNA molecules. The increasing number of RNA structures deposited in the PDB requires more accurate, automatic and benchmarked methods for RNA structure comparison. In this paper, we introduce a new algorithm for RNA structure alignment based on a unit-vector approach. The algorithm has been implemented in the SARA program, which produces RNA structure pairwise alignments and their statistical significance.
Results:
The SARA program has been implemented to be of general applicability even when no secondary structure can be calculated from the RNA structures. A benchmark against the ARTS program using a set of 1,275 non-redundant pairwise structure alignments results in ~6% more alignments with at least 50% structurally superposed nucleotides and base pairs. A first attempt to perform automatic functional annotation of RNA based on structure alignments indicates that SARA can correctly assign the deepest SCOR classification to more than 60% of the query structures.
Availability:
The SARA program is freely available through a World Wide Web server http://sgu.bioinfo.cipf.es/services/SARA/.

Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining
Shi Yu, Steven Van Vooren, Leon-Charles Tranchevent, Bart De Moor and Yves Moreau
Abstract:
Motivation:
Computational gene prioritization methods are useful to help identify susceptibility genes potentially involved in genetic disease. Recently, text mining techniques have been applied to extract prior knowledge from text-based genomic information sources, and this knowledge can be used to improve the prioritization process. However, the effect of various vocabularies, representations and ranking algorithms on text mining for gene prioritization is still an issue that requires systematic and comparative study. This paper therefore presents a benchmark study of the vocabularies, representations and ranking algorithms used in gene prioritization by text mining.
Results:
We investigated 5 different domain vocabularies, 2 text representation schemes and 27 linear ranking algorithms for disease gene prioritization by text mining. We indexed 288,177 MEDLINE titles and abstracts with the TxtGate text profiling system and adapted the benchmark data set of the Endeavour gene prioritization system, which consists of 618 disease-causing genes. Textual gene profiles were created and their performance for prioritization was evaluated and discussed in a comparative manner. The results show that the inverse document frequency (IDF) based representation of gene term vectors performs better than the term-frequency inverse-document-frequency (TFIDF) representation. The eVOC and MeSH domain vocabularies perform better than GO, OMIM and LDDB. The ranking algorithms based on 1-SVM, standard correlation and the Ward linkage method provide the best performance.
Availability:
The MATLAB code of the algorithm and benchmark data sets are available by request.
Contact:
shi.yu@esat.kuleuven.be
Supplementary information:
Supplementary material has been submitted together with the paper for review and will be accessible on the web site.
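The winning IDF representation can be sketched directly: a gene's term vector carries the inverse document frequency of a vocabulary term if the term occurs in the gene's linked texts, and zero otherwise (toy data structures; the paper indexes MEDLINE abstracts with TxtGate):

```python
from math import log

def idf_vectors(gene_docs, vocabulary):
    """Build IDF-weighted term vectors for genes. gene_docs maps a gene
    to the set of vocabulary terms found in its linked abstracts; rarer
    terms get higher weight (log of inverse document frequency)."""
    n = len(gene_docs)
    df = {t: sum(t in terms for terms in gene_docs.values())
          for t in vocabulary}
    idf = {t: log(n / df[t]) if df[t] else 0.0 for t in vocabulary}
    return {g: [idf[t] if t in terms else 0.0 for t in vocabulary]
            for g, terms in gene_docs.items()}
```

Prioritization then reduces to ranking candidate genes by vector similarity to a disease profile, which is where the benchmarked ranking algorithms (1-SVM, correlation, Ward linkage) come in.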

Inter-species normalization of gene mentions with GNAT
Joerg Hakenberg, Conrad Plake, Robert Leaman, Michael Schroeder and Graciela Gonzalez
Abstract:
Motivation:
Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps link objects of potential interest, for instance, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization is particularly challenging given that most genes are known in various species, that different genes share similar names, and that gene names may include common English words, diseases or other biomedical concepts.
Results:
We present the first publicly available system, GNAT, reported to handle inter-species gene mention normalization. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an f-measure of 81.2% (95.0% precision at 70.9% recall). For the single-species task, we report an f-measure of 85.4% on human genes.
Availability:
A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. We will make our system available as a web-service in the BioCreative MetaService project, see http://bcms.bioinfo.cnio.es.

Analysis of Segmental Duplications via Duplication Distance
Crystal Kahn and Benjamin Raphael
Abstract:
Motivation:
Segmental duplications are common in mammalian genomes, but their evolutionary origins remain mysterious. A major difficulty in analyzing segmental duplications is that many duplications are complex mosaics of fragments of numerous other segmental duplications.
Results:
We introduce a novel measure called duplication distance that describes the minimum number of duplications necessary to create a target string by repeated insertions of fragments of a source string. We derive an efficient algorithm to compute duplication distance, and we use the algorithm to analyze segmental duplications in the human genome. Our analysis reveals possible ancestral relationships between segmental duplications including numerous examples of duplications that contain multiple, nested insertions of fragments from one or more other duplications. Using duplication distance, we also identify a small number of segmental duplications that seeded many other duplications in the genome, lending support to a two-step model of segmental duplication in the genome.
Availability:
Software for computing duplication distance is available upon request.
Contact:
braphael@cs.brown.edu, clkahn@cs.brown.edu
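The duplication distance defined above can be made concrete with a brute-force sketch. This is a hypothetical illustration of the definition only, not the authors' efficient algorithm: starting from the empty string, repeatedly insert substrings of the source and count the minimum number of insertions needed to reach the target.

```python
from collections import deque

def duplication_distance(source, target):
    """Minimum number of insertions of substrings of `source`
    needed to build `target` from the empty string.
    Brute-force BFS: feasible only for very short strings."""
    # All non-empty substrings of the source string.
    fragments = {source[i:j]
                 for i in range(len(source))
                 for j in range(i + 1, len(source) + 1)}
    seen = {""}
    queue = deque([("", 0)])
    while queue:
        current, dist = queue.popleft()
        if current == target:
            return dist
        for frag in fragments:
            for pos in range(len(current) + 1):
                nxt = current[:pos] + frag + current[pos:]
                # Insertions only lengthen the string, so anything
                # longer than the target can never become it.
                if len(nxt) <= len(target) and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # target cannot be built from source fragments
```

For example, with source "ab" and target "abab" the distance is 2 (two insertions of the fragment "ab").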

A Fast and Flexible Method for the Segmentation of aCGH Data
Erez Ben-Yaacov and Yonina Eldar
Abstract:
Motivation:
Array Comparative Genomic Hybridization (aCGH) is used to scan the entire genome for variations in DNA copy number. A central task in the analysis of aCGH data is the segmentation into groups of probes sharing the same DNA copy number. Some well known segmentation methods suffer from very long running times, preventing interactive data analysis.
Results:
We suggest a new segmentation method based on wavelet decomposition and thresholding, which detects significant breakpoints in the data. Our algorithm is over 1,000 times faster than leading approaches, with similar performance. Another key advantage of the proposed method is its simplicity and flexibility. Due to its intuitive structure, it can be easily generalized to incorporate several types of side information. Here we consider two extensions, which incorporate side information indicating the reliability of each measurement and compensate for changing variability in the measurement noise. The resulting algorithm outperforms existing methods, in terms of both speed and performance, when applied to real high-density CGH data.
Availability:
Implementation is available under the software tab at: http://www.ee.technion.ac.il/Sites/People/YoninaEldar/
Contact:
yonina@ee.technion.ac.il
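The core idea, thresholding Haar-style detail coefficients (differences of adjacent window means) to flag breakpoints, can be sketched as follows. This is a minimal illustration with made-up window and threshold parameters, not the authors' implementation:

```python
def haar_breakpoints(signal, half_window=4, threshold=1.0):
    """Score each interior position by the absolute difference
    between the means of the `half_window` samples on either
    side (a coarse Haar detail coefficient), then keep positions
    whose score exceeds `threshold` as candidate breakpoints."""
    candidates = []
    for i in range(half_window, len(signal) - half_window + 1):
        left = sum(signal[i - half_window:i]) / half_window
        right = sum(signal[i:i + half_window]) / half_window
        if abs(right - left) > threshold:
            candidates.append(i)
    # A real implementation would merge nearby candidates and
    # work over multiple wavelet scales.
    return candidates
```

On a signal that jumps from copy-number level 0 to level 3 at position 10, the candidates cluster around that position.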

Poisson Adjacency Distributions in Genome Comparison: Multichromosomal, Circular, Signed and Unsigned
Wei Xu, Benoît Alain and David Sankoff
Abstract:
The number of common adjacencies of genetic markers, as a measure of the similarity of two genomes, has been widely used as an indicator of evolutionary relatedness and as the basis for inferring phylogenetic relationships. Its probability distribution enables statistical tests of whether significant evolutionary signal remains in the genomes. In this paper, we derive the probability distributions of the number of adjacencies for a number of genome types: signed or unsigned, circular or linear, single-chromosomal or multichromosomal. Generating functions are found for the single-chromosome cases, from which exact counts can be calculated. The unsigned single-chromosome case is related to the dinner table problem and the non-attacking kings problem. Probabilistic approaches are adopted for the multichromosomal cases, where we find the exact values of the expectations and variances. In both cases, the limiting distributions are derived in terms of the number of adjacencies. For all unsigned cases, the limiting distribution is Poisson with parameter 2; for all signed cases, the limiting distribution is Poisson with parameter 1/2.
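For intuition, the statistic in question, the number of common adjacencies, can be computed directly for the unsigned linear single-chromosome case (a toy helper for illustration, not code from the paper):

```python
def common_adjacencies(genome1, genome2):
    """Number of marker adjacencies shared by two unsigned, linear,
    single-chromosome genomes, each given as an ordering of the
    same set of markers."""
    def adjacencies(order):
        # Unordered neighbouring pairs; frozenset makes {a, b} == {b, a}.
        return {frozenset(pair) for pair in zip(order, order[1:])}
    return len(adjacencies(genome1) & adjacencies(genome2))
```

For instance, the genomes (1, 2, 3, 4) and (4, 3, 1, 2) share the adjacencies {1, 2} and {3, 4}, giving a count of 2.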

HapCUT: An Efficient and Accurate Algorithm for the Haplotype Assembly Problem
Vikas Bansal and Vineet Bafna
Abstract:
The goal of the haplotype assembly problem is to reconstruct the two haplotypes (chromosomes) for an individual using a mix of sequenced fragments from the two chromosomes. This problem has been shown to be computationally intractable for various optimization criteria. Polynomial time algorithms have been proposed for restricted versions of the problem. In this paper, we consider the haplotype assembly problem in the most general setting, i.e. fragments of any length and with an arbitrary number of gaps. We describe a novel combinatorial approach for the haplotype assembly problem based on computing max-cuts in certain graphs derived from the sequenced fragments. Levy et al. have sequenced the complete genome of a human individual and used a greedy heuristic to assemble the haplotypes for this individual. We have applied our method HapCUT to infer haplotypes from this data and demonstrate that the haplotypes inferred using HapCUT are significantly more accurate (20-25% lower MEC scores for all chromosomes) than the greedy heuristic and a previously published method, Fast Hare. We also describe a Maximum Likelihood based estimator of the absolute accuracy of the sequence-based haplotypes using population haplotypes from the International HapMap project. A program implementing HapCUT is available on request.
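The MEC (minimum error correction) criterion mentioned above scores a candidate haplotype by the fewest base flips needed to make every fragment consistent with one of the two complementary haplotypes. A minimal sketch, assuming fragments are given as position-to-allele maps over 0/1 alleles:

```python
def mec_score(haplotype, fragments):
    """MEC score of a candidate haplotype: each fragment is matched
    against the haplotype and its complement, the smaller mismatch
    count is charged, and the charges are summed over fragments.
    `haplotype` is a 0/1 list; each fragment maps position -> allele."""
    complement = [1 - allele for allele in haplotype]
    total = 0
    for fragment in fragments:
        mism_h = sum(haplotype[pos] != allele
                     for pos, allele in fragment.items())
        mism_c = sum(complement[pos] != allele
                     for pos, allele in fragment.items())
        total += min(mism_h, mism_c)
    return total
```

A perfect assembly has MEC 0; each sequencing error in a fragment adds at most one to the score, which is why lower MEC indicates a more accurate reconstruction.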

Efficient representation and p-value computation for high order Markov motifs
Paulo Fonseca, Christian Gautier, Katia Guimarães and Marie-France Sagot
Abstract:
Motivation:
Position weight matrices (PWMs) have become a standard for representing biological sequence motifs. Their relative simplicity has favoured the development of efficient algorithms for diverse tasks such as motif identification, sequence scanning, and statistical significance evaluation. Markov chain-based models generalise the PWM model by allowing for inter-position dependencies to be considered, at the cost of substantial computational overhead, which has limited their application.
Results:
In this paper, we consider two aspects regarding the use of higher order Markov models for biological sequence motifs, namely, the representation and the computation of p-values for motifs described by a set of example occurrences. We propose an efficient representation based on the use of tries, from which empirical position-specific conditional base probabilities can be computed, and extend state-of-the-art PWM-based algorithms to allow for the computation of exact p-values for high order Markov motif models.
Availability:
The software is available in the form of a Java object-oriented library from http://www.cin.ufpe.br/~paguso/kmarkov.
Contact:
paguso@cin.ufpe.br
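To illustrate the kind of position-specific conditional probabilities involved, here is a toy first-order estimator trained from aligned example occurrences. Nested dictionaries stand in for the trie; this illustrates the modelling idea, not the paper's data structure or p-value algorithm:

```python
from collections import defaultdict

def train_first_order(occurrences):
    """Estimate position-specific first-order probabilities
    P(base at i | base at i-1) from aligned, equal-length example
    occurrences of a motif (nested dicts stand in for a trie)."""
    length = len(occurrences[0])
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(length)]
    for occ in occurrences:
        counts[0][""][occ[0]] += 1          # position 0 has no context
        for i in range(1, length):
            counts[i][occ[i - 1]][occ[i]] += 1
    model = []
    for pos_counts in counts:
        probs = {}
        for context, nxt in pos_counts.items():
            total = sum(nxt.values())
            probs[context] = {base: n / total for base, n in nxt.items()}
        model.append(probs)
    return model

def motif_probability(model, seq):
    """Probability of `seq` under the trained first-order model."""
    p = model[0][""].get(seq[0], 0.0)
    for i in range(1, len(seq)):
        p *= model[i].get(seq[i - 1], {}).get(seq[i], 0.0)
    return p
```

Unlike a PWM, the probability of a base at position i here depends on which base occupied position i-1, which is exactly the inter-position dependency the abstract refers to.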

Mining significant tree patterns in carbohydrate sugar chains
Kosuke Hashimoto, Ichigaku Takigawa, Motoki Shiga, Minoru Kanehisa and Hiroshi Mamitsuka
Abstract:
Carbohydrate sugar chains, or glycans, the third major class of macromolecules, form branch-shaped tree structures. Glycan motifs are known to be of two types: 1) conserved patterns called "cores", which contain the root, and 2) ubiquitous motifs, which appear in external parts including the leaves and are distributed over different glycan classes. Finding these glycan tree motifs is an important issue, but there have been no computational methods to capture them efficiently. We have developed an efficient method for mining motifs, or significant subtrees, from glycans. The key contributions of this method are: 1) a new concept, "alpha-closed frequent subtrees", and an efficient method for mining all such subtrees from given trees, and 2) the application of statistical hypothesis testing to rerank the frequent subtrees by significance. We experimentally verified the effectiveness of the proposed method using real glycans: 1) we examined the top ten subtrees obtained by our method at one parameter setting and confirmed that all of them are significant motifs in glycobiology; 2) we applied the results of our method to a classification problem and found that our method outperformed competing methods (SVMs with three different tree kernels), with all differences statistically significant.

Optimal Spliced Alignments of Short Sequence Reads
Fabio de Bona, Stephan Ossowski, Korbinian Schneeberger and Gunnar Rätsch
Abstract:
Motivation:
Next-generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error-prone compared to the Sanger method, their throughput is several orders of magnitude higher. To use such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate.
Results:
We present a novel approach, called QPALMA, for computing accurate spliced alignments which takes advantage of the read’s quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate its parameters to maximize alignment accuracy. In computational experiments we illustrate that the quality information as well as the splice site predictions help to improve the alignment quality. Finally, to facilitate mapping of massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana.

ChIPmix: Mixture model of regressions for two-color ChIP-chip analysis
Marie-Laure Martin-Magniette, Tristan Mary-Huard, Caroline Bérard and Stéphane Robin
Abstract:
Chromatin immunoprecipitation (ChIP) combined with DNA microarrays is a high-throughput technology to investigate DNA-protein binding or chromatin/histone modifications. ChIP-chip data require adapted statistical methods in order to identify enriched regions. All methods proposed so far are based on the analysis of the log-ratio (Ip/Input). Nevertheless, the assumption that the log-ratio is a pertinent quantity for assessing probe status is not always verified, and this leads to poor data interpretation. Instead of working on the log-ratio, we work directly with the Ip and Input signals of each probe by modeling the distribution of the Ip signal conditional on the Input signal. We propose a method named ChIPmix, based on a linear regression mixture model, to identify actual binding targets of the protein under study. Moreover, we are able to control the proportion of false positives. The efficiency of ChIPmix is illustrated on several datasets obtained from different organisms and hybridized either on tiling or promoter arrays. This validation shows that ChIPmix is suitable for any two-color array, whatever its density, and provides promising results.

Segment-based multiple sequence alignment
Tobias Rausch, Anne-Katrin Emde, David Weese, Andreas Döring, Cedric Notredame and Knut Reinert
Abstract:
Motivation:
Many multiple sequence alignment tools have been developed in the past, progressing either in speed or in alignment accuracy. Given the importance and widespread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far.
Results:
We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast, and (3) the consistency idea can be extended to align multiple genomic sequences.
Availability:
The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. Usage is described in the documentation of both tools.
Contact:
rausch@inf.fu-berlin.de

Connect the dots: Exposing hidden protein family connections from the entire sequence tree
Yaniv Loewenstein and Michal Linial
Abstract:
Motivation:
Mapping remote evolutionary links is a classic computational problem of much interest. Relating protein families allows functional and structural inference on uncharacterized families. Because their sequences have diverged beyond reliable alignment, such links are too remote to identify by conventional methods.
Approach:
We present a method to systematically identify remote evolutionary relations between protein families, leveraging a novel evolution-driven tree of all protein sequences and families. A global approach, which considers the entire volume of similarities while clustering sequences, leads to a robust tree that allows tracing of very faint evolutionary links. The method systematically scans the tree for clusters which partition exceptionally well into extant protein families, thus suggesting an evolutionary breakpoint in a putative ancient superfamily. Our method does not require family profiles (or HMMs) or multiple alignments.
Results:
Considering the entire Pfam database, we are able to suggest 710 links between protein families, 125 of which are confirmed by the existence of Pfam clans. The quality of our predictions is also validated by structural assignments. We further provide an intrinsic characterization of the validity of our results and give examples of new biological findings from our systematic scan. For example, we are able to relate several bacterial pore-forming toxin families, and then link them with a novel family of eukaryotic toxins expressed in plants, fish venom, and, notably, also uncharacterized proteins from human pathogens.
Availability:
A detailed list of putative homologous superfamilies, including 210 families of unknown function, has been made available online: http://www.protonet.cs.huji.ac.il/dots.

Improved Network-based Identification of Protein Orthologs
Nir Yosef, Roded Sharan and William Stafford Noble
Abstract:

Motivation:
Identifying protein orthologs is an important task that is receiving growing attention in the bioinformatics literature. Orthology detection provides a fundamental tool towards understanding protein evolution, predicting protein functions and interactions, aligning protein-protein interaction networks of different species, and detecting conserved modules within these networks.
Results:
Here, we present a novel diffusion-based framework that builds on the Rankprop algorithm for protein orthology detection and enhances it in several important ways. Specifically, we enhance the Rankprop algorithm to account for the presence of multiple paralogs, utilize protein-protein interactions, and consider multiple (>2) species in parallel. We comprehensively benchmarked our framework using a variety of training data sets and experimental settings. The results, based on the yeast, fly and human proteomes, show that the novel enhancements of Rankprop provide substantial improvements over its original formulation as well as over a number of state-of-the-art methods for network-based orthology detection.
Availability:
Data sets and source code are available upon request.
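Diffusion-based ranking of this kind can be sketched generically as a random walk with restart over a protein similarity network. The graph, parameters, and helper below are hypothetical, and this shows the general flavour of such methods rather than Rankprop itself:

```python
def random_walk_with_restart(adj, source, alpha=0.8, iters=50):
    """Rank nodes of a similarity network by a random walk that
    restarts at `source` with probability 1 - alpha each step.
    `adj` maps each node to a dict of neighbour -> edge weight."""
    score = {v: 1.0 if v == source else 0.0 for v in adj}
    for _ in range(iters):
        # Restart mass goes to the source; the rest diffuses
        # along outgoing edges proportionally to their weights.
        new = {v: (1 - alpha) if v == source else 0.0 for v in adj}
        for u, neighbours in adj.items():
            total = sum(neighbours.values())
            if total == 0:
                continue
            for v, w in neighbours.items():
                new[v] += alpha * score[u] * w / total
        score = new
    return score
```

Nodes closer (in the diffusion sense) to the query protein accumulate more score, so putative orthologs rank above weakly connected proteins, and disconnected ones receive no score at all.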

Comprehensive in silico mutagenesis highlights functionally important residues in proteins
Yana Bromberg and Burkhard Rost
Abstract:
Motivation:
Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing every residue in a protein is typically as impractical as substituting each residue with every non-native amino acid. In contrast, such exhaustive mutagenesis is feasible in silico.
Results:
Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (>1kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
Availability:
ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Contact:
yb2009@columbia.edu

Automatic decomposition of kinetic models of signaling networks minimizing the retroactivity among modules
Julio Saez-Rodriguez, Stefan Gayer, Martin Ginkel and Ernst Dieter Gilles
Abstract:
Motivation:
The modularity of biochemical networks in general, and signaling networks in particular, has been extensively studied over the past few years. Modularity has been proposed as a useful property for analyzing signaling networks: by decomposing a network into subsystems, more manageable units are obtained which are easier to analyze. While many powerful algorithms are available to identify modules in protein interaction networks, less attention has been paid to signaling networks defined as chemical systems. Such a decomposition would be very useful, as most quantitative models are defined using the latter, more detailed formalism.
Results:
Here, we introduce a novel method to decompose biochemical networks into modules so that the bidirectional (retroactive) couplings among the modules are minimized. Our approach adapts a method to detect community structures, and applies it to the so-called retroactivity matrix that characterizes the couplings of the network. Only the structure of the network, e.g. in SBML format, is required. Furthermore, the modularized models can be loaded into ProMoT, a modeling tool which supports modular modeling. This allows visualization of the models, exploiting their modularity, and easy generation of models of one or several modules for further analysis. The method is applied to several relevant cases, including an entangled model of the EGF-induced MAPK cascade and a comprehensive model of EGF-signaling, demonstrating its ability to uncover meaningful modules. Our approach can thus help to analyze large networks, especially when little a priori knowledge on the structure of the network is available.

From minimal signed circuits to the dynamics of Boolean regulatory networks
Elisabeth Remy and Paul Ruet
Abstract:
It is acknowledged that the presence of positive or negative circuits in regulatory networks such as genetic networks is linked to the emergence of significant dynamical properties such as multistability (involved in differentiation) and periodic oscillations (involved in homeostasis). Rules proposed by the biologist R. Thomas assert that these circuits are necessary for such dynamical properties. These rules have been studied by several authors. Their obvious interest is that they relate the rather simple information contained in the structure of the network (signed circuits) to its much more complex dynamical behaviour. We prove in this article a non-trivial converse of these rules, namely that certain positive or negative circuits in a regulatory graph are actually sufficient for the observation of a restricted form of the corresponding dynamical property, differentiation or homeostasis. More precisely, the crucial property that we require is that the circuit be globally elementary. We then apply these results to a model of the vertebrate immune system, and show that the two elementary functional positive circuits of the model indeed behave as modules which combine to explain the presence of the three stable states corresponding to the Th0, Th1 and Th2 cells.
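Thomas's rules referenced above can be checked on the smallest possible examples: in a synchronous Boolean setting, a two-node positive circuit yields two fixed points (multistability), while a two-node negative circuit yields none (only oscillation). A minimal sketch; the update rules are illustrative, not the paper's model:

```python
from itertools import product

def stable_states(update, n):
    """Fixed points of a synchronous Boolean network on n nodes."""
    return [s for s in product((0, 1), repeat=n) if update(s) == s]

def positive_circuit(s):
    """Two-node positive circuit: each gene activates the other."""
    return (s[1], s[0])

def negative_circuit(s):
    """Two-node negative circuit: x1 activates x2, x2 inhibits x1."""
    return (1 - s[1], s[0])
```

The positive circuit has the two stable states (0, 0) and (1, 1), a minimal form of multistability; the negative circuit has no fixed point, so every trajectory cycles, a minimal form of homeostatic oscillation.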

Temporal Logic Patterns for Querying Dynamic Models of Cellular Interaction Networks
Pedro T Monteiro, Delphine Ropers, Radu Mateescu, Ana T. Freitas and Hidde de Jong
Abstract:
Motivation:
Models of the dynamics of cellular interaction networks have become increasingly larger in recent years. Formal verification based on model checking provides a powerful technology to keep up with this increase in scale and complexity. The application of model-checking approaches is hampered, however, by the difficulty for non-expert users to formulate appropriate questions in temporal logic.
Results:
In order to deal with this problem, we propose the use of patterns, that is, high-level query templates that capture recurring biological questions and that can be automatically translated into temporal logic. The applicability of the developed set of patterns has been investigated by the analysis of an extended model of the network of global regulators controlling the carbon starvation response in Escherichia coli.
Availability:
GNA and the model of the carbon starvation response network are available at http://www-helix.inrialpes.fr/gna

Logical modelling of the role of the Hh pathway in the patterning of the Drosophila wing disc
Aitor Gonzalez, Claudine Chaouiya and Denis Thieffry
Abstract:
Motivation:
The development of most tissues and organs relies on a limited number of signal transduction pathways enabling the coordination of cellular differentiation. A proper understanding of the roles of signal transduction pathways requires the definition of formal models capturing the main qualitative features of these patterning processes. This is a challenging task due to the diversity of the underlying processes: diffusion, reception and sequestration of signalling molecules, regulatory modifications of signal transduction proteins, transcriptional regulation of target genes, etc. Moreover, most of these processes are only partly characterised, which complicates recourse to standard quantitative formal frameworks. In contrast, qualitative models can be more readily proposed on the basis of available (molecular) genetic data, but this requires novel computational tools and proper qualitative representations of phenomena such as diffusion or sequestration. In order to assess the power and limits of a logical formalism in this context, we propose a multi-level model of the multi-cellular network involved in the definition of the anterior-posterior boundary during the development of the wing disc of Drosophila melanogaster. The morphogen Hedgehog (Hh) is the inter-cellular signal coordinating this process. It is induced in and diffuses from the posterior compartment of the disc to activate its pathway in cells immediately anterior to the boundary. In these boundary cells, the Hh gradient induces target genes in distinct domains as a function of the Hh concentration. One target of Hh signalling is the gene coding for the receptor Patched (Ptc), which sequesters Hh and impedes further diffusion, thereby refining the boundary.
Results:
We have delineated a logical model of the patterning process defining the cellular anterior-posterior boundary in the developing imaginal disc of Drosophila melanogaster. This model qualitatively accounts for the formation of a gradient of Hh, as well as for the transduction of this signal through a balance between the activatory (CiA) and inhibitory (CiR) products of the gene cubitus interruptus (ci). Wild-type and mutant simulations have been carried out to assess the coherence of the model with experimental data. Interestingly, our computational analysis provides novel insights into poorly understood processes such as the regulation of Ptc by CiR, the formation of a functional gradient of CiA across boundary cells, and the functional differences in En between anterior and posterior cells. In conclusion, our model analysis demonstrates the flexibility of the logical formalism, enabling a consistent qualitative representation of diffusion, sequestration, and transcriptional and post-transcriptional regulatory processes within and between neighbouring cells.
Availability:
An XML file containing the proposed model together with annotations can be downloaded from our website (http://gin.univ-mrs.fr/GINsim/), along with GINsim, a logical modelling and simulation software freely available to academic groups.

Gain and Loss of Phosphorylation Sites as an Active Mechanism in Cancer
Predrag Radivojac, Peter Baenziger, Maricel Kann, Matthew Hahn, Matthew Mort and Sean Mooney
Abstract:
Motivation:
Coding-region mutations in human genes are responsible for a diverse spectrum of diseases and phenotypes. Among lesions that have been studied extensively, there are insights into several of the biochemical functions disease-causing mutations disrupt. Currently, there are more than 60,000 coding-region mutations associated with inherited disease catalogued in the Human Gene Mutation Database (HGMD, August 2007) and more than 70,000 polymorphic amino acid substitutions recorded in dbSNP (dbSNP, build 127). Understanding the mechanism and contribution these variants make to a clinical phenotype is a formidable problem.
Results:
In this study, we investigate the role of phosphorylation in somatic cancer mutations and inherited diseases. Somatic cancer mutation datasets were shown to have a significant enrichment for mutations that cause gain or loss of phosphorylation when compared to our control datasets (putatively neutral nsSNPs and random nsSNPs). Of the somatic cancer mutations, those in kinase genes represent the most enriched set of mutations that disrupt phosphorylation sites, suggesting that mutation of phosphorylation target sites is an active cause of phosphorylation deregulation. Overall, this evidence suggests that both gain and loss of a phosphorylation site in a target protein may be important features for predicting cancer-causing mutations and may represent a molecular cause of disease for a number of inherited and somatic mutations.

Logical Analysis of Survival Data: Prognostic survival models by detecting high degree interactions
Louis-Philippe Kronek and Reddy Anupama
Abstract:
Motivation:
Survival analysis involves predicting the time to event for patients in a dataset, based on a set of recorded attributes. In this study we focus on right-censored survival problems. Detecting high degree interactions for the estimation of survival probability is a challenging problem in survival analysis from the statistical perspective.
Results:
We propose a new methodology, Logical Analysis of Survival Data (LASD), to identify interactions between variables (survival patterns) without any prior hypotheses. Using this set of patterns, we predict survival distributions for each observation. To evaluate LASD we selected two publicly available datasets: a lung adenocarcinoma dataset (gene-expression profiles) and a breast cancer dataset (clinical profiles). Compared with survival decision trees, LASD improves the cross-validation accuracy by 18% for the gene-expression dataset and by 2% for the clinical dataset.