OP-19 Data-adaptive test statistics for microarray
Sach Mukherjee (1), Stephen Roberts (1), Mark
1) University of California, Berkeley
Motivation: A vitally important task in microarray is the selection of genes which are differentially different kinds of tissue samples, such as healthy However, microarray data contain an enormous dimensions (genes), and very few samples (arrays), which poses fundamental statistical problems process which have defied easy resolution.
Results: In this paper, we present a novel selection in which test statistics are learned simple notion of reproducibility in selection learning criterion. Reproducibility, as we computed without any knowledge of the `ground- advantage of certain properties of microarray asymptotically valid guide to expected loss under data-generating distribution. We are therefore minimize expected loss, and produce results robust than conventional methods. We apply simulated and oligonucleotide array data.
OP-20 Identifying active transcription factors and kinases from expression data using Pathway Queries
Florian Sohler (1), Ralf Zimmer (1)
1) Department of Informatics, Ludwig-Maximilians-Universität München
Although progress has been made in identifying regulatory relationships from expression data in general, only a few methods have focused on detecting biological mechanisms such as active pathways from a single measurement. This is of particular importance when only a few measurements are available, e.g. if special cell types or conditions are under investigation. Here we present a method to test user-specified hypotheses (pathway queries) on expression data where prior knowledge is given in the form of networks. Based on that method, we develop a scoring function to identify active transcription factors or kinases, thus making a first step towards explaining the measured expression data.
We apply the algorithm to the Rosetta Yeast Compendium dataset, finding that in many cases the results are in concordance with biological knowledge. We were able to confirm that transcription factors and, to a lesser degree, kinases identified by our method play an important role in the biological processes affected by the respective knock-outs. Furthermore, we show that correlation of inferred activities can provide evidence for a physical interaction or cooperation of transcription factors where correlation of plain expression data fails to do so.
OP-21 Analyzing Microarray Data Using Quantitative Association Rules
Elisabeth Georgii (1), Lothar Richter (1), Ulrich Rückert (1), Stefan Kramer (1)
1) TU München
We tackle the problem of finding regularities in microarray data. A wide variety of data mining tools such as clustering, classification, Bayesian networks and association rules have been applied so far to gain insight into gene expression data. Association rule mining techniques used so far work on discretizations of the data and cannot account for cumulative effects. In this paper, we investigate the use of quantitative association rules that can operate directly on numeric data and represent cumulative effects of variables. Technically speaking, this type of quantitative association rule, based on half-spaces, can find non-axis-parallel regularities.
Results: We performed a variety of experiments testing the utility of quantitative association rules for microarray data. First of all, the results should be statistically significant and robust against fluctuations in the data. Next, the approach should be scalable in the number of variables, which is important for such high-dimensional data. Finally, the rules should make sense biologically and be sufficiently different from rules found in regular association rule mining working with discretizations. In all of these dimensions, the proposed approach performed satisfactorily. Therefore, quantitative association rules based on half-spaces should be considered as a tool for the analysis of microarray gene expression data.
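As an illustration of what a half-space rule can express, the following minimal Python sketch evaluates one hand-written rule on toy data. The rule form (a weighted sum of expression values compared against a threshold on each side), the gene names, the weights and the support and confidence measures are all assumptions made for illustration; the abstract does not specify the authors' rule language or their mining procedure.

```python
def in_half_space(sample, weights, threshold):
    """True if the weighted sum of the selected expression values reaches the threshold."""
    return sum(w * sample[gene] for gene, w in weights.items()) >= threshold


def rule_stats(samples, antecedent, consequent):
    """Support and confidence of the rule 'antecedent half-space => consequent half-space'."""
    n_antecedent = 0
    n_both = 0
    for sample in samples:
        if in_half_space(sample, *antecedent):
            n_antecedent += 1
            if in_half_space(sample, *consequent):
                n_both += 1
    support = n_both / len(samples)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical expression values for four arrays and three genes.
samples = [
    {"geneA": 1.0, "geneB": 1.2, "geneC": 2.0},
    {"geneA": 0.2, "geneB": 0.3, "geneC": 0.1},
    {"geneA": 1.5, "geneB": 0.1, "geneC": 0.2},
    {"geneA": 0.9, "geneB": 1.4, "geneC": 1.8},
]

# Hypothetical rule: geneA + geneB >= 2.0  =>  geneC >= 1.5
antecedent = ({"geneA": 1.0, "geneB": 1.0}, 2.0)
consequent = ({"geneC": 1.0}, 1.5)

support, confidence = rule_stats(samples, antecedent, consequent)
print(support, confidence)
```

The antecedent depends on the sum of two genes, so its boundary is not parallel to either gene's axis. A rule built from separate per-gene thresholds on discretized data could not express this cumulative condition directly.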
OP-22 A fully Bayesian model to cluster gene expression profiles
Claus Vogl (1), Fatima Sanchez-Cabo (2), Gernot Stocker (3), Simon Hubbard (4), Olaf Wolkenhauer (5), Zlatko Trajanoski (3)
1) Institute for Animal Breeding and Genomics, Veterinärmedizinische Universität Wien, 2) Institute for Genomics and Bioinformatics, Graz University of Technology, Graz, 3) Institute for Genomics and Bioinformatics, Graz University of Technology, Graz, 4) Faculty of Life Sciences, University of Manchester, Manchester, 5) Institute of Informatics, University of Rostock
ABSTRACT: With cDNA or oligonucleotide chips, gene expression levels of essentially all genes in a genome can be simultaneously monitored over a time course or under different experimental conditions. After proper normalization of the data, genes are often classified into co-expressed classes (clusters) to identify subgroups of genes that share common regulatory elements, a common function, or a common cellular origin, or to impute missing values. With most ad-hoc methods, e.g. k-means, the number of clusters needs to be specified in advance; results depend strongly on this choice. Even with likelihood-based methods, estimation of this number is difficult. Furthermore, missing values often cause problems and lead to the loss of data.
Results: We propose a fully probabilistic Bayesian model to cluster gene expression profiles. The number of classes does not need to be specified in advance; instead it is adjusted dynamically using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) sampler. Imputation of missing values is integrated into the model. With simulations, we determined the speed of convergence of the sampler as well as the accuracy of the inferred variables. Results were compared to the widely used k-means algorithm. With our method, biologically related co-expressed genes could be identified in a yeast transcriptome data set, even when some values were missing.
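The dependence of k-means on a pre-specified number of clusters, noted above as a limitation of the baseline, can be illustrated with a minimal Python sketch. This is plain Lloyd's algorithm on invented toy profiles, with the first k profiles used as initial centroids (an assumption made to keep the run deterministic). It is not the authors' Bayesian model, and the RJMCMC sampler is not sketched here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k profiles serve as initial centroids."""
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical three-point time courses: two rising, two falling, two flat.
profiles = [
    [0.0, 1.0, 2.0],  # rising
    [2.0, 1.0, 0.0],  # falling
    [1.0, 1.0, 1.0],  # flat
    [0.1, 1.1, 2.1],  # rising
    [2.1, 0.9, 0.1],  # falling
    [1.1, 0.9, 1.0],  # flat
]

labels3, _ = kmeans(profiles, k=3)
labels2, _ = kmeans(profiles, k=2)
print(labels3)
print(labels2)
```

With k = 3 the toy profiles separate into rising, falling and flat groups. With k = 2 the two flat profiles are forced into the rising and falling clusters and, on this data, end up in different ones, so the grouping changes with the chosen k.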
OP-23 Fusing microarray experiments with multivariate regression
Walter Gilks (1), Brian Tom (1), Alvis Brazma (2),
1) Medical Research Council, 2) EMBL-EBI
Motivation: It is widely acknowledged that microarray data are subject to high noise levels and results are often platform dependent. Therefore, microarray experiments should be replicated several times and in several laboratories before results can be relied upon. To make best use of such extensive datasets, methods for microarray data fusion are required. Ideally, the fused data should distil important aspects of the data whilst suppressing unwanted sources of variation, and be amenable to further informal and formal methods of analysis. Also, account should be taken of variability in the quality of experimentation.
Results: We present such an approach to data fusion, based on multivariate regression. We apply our methodology to data from Rustici et al. (2004) on cell-cycle control in S. pombe.
Availability: The algorithm implemented in R is freely available from the authors on request.