9:00/13:15 - Room 9 Bis (1st Floor)
T2: GEPAS: New challenges in microarray data analysis
Dr. Joaquin Dopazo
DNA microarray technology is an essential tool for studying biological processes at the genomic level. Nevertheless, with the advent of genome-wide methodologies, new challenges have arisen in the analysis of the huge amounts of data being produced. Important topics in microarray data analysis include data processing, normalization and data transformation procedures; clustering; supervised classification and gene selection; and functional annotation. The proposed tutorial will address these issues by combining theoretical lectures with practical sessions. The practical sessions will demonstrate the use of our web-based suite of tools, GEPAS, for microarray data analysis.
The tutorial will cover several highly relevant topics in the analysis of cDNA microarray data. It will include theoretical lectures and practical sessions in which state-of-the-art methods will be used, as implemented in GEPAS (Herrero et al., 2003, NAR 31, 3461-3467; http://www.gepas.org). Our group has developed this suite of tools for DNA array data analysis, which, in addition to being used for training purposes during the tutorial, is also becoming a standard, with more than 300 experiments analyzed per day.
The tutorial will illustrate the most common steps of data preprocessing, followed by the different methodologies used to answer the scientific questions for which microarrays are used. These include clustering, identification of differentially expressed genes, supervised classification, gene expression signatures and functional annotation of the results.
2.1. Data processing
2.1.1. Normalization of cDNA arrays
The course will cover reading GenePix data, initial diagnostic plots, diagnostic plots for print-order normalization, print-tip loess normalization, and evaluation of normalization results. Our web server, GEPAS, through the DNMAD tool (Vaquerizas et al., Bioinformatics 2004; Herrero et al., NAR, 2004), gives access to print-tip normalization and print-order normalization using loess, as implemented in the limma and marrayNorm packages in Bioconductor. Print-tip normalization is one of the most commonly used normalization methods and has repeatedly been shown to perform well (Yang et al., 2002 NAR 30:e15; Smyth & Speed, 2003 submitted).
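As an illustration of the idea behind print-tip loess normalization, the following minimal Python sketch computes the log-ratio M and the mean log-intensity A of each spot, fits a tricube-weighted local linear curve of M against A separately for each print tip, and subtracts the fitted value. It is a simplified stand-in and not the limma or marrayNorm implementation: it performs a single local fit with no robustness iterations, and the function names, the (tip, red, green) input layout and the span of 0.4 are assumptions made for this example.

```python
import math
from collections import defaultdict


def ma_values(red, green):
    """Return (M, A) for one spot: log2 ratio and mean log2 intensity."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


def local_linear_fit(xs, ys, x0, span=0.4):
    """Tricube-weighted linear fit of ys on xs, evaluated at x0.

    Uses the nearest `span` fraction of the points; no robustness iterations.
    """
    k = min(len(xs), max(2, math.ceil(span * len(xs))))
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    d_max = abs(nearest[-1][0] - x0) or 1.0  # avoid division by zero
    weights = [(1 - (abs(x - x0) / d_max) ** 3) ** 3 for x, _ in nearest]
    sw = sum(weights)
    x_bar = sum(w * x for w, (x, _) in zip(weights, nearest)) / sw
    y_bar = sum(w * y for w, (_, y) in zip(weights, nearest)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, (x, _) in zip(weights, nearest))
    if sxx < 1e-12:  # degenerate neighbourhood: fall back to weighted mean
        return y_bar
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, (x, y) in zip(weights, nearest))
    return y_bar + (sxy / sxx) * (x0 - x_bar)


def print_tip_normalize(spots, span=0.4):
    """spots: list of (print_tip, red, green). Returns normalized M per spot."""
    groups = defaultdict(list)
    for idx, (tip, red, green) in enumerate(spots):
        m, a = ma_values(red, green)
        groups[tip].append((idx, m, a))
    normalized = [0.0] * len(spots)
    for members in groups.values():
        a_vals = [a for _, _, a in members]
        m_vals = [m for _, m, _ in members]
        for idx, m, a in members:
            # Subtract the intensity-dependent trend fitted within this tip.
            normalized[idx] = m - local_linear_fit(a_vals, m_vals, a, span)
    return normalized
```

Fitting each print tip separately means that a dye bias which differs between tips, in offset or in its dependence on intensity, is removed tip by tip rather than averaged over the whole array.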
2.1.2. Data transformation
In addition to normalization, data must be cleaned and transformed. This theoretical/practical session will deal with the most common transformation steps. GEPAS implements two interfaces for data transformation, the PREPROCESSOR (Herrero et al., 2003, Bioinformatics 19, 655-656) and Knowledge Filtering, which provide the user with the most common transformations: logarithmic transformation, merging of replicates, filtering and/or imputation of missing values by different methods, filtering of flat patterns, filtering by biological information on genes, and standardization of patterns.
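The transformations listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PREPROCESSOR code: the function names, the use of None for missing values, row-mean imputation and the 0.5 standard-deviation threshold for flat patterns are all assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev


def log2_transform(values):
    """log2 of positive values; missing or non-positive entries become None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in values]


def merge_replicates(profiles):
    """Average replicate profiles position by position, ignoring missing values."""
    merged = []
    for column in zip(*profiles):
        present = [v for v in column if v is not None]
        merged.append(mean(present) if present else None)
    return merged


def missing_fraction(profile):
    """Fraction of missing values, usable as a filtering criterion."""
    return sum(v is None for v in profile) / len(profile)


def impute_row_mean(profile):
    """Replace missing values by the mean of the observed values of the gene.

    Assumes at least one observed value in the profile.
    """
    fill = mean(v for v in profile if v is not None)
    return [fill if v is None else v for v in profile]


def is_flat(profile, min_sd=0.5):
    """True when a complete profile varies too little to be informative."""
    return pstdev(profile) < min_sd


def standardize(profile):
    """Scale a complete, non-flat profile to mean 0 and standard deviation 1."""
    mu, sd = mean(profile), pstdev(profile)
    return [(v - mu) / sd for v in profile]
```

In a typical workflow these steps would be applied in the order given: log-transform, merge replicates, discard genes with too many missing values, impute the rest, remove flat patterns and finally standardize.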
2.2. Clustering
The identification of co-expressing genes, or the definition of subgroups of experimental conditions based on the comparison of gene expression profiles, is achieved by applying clustering techniques. The tutorial will review different clustering techniques. Agglomerative hierarchical clustering in its different variants (average-linkage, single-linkage, complete-linkage, etc.) (Sneath & Sokal, 1973, Numerical Taxonomy. W. H. Freeman, San Francisco) is still one of the preferred choices for the analysis of gene expression patterns. As an alternative to hierarchical clustering, non-hierarchical methods such as k-means (Hartigan, 1975 Clustering algorithms. Wiley, New York) have been used. Other authors have proposed the use of neural networks as an alternative (Tamayo et al., 1999 Proc. Natl. Acad. Sci. USA 96:2907-2912; Herrero et al., 2001 Bioinformatics. 17:126-136). Unsupervised neural networks, such as Self-Organising Maps (SOM) (Kohonen, 1997 Self-organizing maps, Berlin. Springer-Verlag) or the Self-Organising Tree Algorithm (SOTA) (Dopazo and Carazo, 1997 J. Mol. Evol 44:226-233), provide a more robust framework, appropriate for clustering large amounts of noisy data. The GEPAS package implements some of the most widely used clustering techniques, including average linkage, k-means, SOM, SOTA (Herrero et al., 2001 Bioinformatics. 17:126-136), SOM-UPGMA (Herrero and Dopazo, 2002 Journal of Proteome Research. 1(5), 467-470) and others.
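To make the average-linkage variant concrete, here is a minimal Python sketch of agglomerative clustering of gene profiles, using one minus the Pearson correlation as distance and merging until a requested number of clusters remains. It is a naive illustration for small inputs, not the GEPAS implementation; the function names, the dictionary input and the choice of correlation distance are assumptions for this example.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """1 - Pearson correlation; assumes neither profile is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (vx * vy) ** 0.5


def average_linkage(profiles, n_clusters):
    """profiles: dict gene -> list of values. Returns a list of gene lists."""
    names = list(profiles)
    # Pairwise distances between individual genes, computed once.
    dist = {
        frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Average linkage: mean distance over all between-cluster gene pairs.
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Replacing the mean in the linkage function by min or max would give single-linkage or complete-linkage clustering, respectively.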
2.3. Identification of differentially expressed genes
The tutorial will review some simple methods for finding differentially expressed genes, or genes related to a given parameter or to survival data. The interpretation of the adjusted p-values will be discussed. GEPAS, through the Pomelo tool, implements several basic statistical tests (t-test, one-way ANOVA, linear regression, Cox survival regression, and contingency tables) to identify differentially expressed genes; we provide both unadjusted and adjusted p-values, the latter controlling the Family Wise Error Rate and the False Discovery Rate (see reviews in Dudoit et al., Statistica Sinica, 12:111-139 and Technical Report #110, Division of Biostatistics, UC Berkeley).
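The difference between the two kinds of adjusted p-values can be illustrated with a short Python sketch. The text does not state which procedures Pomelo applies, so the two shown here, the Bonferroni correction for the Family Wise Error Rate and the Benjamini-Hochberg step-up procedure for the False Discovery Rate, are assumptions chosen because they are the simplest representatives of each family.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene is then called differentially expressed when its adjusted p-value falls below the chosen threshold. With thousands of genes, the FDR-adjusted values are usually far less conservative than the FWER-adjusted ones.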
2.4. Discrimination methods for classification of samples from microarray data
The tutorial will cover several methods that have repeatedly shown excellent performance in terms of prediction error rates. These methods are K-Nearest Neighbor, Diagonal Linear Discriminant Analysis, Support Vector Machines, and Random Forests (Dudoit et al., 2002 JASA 97:77-87; Furey et al., 2000 Bioinformatics 16:906-914; Breiman, 2001 Machine Learning 45:5-32). For the first three, we include methods that allow cross-validating the complete process, including gene selection, to avoid selection bias (Ambroise & McLachlan, 2002 PNAS. 99: 6562-6566); for random forests, we include cross-validation of "important genes", as recently proposed in Svetnik et al. (2003, submitted). GEPAS will give access to these methods through its TNASAS tool (Herrero et al., NAR 2004). The tools in TNASAS not only return a predictive model, but also provide an unbiased assessment of its predictive performance.
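The point about selection bias can be made concrete with a small Python sketch of leave-one-out cross-validation in which gene selection is repeated inside every fold, so that the held-out sample never influences which genes are chosen. This is an illustration only and not the TNASAS code: the t-like ranking score, the k-nearest-neighbour classifier with Euclidean distance, the function names and the two-class setting are assumptions for this example.

```python
import math
from statistics import mean, pstdev


def select_genes(samples, labels, n_genes):
    """Rank genes by a t-like score using only the samples given.

    Assumes exactly two classes, each present in `labels`.
    """
    first, second = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, labels) if lab == first]
        b = [s[g] for s, lab in zip(samples, labels) if lab == second]
        spread = pstdev(a) + pstdev(b) + 1e-9  # small constant avoids 0/0
        scores.append((abs(mean(a) - mean(b)) / spread, g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]


def knn_predict(train, train_labels, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(zip(train, train_labels), key=lambda p: math.dist(p[0], query))[:k]
    votes = [lab for _, lab in nearest]
    return max(sorted(set(votes)), key=votes.count)


def loo_error(samples, labels, n_genes, k=3):
    """Leave-one-out error with gene selection redone inside each fold."""
    errors = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Selection sees the training fold only: this is what avoids the bias.
        genes = select_genes(train, train_labels, n_genes)
        reduced_train = [[s[g] for g in genes] for s in train]
        reduced_query = [samples[i][g] for g in genes]
        errors += knn_predict(reduced_train, train_labels, reduced_query, k) != labels[i]
    return errors / len(samples)
```

Selecting the genes once on the full data set and cross-validating only the classifier would let the held-out samples leak into the model, which tends to give optimistically low error estimates.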
2.5. Functional annotation of the results
Information extraction and text-mining techniques have been applied to the analysis of gene expression data (Jenssen et al., 2001 Nat. Genet. 28:21-28). Nevertheless, text-mining methodologies still have many drawbacks (Blaschke et al., 2002 Briefings in Bioinformatics 3:154-165) and few users have access to software implementations. An alternative to extracting information from scientific text sources is to use ontologies. In their simplest representation, ontologies provide a structured description of biological knowledge that is extremely convenient for computational management. Gene Ontology (GO; Ashburner et al., 2000 Nat. Genet. 25:25-29), which organizes information on molecular function, biological processes and cellular components for a number of different organisms, is widely used. These ontologies can be used as a quick and efficient functional annotation tool for the identification and interpretation of clusters of co-expressing genes. In the tutorial, different methods for automatic functional annotation will be described in depth. The practical session will include the use of the FatiGO tool (Al-Shahrour et al., 2004 Bioinformatics, 20: 578-580; http://www.fatigo.org), which allows the exploration of the biological meaning of groups of genes defined by any of the previously mentioned methodologies by extracting the list of over- and underrepresented GO terms. Significance is provided as adjusted p-values, controlling both the Family Wise Error Rate and the False Discovery Rate. The Babelomics suite (Al-Shahrour et al., 2005, NAR, in press; http://www.babelomics.org) contains different extensions of FatiGO, which include information from InterPro motifs, pathways (KEGG), keywords from SwissProt, TFBS, etc.
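One common way to decide whether a GO term is over- or underrepresented in one group of genes relative to another is a two-sided Fisher's exact test on the 2x2 table of annotated versus non-annotated genes. The Python sketch below illustrates this approach; the text above does not state which test FatiGO applies, so the choice of test, the function names and the placeholder gene and term labels are assumptions for this example. The resulting p-values are unadjusted and would still need a multiple-testing correction.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table with the same margins.
    probs = {
        x: comb(row1, x) * comb(row2, col1 - x) / total
        for x in range(lo, hi + 1)
    }
    cutoff = probs[a] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


def go_term_pvalues(group_a, group_b):
    """group_a, group_b: dict gene -> set of GO terms. Returns term -> p-value."""
    terms = set().union(*group_a.values(), *group_b.values())
    results = {}
    for term in sorted(terms):
        a = sum(term in annotated for annotated in group_a.values())
        c = sum(term in annotated for annotated in group_b.values())
        results[term] = fisher_exact_two_sided(
            a, len(group_a) - a, c, len(group_b) - c
        )
    return results


# Hypothetical input: genes of one cluster versus the remaining genes.
cluster = {"g1": {"term_A"}, "g2": {"term_A"}, "g3": {"term_A", "term_B"}, "g4": set()}
rest = {"h1": {"term_A"}, "h2": {"term_B"}, "h3": set(), "h4": set()}
pvalues = go_term_pvalues(cluster, rest)
```

In practice the two groups would be, for example, the genes of one cluster and the rest of the genes on the array, and one test is run per GO term.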