9:00/13:15  Room 9 (2nd Floor)
T1: Mining Structured Data in Bioinformatics
Prof. Stefan Kramer
The topic of the tutorial is mining in structured data. This is particularly relevant for data mining applications in bioinformatics, since the majority of biological data is not kept in databases consisting of a single, flat table. Instead, we are frequently dealing with databases of structured and linked objects. In other words, the “objects” in bioinformatics databases often have a rich internal structure and are connected by some relation. (Consider, for instance, databases of proteins, small molecules, metabolic and regulatory networks, text databases, etc.) The tutorial will give an overview of data mining techniques for sequences, trees, graphs and relational databases. We will present techniques for both descriptive and predictive data mining in this context. In descriptive mining, we are looking for local patterns to characterize the data. In predictive mining, we are looking for models that can be used to make predictions for new, unseen cases. Along the two dimensions (types of data and predictive/descriptive), the tutorial is organized as follows: the first two parts of the tutorial are devoted to descriptive mining in databases of itemsets, strings and sequences, trees, graphs and relational databases. The third part of the tutorial deals with predictive mining based on propositionalization (i.e., feature construction using patterns), instancebased learning and kernel methods for graph and relational databases.
Synopsis:
The tutorial will give an overview of data mining techniques for sequences, trees, graphs and relational databases. Tree mining is of interest for analyzing text or linguistic data (XML documents, parse trees, etc.), RNA data and phylogenetic trees. Graph mining is useful for finding network motifs in metabolic and regulatory networks, and substructural fragments in small molecules. Mining in relational databases is important for analyzing data from heterogeneous sources in the life sciences.
In the first part of the tutorial, I will explain, in ascending order of complexity, mining in itemset (transaction) data, in string or sequence data, and in tree data. Typical strategies for searching the space of patterns will be presented for the simple case of itemsets (transaction data): breadthfirst (APriori and levelwise search), depthfirst, and combined depthfirst/breadthfirst. Subsequently, the basic ideas will be transferred to mining tree and graph data. Another focus will be on the use of index structures for organizing both the data and the patterns, enabling fast access to information about them. String mining will again be explained in ascending order of complexity: first, we will explain mining interesting substrings (without wildcards) in string or sequence databases, then will extend the approach to substrings containing wildcards. Finally, we will explain serial episodes, i.e., substrings with arbitrary gaps between the characters. In tree mining, different classes of tree patterns can be chosen depending on the application in mind (from embedded to free trees). In the presentation, we will focus on the basic, underlying ideas more than on minor technical details.
In the second part, we will continue with mining general graph data. Several of the previously developed concepts can be applied to graph mining as well. After a brief introduction to the topic, we will sketch two of the most important approaches so far: AGM (performing breadthfirst search), and gSpan (performing depthfirst search). Subsequently, we will give a gentle introduction to mining relational databases. For convenience, we will restrict ourselves to simple conjunctive datalog queries. The Warmr algorithm for finding frequently successful datalog queries will be explained in terms of inputs and outputs.
In the third part of the tutorial, we will give an overview of predictive mining in structured data. First, we will explain how patterns found in descriptive mining can be used to reformulate data for predictive purposes. Second, instancebased learning techniques for structured data will be presented. Last, but not least, a survey of kernel methods for structured data, in particular graphs, is given.
All theoretical concepts and algorithms will be illustrated by running examples from data mining applications in bioinformatics. A comprehensive reference list will be included, with links to online resources and literature where available.
Part I: Descriptive Mining:
From Itemsets to Tree Mining
(90 minutes)
 Transaction data and frequent itemsets
 APriori and levelwise search
 String mining
 Introduction to string mining
 Version space trees
 Mining with wildcards
 Mining serial episodes
 Tree mining
 Introduction to tree mining
 Mining embedded subtrees (TreeMiner)
 Mining induced subtrees (FREQT)
 Mining unordered subtrees (Unot)
 Mining free trees (FreeTreeMiner)
Coffee Break (30 minutes)
Part II: Descriptive Mining:
Graph Mining and Mining Relational Databases
(60 minutes)
 Graph mining
 Introduction to graph mining
 AGM
 gSpan
 Mining relational databases
 SQL, datalog, conjunctive queries, query containment, thetasubsumption
 Warmr algorithm (inputs and outputs)
Part III: Predictive Mining:
Propositionalization, InstanceBased Learning and
Kernel Methods
(60 minutes)
 Propositionalization: using frequent patterns for predictive mining
 Instancebased learning: distance measures for graphs and relational databases
 Kernel methods: graph kernels
