Poster Abstracts for Category J: Bioinformatics for diseases
Poster J01
Modeling the Protein-Protein Interactions of the Pro-Apoptotic Protein ASPP2
Benyamini H., Katz C., Rotem S., Friedler A.
The Institute of Chemistry, The Hebrew University
The protein ASPP2 is emerging as a hub of apoptosis regulation. We studied its interactions experimentally and computationally. The experimental binding results served as a basis for a docking analysis and model building of ASPP2 complexes with its partners. Here we present models for the interactions of ASPP2 with the anti-apoptotic Bcl-2 family members and with NF-kappa-B, both key apoptosis regulators. Our models for ASPP2 protein-protein interactions reveal the basis for its pro-apoptotic activity, and lay the foundations for anti-cancer drug design.
ASPP2 is a 1128-amino-acid pro-apoptotic protein that is emerging as a hub of apoptosis regulation. It interacts with numerous apoptosis regulators, including p53, NF-kappa-B (NFkB) and Bcl-2. The C-terminal 193 residues of ASPP2, containing four ankyrin repeats and an SH3 domain, mediate ASPP2 protein-protein interactions. Bcl-2 family proteins are key regulators in the mitochondrial death pathway. The Bcl-2 family consists of both pro-apoptotic and anti-apoptotic members (e.g. Bcl-2, Bcl-XL and Bcl-W) that cooperate through formation of homo/hetero-dimers that maintain the balance between cell death and survival. We identified the interaction sites between ASPP2 and the anti-apoptotic Bcl-2 family members. To identify the ASPP2-binding regions in Bcl-2 family proteins, we designed peptide arrays containing partly overlapping peptides derived from the anti-apoptotic Bcl-2, Bcl-XL and Bcl-W and screened them for binding to ASPP2 893-1128, which contains the ankyrin repeats and SH3 domain (ASPP2Ank-SH3). We found that ASPP2Ank-SH3 binds two homologous sites in all three Bcl proteins: (i) the conserved BH4 motif, essential for inhibition of apoptosis; (ii) a known binding site for pro-apoptotic regulators. Sequence alignment of the peptides among Bcl proteins revealed conserved amino acids that may be important for the interaction, as well as non-conserved positions where only Bcl-2 contains positively charged residues. These may account for its tighter binding to the highly negatively charged ASPP2, which we observed by surface plasmon resonance (SPR). To move from the peptide to the protein level and suggest structural models for the complexes between the full-length Bcl proteins and ASPP2Ank-SH3, we performed docking studies. We used the PatchDock and RosettaDock algorithms to dock the structure of ASPP2Ank-SH3 to structures of Bcl-2 and Bcl-XL.
For both methods, we examined the top 200 ranked models and re-scored them based on the experimental binding data combined with insights from the sequence alignment of the binding peptides. We looked for models whose binding sites include a sufficient fraction of the array-detected binding peptides as well as contacts with specific amino acids that are surface exposed and highly conserved, or with residues that are likely to contribute to the higher binding affinity of Bcl-2 for ASPP2Ank-SH3. Our docking results suggested a preferred binding orientation of ASPP2Ank-SH3 with both Bcl-XL and Bcl-2. The model contains contacts to ASPP2 at the important amino acids detected by sequence alignment, and the residue-residue contacts are conserved in both Bcl-2 and Bcl-XL. The model demonstrates how the functionality of the BH4 and pro-apoptotic sites might be compromised upon ASPP2Ank-SH3 binding. Our model was supported by the reciprocal peptide array experiment, i.e. an array that tests the binding of full-length Bcl-2 to ASPP2 peptides, and by quantitative binding tests of Bcl-2 peptides, including mutants of specific residues suggested by our docking model.
The constitutive activation of the transcription factor nuclear factor kappa B (NFkB) is a hallmark of many highly malignant tumors. Inactivation of NFkB is thus a therapeutic target. Using peptide arrays, we found that ASPP2Ank-SH3 binds four NFkB-derived peptides. A visual examination of the binding peptides on the structure of NFkB revealed a similarity between the suggested binding surface for ASPP2Ank-SH3 and the known binding surface of NFkB for its inhibitor, I-kappa-B (IkB). The fact that both ASPP2Ank-SH3 and IkB contain ankyrin repeats led us to hypothesize that the two proteins bind NFkB in a similar manner. To test our hypothesis, we examined the structures and sequences of ASPP2Ank-SH3 and IkB.
On the one hand, both proteins contain ankyrin repeats; on the other hand, ASPP2Ank-SH3 contains four repeats while IkB contains six. For a similar interaction with NFkB to occur, two conditions should be fulfilled: (i) the IkB binding site should have a corresponding site on ASPP2Ank-SH3; (ii) NFkB should undergo a backbone hinge movement to accommodate the shorter ASPP2Ank-SH3. To check the first condition, we compared the NFkB binding site on IkB to its corresponding putative site on ASPP2Ank-SH3. Both binding sites have a cluster of hydrophobic residues and two negatively charged residues. To check the second condition, i.e. the possibility of a hinge bending motion of NFkB, we applied two hinge detection algorithms to the structure of NFkB: HingeProt and El-nemo. The results of both algorithms suggest a hinge bending motion of the C-terminal tail. Such a movement may enable NFkB to accommodate ASPP2Ank-SH3. We performed rigid docking with a distance constraint, in which we docked the C-terminal tail of NFkB to the rest of the structure together with ASPP2Ank-SH3. The third-ranked solution retained most of the binding features observed in the NFkB-IkB complex, indicating that ASPP2 may inhibit NFkB in a manner similar to IkB. An assessment of the model with the servers FastContact and CoilCheck suggests a low-energy complex. For further model assessment, the system is currently being simulated by molecular dynamics. Our models for ASPP2 protein-protein interactions reveal the basis for its pro-apoptotic activity, and lay the foundations for anti-cancer drug design.
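The re-scoring of docking models against the peptide-array data, described above for the Bcl-2 and Bcl-XL complexes, can be illustrated with a minimal Python sketch. It is not the authors' scoring scheme: the `DockingModel` record, the `rescore` and `rerank` functions, the additive score and the `key_weight` parameter are all hypothetical and serve only to show the idea of favouring models whose interface covers the array-detected peptides and selected conserved residues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingModel:
    """Hypothetical record for one docking solution."""
    name: str
    rank: int             # rank assigned by the docking program
    interface: frozenset  # partner-protein residue numbers contacting ASPP2


def rescore(model, peptide_residues, key_residues, key_weight=0.5):
    """Assumed score: fraction of array-detected peptide residues found at
    the interface, plus a weighted fraction of key (e.g. conserved) residues."""
    if not peptide_residues:
        raise ValueError("peptide_residues must not be empty")
    peptide_cov = len(model.interface & peptide_residues) / len(peptide_residues)
    key_cov = 0.0
    if key_residues:
        key_cov = len(model.interface & key_residues) / len(key_residues)
    return peptide_cov + key_weight * key_cov


def rerank(models, peptide_residues, key_residues, top_n=200):
    """Keep the top_n models by docking rank, then sort by the assumed score."""
    candidates = [m for m in models if m.rank <= top_n]
    return sorted(
        candidates,
        key=lambda m: rescore(m, peptide_residues, key_residues),
        reverse=True,
    )
```

Calling `rerank(models, peptide_residues, key_residues)` returns the retained models with the best experimentally supported interface first; the residue sets would come from the peptide arrays and the sequence alignment.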
Keywords: protein-protein interactions, protein docking, apoptosis
Poster J02
Computational redesign of binding of the cytosolic NEP peptide to merlin and moesin proteins
Masha Y. Niv (1), Katsuyuki Iida (2), Rong Zheng (2), Akio Horiguchi (2), David M. Nanus (2)
(1) Hebrew University, Israel, (2) Weill Medical College of Cornell University
Neutral endopeptidase 24.11 (NEP) is a 90- to 110-kDa cell-surface peptidase that is expressed by numerous tissues. A decrease in NEP expression is associated with a variety of malignancies. The anti-oncogenic function of NEP is also mediated through direct protein-protein interactions of NEP's cytosolic region with several partners. Using experimental data, threading and sequence analysis, we identify the interacting region between NEP and moesin, and engineer a related protein, merlin, into a NEP-binding protein.
Neutral endopeptidase 24.11 (NEP) is a 90- to 110-kDa cell-surface peptidase that is normally expressed by numerous tissues, including prostate, kidney, intestine, endometrium, adrenal glands and lung. This enzyme cleaves peptide bonds on the amino side of hydrophobic amino acids and inactivates a variety of physiologically active peptides, including bradykinin, oxytocin, endothelin-1, and bombesin-like peptides. Loss of or decrease in NEP expression has been reported in a variety of malignancies. The anti-oncogenic function of NEP has been found to be due not only to its catalytic activity but also to direct protein-protein interactions with other proteins. The ezrin/radixin/moesin (ERM) proteins are important binding partners of NEP. ERM proteins consist of an N-terminal FERM domain followed by a coiled-coil segment and a C-terminal domain containing an actin-binding motif. ERM proteins supply a functional linkage between integral membrane proteins and the cytoskeleton in mammalian cells to regulate membrane protein dynamics and cytoskeleton rearrangement. Ezrin has been shown to promote tumorigenesis: it is necessary for Net- and Dbl-mediated transformation of fibroblastic cells and it enhances metastasis in mouse models of osteosarcoma and rhabdomyosarcoma, but the mechanisms by which ERM proteins promote tumorigenesis need further elucidation. It was previously shown that NEP co-immunoprecipitates with ezrin, radixin and moesin in NEP-expressing LNCaP prostate cancer cells and MeWo melanoma cells. Co-immunoprecipitation showed that ERM proteins associate with wild-type NEP protein but not with NEP protein containing a truncated cytoplasmic domain or with a mutant in which the positively charged amino acid cluster, K19K20K21, was replaced by QNI residues. In-vitro binding assays showed that the positively charged cluster is required for NEP binding to recombinant N-terminal fragments of ERM proteins.
Binding of ERM proteins to NEP resulted in decreased binding between ERM proteins and the hyaluronan receptor CD44, a main binding partner of ERM proteins. Cells expressing wild-type (but not mutated) NEP demonstrated decreased adhesion to hyaluronic acid (HA) and cell migration. These data suggest that NEP can affect cell adhesion and migration through direct binding to ERM proteins. A protein that displays significant homology to the ERM proteins is merlin, encoded by the NF2 (neurofibromatosis type 2) gene. Merlin shares its domain organization with ERM proteins but does not contain a canonical actin-binding motif at its C terminus. In addition, although the phosphorylated, presumably open form of merlin can form heterodimers with ezrin and other ERM proteins and localize at the cell cortex, it is the dephosphorylated form of merlin that opposes cell proliferation and transformation and is, therefore, considered active. This anti-mitogenic and tumor-suppressor function of merlin is unique and contrasts with the ERM proteins' functions. Interestingly, merlin also interacts with CD44 (via residues 1-50). While the ERM-CD44 interaction is suggested to be tumor-promoting, the merlin-CD44 interaction inhibits the CD44-HA interaction and thus contributes to the tumor-suppressor function of merlin. Therefore, elucidation of the differences between merlin and ERM proteins in terms of their interactions with their binding partners is essential in order to delineate their complex roles in tumor and metastasis promotion and suppression. Here we investigate the binding between NEP and moesin, an ERM protein, and between NEP and merlin. We show that NEP does not bind merlin. We then use experimental data, threading and sequence analysis to identify residues in ERM that are likely to be the binding determinants of NEP. 
We show that swapping these binding determinants in moesin to amino acids that occupy the same position in merlin disrupts the binding of moesin to wild-type NEP. Swapping this region in merlin for moesin residues causes gain of binding to wild-type NEP, but not to NEP's QNI mutant. We thus identify the interacting region between NEP and moesin, and engineer merlin to become a NEP-binding protein. These data form the basis for further exploration of the structural details of NEP-ERM binding.
Keywords: cancer, NEP, FERM domain, protein/protein interactions, peptide binding, reengineering
Poster J03
Bag of Peaks: interpretation of NMR spectroscopy
Manuele Bicego (1), Gavin Brelstaff (2), Nicola Culeddu (3), Matilde Chessa (4)
(1) DEIR - University of Sassari - Sassari - Italy, (2) Biocomputing, CRS4 - Pula (CA) - Italy, (3) ICB - CNR - Sassari - Italy, (4) Porto Conte Ricerche - Alghero (SS) - Italy
The analysis of high-resolution proton NMR spectroscopy is often obscured by the adoption of black-box algorithms. We seek an intermediate representation able to furnish a more communicative interface between human expert and machine. The representation is based on peaks, which are coded and used to compile a dictionary for all traces; the dictionary is also used to transform the trace data into a format, known as the Bag of Peaks, that is useful for classification. Our pilot study of Type I diabetes among Sardinian children demonstrates the efficacy of Bag of Peaks descriptors over those of a standard PCA.
High-resolution proton Nuclear Magnetic Resonance (NMR) spectroscopy indicates the metabolic composition of bio-fluids such as blood plasma or urine and thus constitutes a useful tool for clinical diagnosis and toxicology. In the resultant trace, different metabolic species produce different peaks, depending on the chemical environment of their source nuclei, and hundreds of chemical compounds may thus be revealed to the expert eye in a single act of measurement. However, computer-automated NMR analysis is often poorly served by black-box algorithms: where interpretative features lose amplitude or suffer random variations, it is seldom understood whether the results degrade gracefully in accordance with expert opinion or because of a peculiarity of the algorithm. PCA-centric techniques (e.g. NIPALS, PRESS, VARIMAX, HCA), though popular, may well be unsuited to sets of NMR traces, since the dominant statistical variation is not additive Gaussian noise in amplitude but rather unpredictable horizontal (left-right) drift in the spectral loci of peak features. Indeed, those techniques prioritise data reduction at the expense of interpretability. By examining our data-sets we seek an intermediate representation able to furnish a more continuous communicative interface between human expert and machine, to assist in identifying metabolites expressed in diseased biofluids. The proposed intermediate representation is based on the concept of a peak: experts tend to reason on the basis of visible peaks, not indistinct undulations or even visible troughs. Any peak with a well-defined structure may serve, not just those with large amplitude, as long as the structure has at least one visible flank from which the expert may gauge both its amplitude and width. Though theory indicates that peaks should follow a Lorentzian profile, we found that in practice any well-defined peak can be approximated by fitting a simpler Gaussian function across its visible extent.
Each trace is represented by the set of its well-defined peaks, each peak being parameterized as: p - spectral location of its maximum; a - its maximum amplitude; w - peak width estimated from visible flank(s). The spectral energy of each peak (be it single- or double-flanked) may thus be approximated as the product a·w, which is, up to a constant factor, the integral beneath the curve if it were indeed Gaussian. This intermediate representation is then used to compile a common dictionary for all traces acquired for the study. This dictionary-based approach is inspired by the so-called "Bag of Words" approach, a successful representation method from the field of linguistics. It has the interpretive advantage that the dictionary may be inspected by the expert and, where necessary, adjusted. The dictionary is built by clustering peaks, supplied in a training set chosen by the expert, on the basis of similarity of peak locus. The trace data is transformed using the dictionary into a format, which we call the Bag of Peaks, from which classical supervised classification may proceed. We shall present both further algorithmic details of our approach and, in graphical and tabular form, a comparative evaluation against PCA-based methods benchmarked with four different standard classifiers (ranging from basic to quite complex): Nearest Neighbour, K-nearest neighbour rule, Logistic Linear Classifier, and Radial Basis Support Vector Machine (coded using the PRTOOLS Matlab toolbox). For our experiments we used a data-set of 32 traces acquired by standard protocols from the urine of Sardinian children, with the aim of classifying the half known to suffer from Type I diabetes. Each sample was acquired by an AVANCE 600MHz spectrometer (Bruker Milan, Italy) at 300K operating at 600.13 MHz in 1H observation mode.
To each 400 µl sample aliquot was added 200 µl of sodium phosphate buffer (0.2M Na2HPO4 in H2O and 0.2M NaH2PO4 in 80:20 H2O:D2O, pH 7.4) containing 1 mM sodium trimethylsilyl [2,2,3,3-2H4] propionate (TSP) and 3mM sodium azide. Samples were centrifuged at about 1800xg for 5 minutes to eliminate solid debris. NMR acquisition was performed using the first increment of a NOESY sequence with irradiation of the water frequency during the mixing time and relaxation delay, adopting 128 FIDs of 64k data points over a spectral width of 12376 Hz. Since the number of samples is relatively small from a statistical perspective, we carried out our evaluations using the well-known cross-validation technique called Leave One Out. The results of comparing the Bag of Peaks representation (33 dimensions) with the PCA representation (31 dimensions) show that the accuracy computed for the PCA is on a par with that of the Bag of Peaks for only one of the classifiers, the one that performs worst. Elsewhere the Bag of Peaks consistently gives a more accurate classification. The results attest to the efficacy of Bag of Peaks descriptors over those of a standard PCA. Not only do they produce more accurate classifications over a range of dimensionalities, they also deliver practical suggestions for metabolic peak loci that may be implicated in the disease (e.g. Arginine, Creatine and unfamiliar peaks in the range [3.24:3.29] ppm).
Acknowledgment: We acknowledge Sergio Uzzau and the facilities provided at Porto Conte Ricerche, Alghero, in Sardinia.
Keywords: NMR spectroscopy, intermediate representation, dictionary, supervised classification
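The peak dictionary and the Bag of Peaks descriptor outlined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which clustering algorithm is used, so the gap-based grouping of peak loci, the tolerance `tol` (in ppm) and the function names `build_dictionary` and `bag_of_peaks` are hypothetical. Each peak is a tuple (p, a, w) as defined in the text, and each dictionary entry accumulates the approximate energy a·w of the peaks that fall on it.

```python
def build_dictionary(training_peaks, tol=0.02):
    """Cluster peak loci: sort them and start a new cluster whenever the gap
    to the previous locus exceeds tol. Returns the cluster centres (assumed
    stand-in for the unspecified clustering step)."""
    loci = sorted(p for p, _a, _w in training_peaks)
    if not loci:
        return []
    clusters = [[loci[0]]]
    for p in loci[1:]:
        if p - clusters[-1][-1] <= tol:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return [sum(c) / len(c) for c in clusters]


def bag_of_peaks(trace_peaks, dictionary, tol=0.02):
    """Map one trace to a fixed-length vector: for each dictionary entry,
    the summed energy a*w of the trace peaks assigned to it."""
    bag = [0.0] * len(dictionary)
    if not dictionary:
        return bag
    for p, a, w in trace_peaks:
        # Nearest dictionary entry by spectral location.
        idx = min(range(len(dictionary)), key=lambda i: abs(dictionary[i] - p))
        if abs(dictionary[idx] - p) <= tol:
            bag[idx] += a * w  # peaks far from every entry are ignored
    return bag
```

The resulting vectors have one dimension per dictionary entry and can be passed to any standard supervised classifier; because each dimension corresponds to a spectral locus, the expert can inspect or adjust the dictionary directly.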
Poster J04
Kernel ENDEAVOUR: a web platform for kernel based gene prioritization by data fusion
Shi Yu (1), Leon-Charles Tranchevent (1), Roland Barriot (1), Daniela Nitsch (1), Tijl De Bie (2), Bart De Moor (1), Yves Moreau (1)
(1) Bioinformatics group, SCD, Department of Electrical Engineering, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium, (2) Department of Engineering Mathematics, University of Bristol, University Walk, BS8 1TR, Bristol, UK
This poster presents Kernel Endeavour (http://www.esat.kuleuven.be/kendeavourweb), the latest updated version of Endeavour, a web platform for the prioritization of genes. In Kernel Endeavour, gene prioritization is regarded as a kernel-based novelty detection problem, and information from multiple data sources is integrated as a weighted combination of kernels. Previous research has shown that the kernel-based method has the advantage of weighing different sources and obtains better performance than the original Endeavour when the specific genes are given.
Kernel Endeavour (http://www.esat.kuleuven.be/kendeavourweb) is the latest updated version of Endeavour, a web platform for the prioritization of genes. In Kernel Endeavour, gene prioritization is regarded as a kernel-based novelty detection problem, and information from multiple data sources is integrated as a weighted combination of kernels. Previous research has shown that the kernel-based method has the advantage of weighing different sources according to their relevance to the prioritization problem when the specific genes are given. Owing to this adaptivity in data fusion, kernel-based gene prioritization performs better than the original Endeavour system in disease-based gene validation (De Bie et al., 2007). In this poster, we will discuss the implementation of Kernel Endeavour in the following aspects:
* Conceptual overview of Kernel Endeavour
* Kernel-based algorithm: one-shot prioritization by data fusion
* Adaptive fusion of pre-computed kernels
* Dimensionality reduction of kernels containing full genomic information
* Separation of offline and online processes for rapid web response
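The idea of fusing data sources as a weighted combination of kernels can be illustrated with a short Python sketch. It is a simplification and not the Kernel Endeavour algorithm: the weights are supplied by the caller rather than learned, and candidates are ranked by their mean combined-kernel similarity to the training genes instead of by a novelty-detection model. The function names `combine_kernels` and `prioritize` are hypothetical.

```python
def combine_kernels(kernels, weights):
    """Return the weighted sum of square kernel matrices (lists of lists),
    i.e. K = sum_j weights[j] * kernels[j]."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


def prioritize(K, training_idx, candidate_idx):
    """Rank candidate genes by mean similarity to the training genes under
    the combined kernel K (assumed scoring rule, highest first)."""
    scores = {
        c: sum(K[c][t] for t in training_idx) / len(training_idx)
        for c in candidate_idx
    }
    return sorted(candidate_idx, key=lambda c: scores[c], reverse=True)
```

Changing the weights changes which data source dominates the ranking, which is the effect the adaptive fusion step is meant to exploit.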
Keywords: Kernel method, Gene prioritization, Disease gene
Poster J05
Dealing with bias in clinical data sets: Distribution matching and transfer learning for HIV therapy selection
Bogojeska J., Bickel S., Scheffer T., Lengauer T.
Max Planck Institute for Informatics
We tackle the problem of bias in HIV data sets used for predicting the outcome of combination drug therapies. We develop a novel method that trains a separate model for each drug combination by using data from all available therapies with proper weights. This enables predictions even for therapies with very few or no training examples and copes with the scarceness and uneven representation of therapies in the clinical databases. We also address the evolution of the viral genome in response to treatments over time. Our method significantly improves the accuracy of predicting therapy success.
The Human Immunodeficiency Virus (HIV) targets the human immune system and leads to the acquired immunodeficiency syndrome (AIDS). Despite the large number of available antiretroviral drugs, the virus cannot be eradicated completely from the patient's body and AIDS continues to cause high rates of mortality. What makes HIV so resilient is its high genetic variability, which enables the virus to develop dynamic quasispecies harboring drug-resistant mutants. It has been observed that combinations of several drugs can lead to prolonged virus suppression and restoration of immunologic function. This has paved the way for drug combinations to become the standard means of treating HIV patients. However, HIV will eventually develop variants resistant to any drug combination. This makes finding a successful regimen while keeping future therapy options open a major challenge for AIDS treatment. Considering the large number of available drugs, the number of potential therapy combinations is too large for human assessment. Thus, in recent years, statistical models of therapy success have been developed on the basis of clinical training data. The relevant data sets, which comprise information on the therapies of many patients, are biased in many ways: both the viral sequence and the drug combinations evolve over time; many therapy combinations are underrepresented and others are not present at all; data are collected under different constraints. Therefore the data in clinical databases cannot be considered "objective". Adequate methods that deal with the bias in the data according to physicians' demands are required. The predictions of the outcome of a therapy use information about the presence of a set of resistance-relevant mutations in the virus genotyped from the patient, and the patient's historic treatment records.
Here we present a novel method for predicting the outcome of a potential therapy of a patient by training a separate model for each particular therapy (Bickel et al.). Each model is trained by using all available training data with proper weights. The weights are derived such that the distribution of the examples of all therapies is matched to the distribution of the therapy of interest. Moreover, we use prior knowledge about the similarity of the therapy combinations represented by suitable kernel functions. In this way the model for each therapy can benefit from all information available for similar therapies. This transfer learning approach is especially useful to circumvent the problem of many drug combinations being present with very few or no examples in the training set. The similarity between two different therapy combinations is quantified with two different kernel functions. According to the first one, the pair-wise similarities of two different therapy combinations are based on the set of common resistance-relevant mutations of their respective sets of drugs. The second kernel derives the similarities by comparing phenotypic information for the resistance of the drugs comprising the therapies of interest. There is a constant evolution of the viral sequence under drug pressure over time. The treatments also change with the introduction of new drugs. We address this issue by using time-consistent splits when choosing the training and the test sets: the most recent data samples are selected as a test set and the rest are used as a training set. In this way our models learn from data seen in the more distant past and their performance is measured on unseen data from the more recent past. The experimental results obtained on the clinical dataset from the EuResist project (Rosen-Zvi et al.) show that our method significantly improves the overall prediction accuracy.
1. Bickel, S., Bogojeska, J., Lengauer, T., Scheffer, T. Multi-Task Learning for HIV Therapy Screening. Proceedings of the ICML. (2008)
2. Rosen-Zvi, M., Altmann, A., Prosperi, M., Aharoni, E., Neuvirth, H., Sönnerborg, A., Schülter, E., Struck, D., Peres, Y., Incardona, F., Kaiser, R., Zazzi, M., Lengauer, T. Selecting anti-HIV therapies based on a variety of genomic and clinical factors. Proceedings of the ISMB. (2008)
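The time-consistent split described in this abstract, in which the most recent samples form the test set, can be sketched as follows. This is a minimal illustration only: the record layout (a dictionary with a `"start"` date), the function name `time_consistent_split` and the `test_fraction` parameter are assumptions, and the abstract does not state what fraction of the data was held out.

```python
from datetime import date


def time_consistent_split(samples, test_fraction=0.2):
    """Sort therapy samples by start date and hold out the most recent
    test_fraction of them as the test set (at least one sample)."""
    ordered = sorted(samples, key=lambda s: s["start"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical example records; only the "start" field is used by the split.
example = [
    {"id": "t3", "start": date(2006, 5, 1)},
    {"id": "t1", "start": date(2001, 3, 9)},
    {"id": "t5", "start": date(2008, 1, 20)},
    {"id": "t2", "start": date(2003, 11, 2)},
    {"id": "t4", "start": date(2007, 7, 15)},
]
train, test = time_consistent_split(example, test_fraction=0.4)
```

With this split every training sample predates every test sample, so a model is always evaluated on therapies that started after the ones it learned from.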
Keywords: HIV, distribution matching, transfer learning, bias
Poster J06
IDENTIFICATION OF SIGNIFICANT OVERLAP OF DIFFERENTIALLY EXPRESSED AND GENOMIC IMBALANCED REGIONS IN CANCER DATASETS
Ferrari F. (1), Spinelli R. (2), Mangano E. (3,6), Beltrame L. (2), Zampieri M. (4), Cifola I. (2), Peano C. (2), Bicciato S. (5), Battaglia C. (3,6)
(1) Dept. of Biology, University of Padua, Padova, Italy, (2) Institute of Biomedical Technologies (ITB), National Research Council (CNR), Milan, Italy, (3) Dept. of Biomedical Science and Technologies and PhD School of Molecular Medicine, University of Milan, Italy, (4) SISSA-ISAS, International School for Advanced Studies, Trieste, Italy, (5) Dept. of Biomedical Sciences, University of Modena and Reggio Emilia, Modena, Italy, (6) Interdisciplinary Center for Biomolecular Studies and Industrial Applications (CISI), University of Milan, Milan, Italy
We present a bioinformatics procedure that allows the identification of genome-wide, concurrent alterations of copy number and regional gene expression in single and multiple cancer samples through the integration of copy number and transcriptional data with genomic structural information.
INTRODUCTION The integration of high-throughput genomic and transcriptional data with gene structural information (i.e., chromosomal localization) and functional characteristics represents an opportunity for deciphering how the structural organization of genomes influences their functional utilization and for discovering novel cancer biomarkers. However, the development of integrative approaches to complement gene expression profiling data with other types of gene information still represents a computational challenge. The purpose of this work is to present a bioinformatics procedure that allows integrating copy number (CN) and transcriptional data, identifying genome-wide, concurrent alterations of copy number and regional gene expression (GE) in single cancer samples, and extending the integrative analysis to multiple cancer datasets.
METHODS We present a computational method for the identification of Significant Overlaps of Differentially Expressed and Genomic Imbalanced Regions (SODEGIR). The approach can be divided into three steps: 1) integration of copy number and gene expression data with structural information; 2) statistical estimation of copy number and transcriptional statuses and identification of regional chromosomal aberrations (SODEGIR) on a single-sample basis; 3) aggregation of SODEGIRs from different samples to obtain global signatures of specific tumor types. Steps 1 and 2 represent the main features of the algorithm. The first step integrates copy number and gene expression data from high-throughput technologies (e.g., Affymetrix transcription and mapping arrays) with structural information (i.e. gene positions) for the identification of imbalanced chromosomal regions. This step is based on a kernel smoothing procedure with adaptive bandwidth which accounts for variations in gene density without any assumption on gene distribution [1-2].
Copy number and gene expression statistics at input design points (i.e., SNP or probe-set chromosomal localization) are estimated at output design points (i.e., Entrez GeneID chromosomal positions) using a locally adapted regression bandwidth which accounts for heterogeneous densities of CN and GE probes along the chromosomes. In the second step the statistical relevance of modulated chromosomal regions is evaluated using a permutation and data categorization scheme [2]. The G statistics are firstly randomly assigned to G chromosomal locations through permutations and then, for each permutation, smoothed over the chromosomal coordinate using the kernel regression function. Thus, observed and random statistics are smoothed and compared exactly over the same region, taking into account variations in the gene distances and density. The significance of the CN and transcriptional imbalances is empirically computed as the probability that the random statistic exceeds the observed statistic over B permutations. Once the distribution of empirical p-values has been generated, the q-value is used to determine significant overlaps of differentially expressed and genomic imbalanced regions (SODEGIR) at the chromosome level of a single sample. The third step assesses the presence of common SODEGIR signature across multiple samples. Specifically, SODEGIR from all single sample analyses are aggregated to generate summary scores for amplifications and deletions using a binomial distribution test and the q-value to correct for multiple hypothesis testing. The whole methodology has been applied on Affymetrix gene expression and mapping data including a total of 66 HG-U133 Plus 2.0 and 215 Human Mapping arrays (100K and 250K sets). The sample datasets include: normal samples (Affymetrix reference); a renal cancer cell line (Caki-1); astrocytoma samples [3]; paired normal/tumor samples of clear cell renal carcinomas (RCC) [4]. 
RESULTS AND DISCUSSION Sample datasets have been used to optimize the parameters of the computational procedure. In particular, attention has been devoted to assessing the effects of different methods for estimating CN, to selecting the appropriate CN and GE statistics for single-sample analysis, to controlling the adaptive regression bandwidth and to quantifying the level of integration between CN, GE and structural information. After parameter optimization, the SODEGIR approach was applied to the cancer data sets and identified signatures of deletion (CN loss and GE down-regulation) and amplification (CN gain and GE up-regulation). In addition, our results highlighted a significant quantitative relationship between CN and GE. The aggregate analysis of tumor samples resulted in a SODEGIR signature for each tumor type: the astrocytomas are characterized by amplifications on chromosome 7 and deletions on chromosome 10, whereas RCCs show amplifications on chromosome 5 and deletions on chromosome 3 [5, 6]. The entire procedure is platform-independent and applicable to large cohorts of tumor samples. The SODEGIR approach represents a novel method to integrate structural and functional high-throughput data for discovering new potential tumor markers.
REFERENCES
[1] Herrmann E., Journal of Computational and Graphical Statistics, 1997, 6:35-54.
[2] Callegaro A. et al., Bioinformatics, 2006, 22(21):2658-66.
[3] Kotliarov Y. et al., Cancer Res., 2006, 66(19):9428-36.
[4] Cifola I. et al., Mol. Cancer, 2008, 7(1):6.
[5] Beroukhim R. et al., Proc Natl Acad Sci U S A, 2007, 104(50):20007-12.
[6] Strefford JC. et al., Cancer Genet Cytogenet., 2005, 159(1):1-9.
ACKNOWLEDGEMENTS This work was supported by University of Padova (CPDA065788/06 and CPDR074285/07), Fondazione CARIPARO (Progetti Eccellenza 2006), MIUR (FIRB RBLA03ER38) and funds to Interdisciplinary Center for Biomolecular Studies and Industrial Applications (CISI) and Dept. of Biomedical Science and Technologies, University of Milano.
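The permutation scheme in the METHODS section, in which observed and permuted statistics are smoothed over the same chromosomal coordinates and compared, can be sketched in Python as below. This is a simplified illustration and not the SODEGIR implementation: it uses a Gaussian kernel with a single fixed `bandwidth` where the method uses a locally adaptive one, it omits the q-value step, and the function names `smooth` and `permutation_pvalues` as well as the defaults `n_perm` and `seed` are hypothetical.

```python
import math
import random


def smooth(positions, values, bandwidth):
    """Gaussian kernel regression of values along positions, evaluated at
    the input positions (fixed bandwidth; the method uses an adaptive one)."""
    out = []
    for x in positions:
        weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions]
        out.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return out


def permutation_pvalues(positions, values, bandwidth, n_perm=200, seed=0):
    """Empirical p-value per position: the fraction of permutations in which
    the smoothed permuted statistic is at least the smoothed observed one."""
    rng = random.Random(seed)
    observed = smooth(positions, values, bandwidth)
    exceed = [0] * len(positions)
    shuffled = list(values)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign statistics to random locations
        permuted = smooth(positions, shuffled, bandwidth)
        for i, (r, o) in enumerate(zip(permuted, observed)):
            if r >= o:
                exceed[i] += 1
    return [e / n_perm for e in exceed]
```

Positions inside a run of consistently high statistics receive small p-values, because a random reassignment rarely reproduces such a run at the same place; isolated positions do not. The same routine would be applied separately to the CN and GE statistics before testing for overlap.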
Keywords: Genotyping, gene expression, microarrays
Poster J07
Robust multiclass predictor based on redefined genes: study on leukemias to reveal gene signatures associated
Fontanillo C., Risueño A., Prieto C., De Las Rivas J.
Cancer Research Center (CIC, CSIC/USAL), Salamanca, Spain
We have built a robust multiclass leukemia predictor based on expression signals of genes re-defined by mapping the probes of Affymetrix microarrays to the current Ensembl human transcriptome. After gene re-definition, statistical and machine learning methods were used to build a predictor that reflects the biological entities behind the classes. The biological significance of the marker genes is assessed by comparing the error of the optimum classifier with the distribution of errors obtained with a series of predictors built by random selection of genes with decreasing statistical significance.
BACKGROUND: Microarray data and the genome-wide expression profiles derived from them are widely used in functional genomics and have become very useful for building biological class predictors with different machine learning methods and strategies. Disease classification and outcome prediction are probably among the main applications of such technologies. However, some fundamental problems have to be addressed before these data can be used properly: (i) Most predictors based on expression profiles obtained from microarray data identify as genes a pre-determined set of oligonucleotide probes defined by the manufacturer. This introduces a lot of biological noise because much of the genome coding information enclosed in the defined probe sets is out of date. In fact, it has been reported that many of the current genes/transcripts differ from their original definitions when the probes are mapped to the new genome information. Besides, many genes are defined by the manufacturers with multiple probe sets, and the expression values of probe sets corresponding to the same gene can be very different. (ii) In most class prediction methods it is not clear how to find the best predictors, that is, those that provide minimal errors but also reflect the meaningful biological signature underlying the separated classes. (iii) Many class prediction methods are “black boxes” with respect to the biological entities (i.e. genes) behind the predictor figures (NN, kNN, discriminant analysis) and are therefore of little use for discovering the biology behind the classification. RESULTS: We have built a robust multiclass leukemia predictor based on expression signals of re-defined genes on Affymetrix U133 Plus 2.0 GeneChip microarrays. The gene identification was based on a complete sequence re-mapping of all the microarray oligonucleotide probes to the current human transcriptome.
Using the BLAST algorithm, we matched the probes to Ensembl human transcripts and genes and grouped all probes mapping within a gene locus into a single new probe set. This strategy resolves obsolete gene mappings and the ambiguous assignment of multiple probe sets to one gene. After gene re-definition, we use a combination of statistical and machine learning methods to build a predictor that reflects, in a transparent way, the biological entities behind the different leukemia classes (ALL, AML, CLL and CML). First, we use the Parametric Empirical Bayesian method (PEB) [1] to explore the differential transcriptomic profiles and find the genes that best discriminate each type of leukemia from the rest and from a set of non-leukemia samples (NoL). This step provides adequate feature selection (i.e. gene selection) for the classifier. Second, we build a classifier using a multiclass Support Vector Machine (mcSVM) [2]. This is a robust method that identifies key discriminatory elements (called Support Vectors) both in the sample space and in the gene space. Third, we apply the double nested Cross-Validation (dnCV) method [3] to evaluate the generalization errors of the predictors. Together, these steps provide an accurate multiclass leukemia predictor that was further tested with an external set of samples. Finally, to assess the biological significance of the genes included as markers in the predictor, we compare the error of the chosen optimum classifier with the distribution of errors obtained with a series of predictors built by random selection of genes with decreasing statistical significance for each class. If we assume that randomly selected predictors should not reveal the biological signature of each class, the distance between the random errors and the optimum generalization error should show how well defined the biological signature of each class is.
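The comparison between the optimum classifier and the randomly built predictors can be illustrated with a minimal Python sketch. The abstract does not state how the "distance" is quantified, so the standardized difference and the empirical p-value used here are assumptions, as are the function name `signature_distance` and all numbers, which are hypothetical and not results from the poster.

```python
import random
import statistics


def signature_distance(optimum_error, random_errors):
    """Compare one classifier error with errors from random gene sets.

    Returns a standardized distance (in standard deviations of the random
    errors) and an add-one empirical p-value. Both summaries are assumed
    here for illustration; the poster does not specify them.
    """
    mean_random = statistics.mean(random_errors)
    sd_random = statistics.stdev(random_errors)
    distance = (mean_random - optimum_error) / sd_random
    # Count random predictors that do at least as well as the optimum one.
    n_as_good = sum(err <= optimum_error for err in random_errors)
    p_value = (n_as_good + 1) / (len(random_errors) + 1)
    return distance, p_value


# Hypothetical generalization errors, for illustration only.
rng = random.Random(0)
random_errors = [rng.uniform(0.20, 0.40) for _ in range(200)]
distance, p_value = signature_distance(0.05, random_errors)
```

A large positive distance together with a small p-value would suggest that the selected marker genes carry a class signature that random gene sets do not reproduce.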
As far as we know, none of the current methods for building predictors addresses this point, but we think it is critical for finding biologically significant markers for specific classes. In this context, the feature selection step proved to be critical for the adequate construction of a predictor with biological meaning. BIBLIOGRAPHY 1. Kendziorski C. M., Newton M. A., Lan H. and Gould M. N. (2003). On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine, 22: 3899-3914. 2. Weston J. and Watkins C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), Bruges, April 21-23. 3. Barrier A., Lemoine A., Boelle P., Tse C., Brault D., Chiappini F., Breittschneider J., Lacaine F., Houry S., Huguier M., Van der Laan M., Speed T., Debuire B., Flahault A. and Dudoit S. (2005). Colon cancer prognosis prediction by gene expression profiling. Oncogene, 24: 6155-6164.
Keywords: machine learning, gene expression, class prediction, leukemia
Poster J08
Functional symbiont-host relationship between Epstein-Barr virus and Homo sapiens
Aymeric Fouquier d'Herouel, Maria Werner
Royal Institute of Technology - Computational Biology - AlbaNova University Center - 10691 Stockholm, Sweden
We present a glimpse into the genetic entanglement between the Epstein-Barr virus and its human hosts. This is exemplified by a bioinformatic study of the viral EBNA-1 protein’s binding sites on the human genome using affinity matrices generated by standard methods and an enhanced biophysical inference approach. Constructing a map of putative binding sites, we further compare this against available expression data profiles to identify promoters of specific interest in cancer development. We finally discuss experimental methods we intend to apply to test our predictions.
Epstein-Barr virus (EBV) is a human herpesvirus that infects more than 90% of the human population and thereafter persists in the host for life. Infection can be asymptomatic, but EBV is also strongly associated with various types of cancer, such as Burkitt's lymphoma and Hodgkin's disease. This increased tumor risk is most likely due to the viral strategy for survival and spread, in which the virus transiently induces proliferation of latently infected B-lymphocytes, resulting in an enlarged pool of infected cells. Induction of proliferation depends on the switch between different viral latency programs in the cell, which drives the transition between active cell cycling and resting. This induction mechanism depends on activation/silencing of the viral C promoter (Cp), which controls expression of key viral proteins. Of specific interest is the Epstein-Barr nuclear antigen 1 protein (EBNA-1), which is responsible for viral replication and episome partitioning and also functions as a transcription factor. EBNA-1 activates its own production from Cp by binding to an upstream enhancer called the Family of Repeats (FR). FR consists of multiple EBNA-1 and octamer binding sites in an alternating pattern. The octamer sites in FR have been shown to bind human transcription factors from the POU and Groucho/TLE families, suggesting an intricate interplay between the virus and the host in controlling the viral latent states. The activity of Cp is repressed upon binding by Oct-2, a member of the POU family. Here we present a bioinformatic study of how EBNA-1 might interact with the human genome, illustrating the genetic entanglement between the virus and its host. Functional binding sites for EBNA-1 have been studied in mutational assays [Ambinder91, Zou06] and permit the construction of affinity matrices for the factor's binding to DNA.
We use both common bioinformatic approaches and enhanced biophysical inference methods to construct such matrices, and evaluate them on the complete human genome (NCBI36 49.36k). The resulting map of putative EBNA-1 binding sites is analyzed for possible feedback pathways down-regulating the expression of Oct-2. Further, we compare the map against available expression data profiles to identify promoters of specific interest in cancer development and discuss experimental methods we intend to apply to test our predictions.
Keywords: transcription factors, EBV, cancer
Poster J09
Extensive analysis of human blood group antigen genes suggests non-neutral evolution at multiple loci
Fumagalli M. (1,2), Pozzoli U. (1), Menozzi G. (1), Cereda M. (1), Cagliani R. (1), Comi G.P. (3), Bresolin N. (1,3), Sironi M. (1)
(1) Scientific Institute IRCCS E. Medea, Bioinformatic Lab, Via don L. Monza 20, 23842 Bosisio Parini (LC), Italy, (2) Bioengineering Department, Politecnico di Milano, P.zza L. da Vinci, 32, 20133 Milan, Italy, (3) Dino Ferrari Centre, Department of Neurological Sciences, University of Milan, IRCCS Ospedale Maggiore Policlinico, Mangiagalli and Regina Elena Foundation, Via F. Sforza 35, 20100 Milan, Italy
Historically, allelic variations in blood group antigen (BGA) genes have been regarded as susceptibility traits for infectious diseases. Since host-pathogen interactions are major determinants in evolution, BGAs can be thought of as selection targets. We show that 6 BGA genes have been subjected to balancing selection. Moreover, allele frequencies of 12 BGA loci correlate with pathogen richness calculated for the geographic locations of 52 human populations, indicating pathogen-driven selection. Therefore, BGAs have played a central role in the host-pathogen arms race during human evolution.
Background Since the discovery of the ABO blood group in 1900, 29 blood group (BG) systems have been identified in humans. Each system is specified by a blood group antigen (BGA), a protein or carbohydrate molecule that is expressed on the erythrocyte membrane and is polymorphic in human populations. BGA genes belong to different functional categories including receptors, transporters, channels, adhesion molecules and enzymes. BGA polymorphisms have attracted considerable attention over recent years because variations in BGAs might underlie differences in susceptibility to diseases. Given this premise, and given that host-pathogen interactions are major determinants in evolution, BGAs can be thought of as possible targets of diverse selective pressures. This view agrees with the geographic differentiation pattern observed for BGAs and with previous reports of non-neutral evolution at the ABO, DARC, GYPA and FUT2 loci. In this work we exploited the availability of extensive resequencing data, as well as of SNP genotyping in world-wide populations, to investigate the evolutionary forces acting on BGA genes. Methods To study the evolutionary forces shaping variability in BGAs, we exploited the fact that 22 out of 38 loci involved in BGA specification have been included in the SeattleSNPs program, so that resequencing data in at least two populations (one Caucasian and one African) are available. Where appropriate, we resequenced gene regions in additional samples from African American and East Asian populations. Selective processes leave a signature that can be identified through the application of population genetics statistics. Widely used tests analyze features such as nucleotide diversity, allele frequency spectra, population genetic differentiation, and excess or loss of polymorphisms.
Statistical significance was assessed both by performing coalescent simulations and by comparison with a control data set of 238 genes resequenced by the NIEHS program. For all BGAs, nucleotide diversity parameters and summary statistics were calculated both over the entire gene and in overlapping windows, and we looked for gene regions showing unusual features to select for further analyses. We next wished to verify whether allele frequencies of SNPs in BGA genes varied with pathogen richness, measured as the number of different species in a geographic location. To this aim we exploited the fact that a set of over 650,000 tag SNPs has been typed in 52 populations (HGDP-CEPH panel) distributed world-wide. As for pathogen richness, we gathered information concerning the number of different micro-pathogen species from the Gideon database. A total of 264 SNPs in BGA genes had been typed in the HGDP-CEPH panel, allowing analysis of 26 loci. Association was assessed by Kendall's rank correlation coefficient with Bonferroni correction for multiple tests. For each significantly associated BGA SNP, we retrieved all SNPs in the full dataset having an overall allele frequency differing by less than 0.001; a total of 287,430 SNPs constituted the control set. Results We obtained strong evidence of balancing selection for the CD55, CD151, SLC14A1, GYPC and BSG genes; moreover, we identified an extended gene region immediately upstream of the transcription start site of FUT2. In particular, we show that in at least one human population these loci have been subjected to balancing selection, a situation whereby nucleotide variability is maintained over time, leading to the presence of two or more common haplotypes separated by deep branches. These loci were characterized by an excess of nucleotide diversity and polymorphism and, in some cases, by a significant reduction of genetic differentiation between populations.
Furthermore, times to the most recent common ancestor calculated for all these regions range from 2 to 4 million years ago, suggesting an ancient origin of balancing selection. We next verified that 28 BGA gene SNPs were significantly associated with pathogen richness. Since factors other than selective forces are expected to affect allele frequency spectra across populations, we compared the strength of BGA gene SNP correlations with that of a set of control SNPs in the dataset; all correlated SNPs ranked above the 95th percentile of this distribution. In addition, none of the associated SNPs correlated significantly with mean temperature or maximum precipitation rate. The strongest correlation was obtained for rs900971 in SLC14A1. Conclusions Haldane's hypothesis posits that infectious diseases have been a major threat to human populations and have therefore exerted strong selective pressures throughout human history; furthermore, he suggested that antigens constituted of protein-carbohydrate molecules possibly play a role in resistance/predisposition to pathogen infection. These ideas seem to fit BGA genes well, as shown by both this study and previous reports. In this scenario, it is not surprising that BGA genes have been the target of selective pressures and that associations between pathogen richness and BGA alleles can be identified. Indeed, here we show that 6 BGA genes have been subjected to balancing selection (the underlying selective pressure possibly being an infectious agent) and that pathogen richness has shaped allele frequencies in 12 genes. These data, together with previous reports, indicate that BGAs played a central role in the host-pathogen arms race during human evolutionary history.
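The association test named in the Methods (Kendall's rank correlation with Bonferroni correction) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it computes tau-a with a normal approximation that ignores ties, the function names are hypothetical, the data are invented, and taking the 264 typed SNPs as the number of tests is a guess at how the correction was applied.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a and a two-sided p-value (normal approximation).

    Tied pairs count as neither concordant nor discordant, and the
    variance formula assumes no ties, so this is only an approximation.
    """
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    variance = 2 * (2 * n + 5) / (9 * n * (n - 1))
    z = tau / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return tau, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical pathogen richness and allele frequency in 8 populations.
richness = [10, 25, 40, 55, 70, 90, 120, 150]
allele_freq = [0.05, 0.10, 0.12, 0.20, 0.25, 0.33, 0.41, 0.50]
tau, p_raw = kendall_tau(richness, allele_freq)
p_adjusted = bonferroni(p_raw, n_tests=264)  # assumed number of tests
```

With the 52 populations of the HGDP-CEPH panel the same functions apply unchanged; a tie-aware statistic (tau-b) would be preferable when many populations share the same richness value.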
Poster J10
Comparative QSAR Modeling of Anti-tubercular Activity of Nitrofuranyl Amide Derivatives
Payel Ghosh, Dr. Manish C. Bagchi
Indian Institute of Chemical Biology, 4, Raja S. C. Mullick Road, Jadavpur, Kolkata 700032, India
The anti-tubercular activity of 47 nitrofuranyl amides was analyzed employing various linear regression methods as well as a non-linear counter-propagation neural network method to provide a reliable 2D-QSAR model from a set of more than 900 descriptors. Concurrently, the steric and electrostatic interactions between a probe atom "CH3+" and a set of aligned molecules were evaluated using Molecular Field Analysis to develop a corresponding 3D-QSAR model. The study leads to a better understanding of the pharmacological properties these compounds require to be applied as potent drugs.
The present QSAR study attempts to explore the structural and physicochemical requirements of a series of 47 biologically active second-generation nitrofuranyl amides with substitution at cyclic secondary amines [Tangallapally et al., J. Med. Chem. (2005) and Bioorg. & Med. Chem. Letters (2006)]. We built 2D- and 3D-QSAR models to better understand the observed Biological Activity (BA) of this series of nitrofuranyl derivatives. The raw minimum inhibitory concentration (MIC) data in μg/mL did not satisfy the statistical requirement of normally distributed values. Therefore, a logarithmic transformation of the data [p(MIC), with MIC in molar concentration] was used, which resulted in a normal distribution curve. The present investigation provides further evidence that a combination of these two different approaches (2D- and 3D-QSAR) leads to pertinent QSAR models. In addition, such a joint treatment should help provide robust predictions. PreADMET software was used to calculate a set of nearly 900 descriptors. Quantitative structure-activity relationship (QSAR) models based on calculated physicochemical, topological and other groups of descriptors were used extensively in the present study to predict the biological activity of these derivatives. We developed various regression models, such as ridge regression, stepwise regression and partial least squares, for activity prediction. We also applied more recently explored genetic (genetic function approximation) and machine learning (k-nearest neighbor and neural network) methods. The models were validated using training and test sets. Results based on the leave-one-out (LOO) principle are discussed for the counter-propagation neural network analysis. Finally, the relative effectiveness of the molecular descriptors in these models is compared and discussed.
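The logarithmic transformation of the MIC values can be written out as a short Python sketch. The abstract does not give the formula, so this assumes the usual convention that p(MIC) is the negative base-10 logarithm of the MIC expressed in mol/L; the function name `p_mic` and the example compound values are hypothetical.

```python
import math


def p_mic(mic_ug_per_ml, mol_weight_g_per_mol):
    """Convert an MIC in ug/mL to p(MIC) = -log10(MIC in mol/L).

    1 ug/mL equals 1e-3 g/L, so dividing by the molecular weight
    (g/mol) gives the molar concentration.
    """
    if mic_ug_per_ml <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("MIC and molecular weight must be positive")
    molar = mic_ug_per_ml * 1e-3 / mol_weight_g_per_mol
    return -math.log10(molar)


# Hypothetical compound: MIC of 0.5 ug/mL, molecular weight 350 g/mol.
example = p_mic(mic_ug_per_ml=0.5, mol_weight_g_per_mol=350.0)  # about 5.85
```

On this scale more potent compounds (lower MIC) receive larger values, and a tenfold change in MIC corresponds to one unit of p(MIC).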
The VLifeMDS (version 3.5) software was used to develop a corresponding 3D-QSAR model. The three-dimensional conversion and pre-optimization of the molecules were performed using the molecular mechanics force field (MMFF) of this software. The final geometry optimization of the molecules was carried out using the semi-empirical AM1 parameterization. The molecules were then aligned with respect to their ring structure. The next step was to build a 3D grid around the set of superimposed molecules using a 2 Å grid constant. The steric and electrostatic interactions between a probe atom (CH3+) and the set of aligned molecules were assessed using the comparative molecular field analysis method. Thus, the interaction energies between the probe atom and the aligned molecules were calculated at each grid point using the default cut-offs of 10 kcal/mol and 30 kcal/mol for the electrostatic and steric fields, respectively. In the next step, some or all of the grid data points can be used as descriptors for generating 3D-QSARs and analyzing structure-activity relationships. The Partial Least Squares (PLS) method was used to obtain the 3D-QSAR model. The first step in this method was to calculate the principal components and the second was to obtain the model and its statistical parameters. The results from the 2D- and 3D-QSAR analyses show that the anti-tubercular activity of the studied series of nitrofuranyl amides is strongly dependent on electrostatic interactions.
Keywords: Anti-TB Drug, QSAR, PLS, Counter Propagation ANN
Poster J11
Understanding drug Mode-of-Action: Mining the cMap dataset
Francesco Iorio (1), Roberto Tagliaferri (2), Diego di Bernardo (1)
(1) Telethon Institute of Genetics and Medicine (TIGEM), (2) Dept. of Mathematics and Computer Science - University of Salerno
We developed a tool to identify the mode-of-action (MOA) of new drugs. Starting from microarray data related to drugs, we identify those whose MOA is similar to that of a new drug. We represent each drug as a point in a space in which distances reflect MOA similarity. In order to classify the MOA of a new drug, we embed a gene expression profile, obtained following treatment with the drug, in the space and analyse its neighbors. Several metrics were assessed and the amount of prior knowledge needed was evaluated. Moreover, we developed a novel measure combining sample correlation and ANOVA.
Identifying the pathways that mediate a drug's mode of action is a key challenge in biomedicine. We demonstrated that, using gene expression profiles in yeast, it is possible to detect the mode of action of a drug candidate [di Bernardo et al., Nature Biotechnology, 2005]. Recently, Lamb et al. developed a large public database of expression signatures of drugs and genes, called the Connectivity Map (cMap). Starting from this compendium of gene expression profiles, obtained by treating different cell lines with more than 150 different compounds, our aim was to analyze the possibility of building a “Mode-of-Action Mapping” among these compounds using as little prior knowledge as possible. This mapping allows the mode of action of new drugs to be identified by “similarity” to known drugs in the cMap dataset. In addition, it can be used to visualize compounds in a 3D map. In order to compute a robust “similarity” score, we investigated the use of a new metric in which the Sample Correlation Distance and the Analysis of Variance p-values were combined. The new metric space was built using Euclidean distances and dimensionality reduction techniques. Despite the complexity and variety of the experimental conditions, this new metric is able to identify similar drugs from their expression profiles, even if the drug effect on expression is very small compared to the “noise” due to differences in the tissues and platforms used for expression analysis. We validated the new metric by means of Receiver Operating Characteristic analysis. For each genome-wide expression profile in response to a drug treatment, we examined its k nearest neighbors in the mapping. We checked, among the neighbors, for the presence of other profiles obtained by treating different cell lines with the same drug (True Positives, TP). We obtained a precision (TP/k) varying from 40% to 62%.
Our unsupervised approach provides a precision close to the one that can be obtained using a supervised approach, which assumes perfect knowledge of the compounds in the dataset (precision = 84%). Our results provide a new framework for robust analysis of drug-induced expression profiles. For future work, we plan to move from this set of classical multivariate data analysis tools to a modeling approach that takes into account the interactions occurring between genes (modeled as systems of Ordinary Differential Equations).
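The neighbor-based validation (precision = TP/k among the k nearest profiles) can be illustrated with a minimal Python sketch. It uses the plain sample correlation distance only, not the combined correlation/ANOVA metric of the poster, whose exact form is not given in the abstract. The profile names, drug labels and expression values are hypothetical.

```python
import math


def correlation_distance(a, b):
    """1 minus the Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def precision_at_k(query, profiles, k):
    """Fraction of the k nearest profiles treated with the query's drug.

    `profiles` maps a profile name to a (drug, expression_vector) pair.
    """
    query_drug, query_vec = profiles[query]
    neighbors = [
        (correlation_distance(query_vec, vec), drug)
        for name, (drug, vec) in profiles.items()
        if name != query
    ]
    neighbors.sort(key=lambda pair: pair[0])
    true_positives = sum(1 for _, drug in neighbors[:k] if drug == query_drug)
    return true_positives / k


# Hypothetical profiles: two drugs, each tested on two cell lines.
profiles = {
    "A_cell1": ("drugA", [1.0, 2.0, 3.0, 4.0]),
    "A_cell2": ("drugA", [1.1, 2.1, 2.9, 4.2]),
    "B_cell1": ("drugB", [4.0, 3.0, 2.0, 1.0]),
    "B_cell2": ("drugB", [3.9, 3.2, 2.1, 0.8]),
}
precision_k1 = precision_at_k("A_cell1", profiles, k=1)
precision_k3 = precision_at_k("A_cell1", profiles, k=3)
```

Averaging this quantity over all query profiles would give a dataset-level precision comparable in spirit to the 40% to 62% range reported above.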
Keywords: drug, unsupervised classification, mode of action, connectivity map
Poster J12
Predicting response of HIV patients using SNP-drug interactions
Marek D. (1,2), Tarr P. (3), Telenti A. (4,3), Beckmann J. (5,1), Bergmann S. (1,2)
(1) Department of Medical Genetics, University of Lausanne , (2) Swiss Institute of Bioinformatics, (3) Infectious Diseases Service, CHUV, Lausanne, (4) Institute of Microbiology, University of Lausanne, (5) Medical Genetics Service, CHUV, Lausanne
Understanding why some HIV patients under antiretroviral therapy develop dyslipidemia is a typical goal of pharmacogenetics. We analyzed a dataset from the Swiss HIV cohort containing 6183 measurements of lipid levels from 438 patients who underwent different treatments, as well as their genotypes for several SNPs in genes involved in lipid transport and metabolism. Our two-stage approach provides an efficient method to select relevant SNP-drug interactions. It generates a robust framework for predicting the lipid responses of new patients based on their genotype and a choice of treatment.
Understanding why some HIV patients treated with antiretroviral therapy develop adverse side effects, such as dyslipidemia, is one of the goals of pharmacogenetics. We analyzed a dataset from the Swiss HIV cohort containing about 6100 measurements of lipid levels from 438 patients who underwent different treatments (up to 5 out of 15 drugs), as well as their genotypes for 16 SNPs in genes involved in lipid transport and metabolism. Using a linear regression model that takes the genotypes of the SNPs and the usage of the drugs as independent features, we were able to explain about 16% of the variance of the triglyceride levels (corrected for sex and age). We then extended our model to also contain SNP-drug interaction terms. In order to avoid overfitting, we only included a fraction of the 240 possible interactions (16 SNPs × 15 drugs). To this end, we first tested 240 minimal models, each allowing for only a single SNP-drug interaction. We then included in the full model only those interactions whose significance was above a threshold. Interestingly, the addition of only 40 SNP-drug interactions raised the variance explained by the model to 22%. This value was never reached by a random pick of 40 interaction terms. These findings reveal that, while the independent effects of the SNPs and the drugs explain part of the changes observed in lipid levels, interactions between some SNPs and drugs significantly improve the fit to the data. Our two-stage approach provides a simple yet efficient method to select relevant SNP-drug interactions and can be extended to integrate other types of interactions (e.g. SNP-SNP or drug-drug). It generates a robust framework for predicting the lipid responses of new patients based on their genotype and a choice of treatment. Methods: Using the HIV cohort data, we aimed to show how the drug response (lipid response, LR) could be explained by a set of features: SNP genotypes (G), treatments (T) and, in particular, interactions between a SNP and a drug (GT).
We developed a two-stage multiple regression approach. First, we ran one regression per SNP-drug pair, estimating the coefficients for the SNP, the drug and their interaction term. Second, we computed a score for the regression coefficients corresponding to the bilinear terms encoding SNP-drug interactions. These scores reflect the distance of a coefficient from 0 in units of half the width of its confidence interval. We then performed a global regression, using all 16 SNPs and all 15 drugs as linear terms, as well as the 40 SNP-drug interactions that received the highest scores in the first stage. We used the proportion of variance explained (R-squared) by the global model as a measure of its performance. We compared R-squared and ROC curves with those of control models that use the same linear terms and the same number of interaction terms, picked at random.
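The scoring and selection step of the two-stage approach can be sketched as follows. This is a minimal Python illustration, assuming the score is the absolute coefficient divided by half the width of its confidence interval, which is one reading of the description above. The function names, SNP and drug labels, and all coefficient values are hypothetical; the first-stage regressions themselves are not shown.

```python
def interaction_score(coef, ci_low, ci_high):
    """Distance of a coefficient from 0 in units of half its CI width.

    For an interval symmetric around the estimate, a score above 1
    means the confidence interval excludes zero.
    """
    half_width = (ci_high - ci_low) / 2.0
    return abs(coef) / half_width


def select_top_interactions(scores, n_keep):
    """Return the n_keep (SNP, drug) pairs with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


# 16 SNPs and 15 drugs give 16 * 15 = 240 candidate interaction terms.
n_candidates = 16 * 15

# Hypothetical first-stage estimates: (coefficient, CI low, CI high).
first_stage = {
    ("snp_1", "drug_A"): (0.60, 0.20, 1.00),
    ("snp_1", "drug_B"): (0.05, -0.35, 0.45),
    ("snp_2", "drug_A"): (-0.90, -1.20, -0.60),
}
scores = {pair: interaction_score(*est) for pair, est in first_stage.items()}
selected = select_top_interactions(scores, n_keep=2)
```

In the study the top 40 of the 240 candidates were retained; the same call with `n_keep=40` on a full score table would reproduce that selection rule under the assumptions stated here.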
Keywords: Pharmacogenetics, Model selection, Linear regression, Data fitting, HIV
Poster J13
GenSense: An "end to end" genome wide association study platform
Kalaitzopoulos D. (2), Van der Hall P. (1), Pescatori M. (1), Van den Hout M. (1), Munro R. (2), Verkerk A. (1), Van der Spek (1), Stubbs A. (1)
(1) Dept. of Bioinformatics, Erasmus University Medical Center, Dr Molewaterplein 50, 3000 CA Rotterdam, NL, (2) InforSense Ltd, Colet Court, 100 Hammersmith Road, London, W6 7JP, UK
As genome-wide association (GWA) studies become more commonplace, analytic software must be highly flexible to enable ad hoc integration of many different types of data and algorithms and to compare outputs from different approaches. The Department of Bioinformatics (ErasmusMC), in collaboration with InforSense Ltd, has implemented a flexible, scalable and robust statistical pipeline to analyse GWA study data. The 9p21 association with coronary artery disease phenotype from the WTCCC GWA study was successfully replicated using this GenSense GWA pipeline.
Motivation The main goal of Genome Wide Association (GWA) studies is to find genetic variants, such as single nucleotide polymorphisms (SNPs), that are associated with case or control status. As GWA studies become more commonplace, analytic software must be highly flexible to enable ad hoc integration of many different types of data and algorithms and to compare outputs from different approaches. Methodology The Center for Bioinformatics (Erasmus Medical Center), in collaboration with InforSense Ltd, has used the GenSense architecture to implement a flexible, scalable and robust statistical pipeline to analyse GWA study data. GenSense (by InforSense Ltd) is a scalable analytical middleware designed to help researchers understand complex analyses, quickly identify correlated SNPs and create interactive visualizations of large datasets from the latest generation of genotyping platforms. A replication of the case-control study for coronary artery disease (CAD), using data obtained from the Wellcome Trust Case Control Consortium (WTCCC), was performed in order to demonstrate the robustness and accuracy of GenSense in analysing GWA data. We used approximately 3,000 control samples and 2,000 CAD samples, with permission from the WTCCC, in our validation experiments. These data were either imported as WTCCC genotyped data or imported and genotyped using BRLMM into GenSense, in a proprietary data format designed to run the GWA study in memory. The system performance and scalability of the internal data representation were tested with sample data from the Illumina HumanHap650Y Genotyping BeadChip containing 655,352 SNPs and benchmarked with 159 to 2544 samples. Results The 9p21 association with the CAD phenotype from the WTCCC GWA study was successfully replicated using this GenSense GWA pipeline. Based on our testing, the core statistical algorithms scale linearly with the number of samples.
Summary The Erasmus MC GenSense GWA pipeline successfully replicated the 9p21 association with CAD originally found by the WTCCC. This replication study demonstrates the robustness and accuracy of GenSense in analyzing GWA data. The open architecture of this platform allows for the rapid integration of existing workflow components (e.g. R genetics, PLINK) into this stable workflow enactment framework. Key benefits of this platform are: (i) fast, high-throughput processing of millions of SNPs across thousands of samples; (ii) methods for quality control, analysis and annotation of Affymetrix and Illumina genotyping arrays; (iii) analysis reports with graphical summaries.
Keywords: genome wide association study, coronary artery disease
Poster J14
Gene-Environment iNteraction Simulator (GENS) for assessing the power of feature selection methods in complex diseases
Amato R. (1,4), D'Andrea D. (1,4), Miele G. (1,4), Nicodemi M. (1,5), Pinelli M. (2,4), Raiconi G. (3), Tagliaferri R. (3), Cocozza S. (2,4)
(1) Dipartimento di Scienze Fisiche, Università di Napoli "Federico II", (2) Dipartimento di Biologia e Patologia Cellulare e Molecolare "L. Califano", Università di Napoli "Federico II", (3) Dipartimento di Matematica ed Informatica, Università di Salerno , (4) Complex Disease Genetics Unit (CDGU), Università di Napoli "Federico II", (5) Complexity Science Center & Dept. of Physics, University of Warwick, UK
The etiologies of complex diseases are highly involved, with disease susceptibility likely influenced by multiple genes of small relative effect as well as by environmental factors. Unfortunately, these complexities explain the modest success of existing statistical methods in dissecting the gene-environment interactions underlying common diseases. To address this, we have implemented a workbench in which several feature selection methods, such as MDR, LDA and Stepwise Logistic Regression, are challenged against case/control populations generated by a biologically realistic simulator (GENS).
In the past two decades, many genes implicated in monogenic diseases have been identified by using genetic linkage and positional cloning methods. Although these methods have been remarkably successful, they have not been capable in identifying genes that are involved in the complex forms of disease. This failure could also be ascribed to the aetiology of most common diseases, that involves not only discrete genetic and environmental causes, but also interactions between the two. Although the concept of gene-environment interaction is central for ecogenetics, and has been recognized by geneticists since a very long time, studies in this area have primarily examined the relationship between genetic factors and traits, without considering environmental determinants. The study of gene-environment interactions could be useful for several reasons. First, if we only estimate the separate contributions of genes and environment, ignoring their interactions, we will incorrectly estimate the proportion of the disease (the population attributable risk) that is explained by genes, environment, and their combined effect. Second, the identification of gene-environment interactions provides direct evidence that biological pathways involved are relevant to specific traits allowing further focused researches. Third, understanding gene-environment interactions might allow us to give tailored preventive advice before disease diagnosis, and moreover to offer personalized treatment after the disease has been diagnosed. Ultimately, from an epidemiological point of view, not considering the gene-environment interaction could miss high risk interaction. In fact, it has been shown that also if each of two factors has a low effect on the disease risk for itself, taken together they can result in high risk interaction. 
Thus, a low relative risk for single genetic marker does not imply the irrelevance of the genetic marker, since it could be involved in an interaction with an environmental trigger. Despite a lot of information have been collected about both genetic and environmental risk factors, there are relatively few examples of gene-environment interaction in epidemiological literature. The main reason is that the majority of the studies have been designed to examine the main effect of single factors instead of examining the interactions. This is mainly due to the limitations of statistical methods which would require very large case-control data set to identify gene-gene and gene-environment interactions. Presently, the statistical methods mainly used to analyze factors interactions are the Stepwise Logistic Regression, Discriminant Analysis, and MDR, and most of them have not been specifically designed for this purpose. For this reason we expect that not all methods are equally sensitive to detect the whole phenomenon. Some of them, in fact, could be more prone to reveal additive behavior though could not detect epistatic or complex interactions, whereas others could be good to detect complex interactions, but fail to point out simple single factor effect. For this reason it would be important to determine the power of each specific method in this field of application. Hence, to accomplish this purpose one would need a considerably large number of trial data sets to test the methods response, for example against benchmarks or synthetic populations. We present an integrated framework designed to challenge different feature selection methods against simulated case/control population provided by a biologically inspired simulator. In particular, Gene-Environment iNteraction Simulator (GENS) is a software generating synthetic populations for case-control studies where the gene-environment interaction causing the disease is perfectly known. 
In these populations, individuals are marked by a set of environmental and genetic factors. However, only one gene and one environmental exposure will be implied in the disease occurrence while the other behave as confusing background. One can use GENS to evaluate the statistical power of each method of analysis. This is performed by generating case/control populations having the same underlying interaction model, but with different strength of the interaction and/or different size. This means the production of a wide set of artificial populations obtained by varying several epidemiological parameters in realistic ranges. Upon all these simulated populations, a feature selection analysis is automatically performed in order to characterize each method in terms of probability to select the correct feature involved in the simulated disease. It is worth stressing that this software has been tailored for the biomedical community, hence the main effort designing it was to respect the standard epidemiological biomedical parameters.
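The evaluation loop described above can be illustrated with a minimal sketch. It is not GENS itself: the logistic risk model, the parameter names (`baseline`, `beta_ge`), the 30% carrier and exposure frequencies, the number of background factors and the toy selection step (picking the gene/exposure pair whose joint frequency differs most between cases and controls) are all assumptions made for illustration.

```python
import math
import random


def draw_individual(rng, n_genes, n_envs, baseline, beta_ge):
    """Draw one individual; only gene 0 and exposure 0 interact to raise risk."""
    genes = [rng.random() < 0.3 for _ in range(n_genes)]  # risk-allele carrier (assumed 30%)
    envs = [rng.random() < 0.3 for _ in range(n_envs)]    # exposed to the factor (assumed 30%)
    interaction = 1.0 if (genes[0] and envs[0]) else 0.0
    risk = 1.0 / (1.0 + math.exp(-(baseline + beta_ge * interaction)))
    return genes, envs, rng.random() < risk


def sample_case_control(rng, n_cases, n_controls, **model):
    """Draw individuals until the requested numbers of cases and controls are reached."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        genes, envs, diseased = draw_individual(rng, **model)
        group, limit = (cases, n_cases) if diseased else (controls, n_controls)
        if len(group) < limit:
            group.append((genes, envs))
    return cases, controls


def select_pair(cases, controls, n_genes, n_envs):
    """Toy feature selection: the gene/exposure pair whose joint frequency
    differs most between cases and controls."""
    def joint_freq(group, g, e):
        return sum(1 for genes, envs in group if genes[g] and envs[e]) / len(group)

    pairs = [(g, e) for g in range(n_genes) for e in range(n_envs)]
    return max(pairs, key=lambda p: joint_freq(cases, *p) - joint_freq(controls, *p))


def estimate_power(beta_ge, n_cases, n_controls, replicates=20, seed=0,
                   n_genes=5, n_envs=5, baseline=-3.0):
    """Fraction of simulated studies in which the true pair (0, 0) is selected."""
    rng = random.Random(seed)
    model = dict(n_genes=n_genes, n_envs=n_envs, baseline=baseline, beta_ge=beta_ge)
    hits = 0
    for _ in range(replicates):
        cases, controls = sample_case_control(rng, n_cases, n_controls, **model)
        if select_pair(cases, controls, n_genes, n_envs) == (0, 0):
            hits += 1
    return hits / replicates
```

Calling `estimate_power` over a grid of `beta_ge` values and sample sizes gives a power curve for the toy method; a real comparison would replace `select_pair` with MDR, LDA or stepwise logistic regression.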
Keywords: Complex disease, feature selection, simulation
Poster J15
A proper machine learning approach to validate the HME Classification
Mordenti M. (1), Ferrari E. (2), Locatelli M. (1), Pedrini E. (1), Muselli M. (2), Sangiorgi L. (1)
(1) Genetic Unit - Rizzoli Orthopaedic Institute, Via di Barbiano 1/10, Bologna, 40136, Italy, (2) Institute of Electronics, Computer and Telecommunication Engineering, Italian National Research Council, Via De Marini 6, Genoa, 16149, Italy
The aim of the study is to validate the innovative classification for HME patients proposed by the Genetic Unit of the Rizzoli Orthopaedic Institute. This classification assigns the 233 patients considered to 3 classes and takes into account both the sites affected by exostoses, deformities and functional limitations and the molecular screening of the EXT1 and EXT2 genes, which are directly involved in this disease. The collected data have been analyzed with a Switching Neural Network, a novel connectionist model whose training is based on the synthesis of positive Boolean functions.
Since March 2003, the Genetic Day Clinic of the Rizzoli Orthopaedic Institute has evaluated 384 patients affected by Hereditary Multiple Exostoses (HME). HME is a rare autosomal dominant disease characterized by the presence of multiple osteochondromas (exostoses), whose number and effects can vary significantly between and within families. An exostosis is a bony neoplasm capped with a cartilaginous surface. Each exostosis develops and increases in size during childhood and stops growing at puberty. The most frequently affected sites are the metaphyseal regions of the long bones; however, osteochondromas also occur in flat bones. In a small proportion of patients (<5%), an exostosis can undergo malignant transformation to peripheral chondrosarcoma during adulthood. The EXT1 (8q24.1) and EXT2 (11p12) genes were identified as directly involved in Multiple Osteochondromas, and 86% of patients have a mutation in one of the two genes. A diagnostic pathway composed of two defined steps has been established. The first is a clinical step, consisting of an orthopaedic examination (usually an adult patient is followed up once a year and a child twice) and an X-ray evaluation. The second is a genetic analysis, performed by DHPLC and direct sequencing. Both steps are focused on the same target: the genotype-phenotype correlation. To perform such a study, an innovative clinical classification of HME cases, developed in collaboration with the Orthopaedic Unit, has been proposed.
Class I - No deformities – No functional limitations
IA: ≤ 5 sites with exostoses
IB: > 5 sites with exostoses
Class II - Deformities – No functional limitations
IIA: ≤ 5 sites with deformities
IIB: > 5 sites with deformities
Class III - Deformities – Functional limitations
IIIA: 1 site with functional limitation
IIIB: ≥1 site with functional limitations
All patients are divided into three clinical classes according to the number of affected sites and the presence or absence of deformities and functional limitations. The aim of the study is to test and validate the effectiveness of the suggested classification, for which an innovative machine learning approach has been adopted. In particular, the collected clinical data have been used to generate a set of rules that assign each patient to the corresponding class. The dataset S considered for this purpose was composed of 233 patients, selected from the initial 384 for the completeness of their data. 67 patients belong to class I, 120 to class II and 46 to class III. These patients have been characterized by 57 variables (50 of which are Boolean), defined after two preliminary analyses (150 variables for the first and 97 for the second). The variables relate to the site involved (localization, side, etc.), the severity (type of limitations, pain, etc.), the genetic mutation (gene, type, etc.) and other relevant quantities (age, weight, etc.). The dataset has been analyzed using the Switching Neural Network (SNN) approach. SNN is an innovative connectionist model based on the synthesis of positive Boolean functions. According to this model, input variables are mapped into a Boolean domain and a positive Boolean function is built for each class, starting from the converted training set.
The mapping into the Boolean domain is obtained by determining a finite partition for each continuous input (discretization) and by applying a suitable binary coding (inverse only-one coding) that preserves ordering and distance. To keep the computational burden low while maintaining a good level of accuracy, the synthesis of the positive Boolean function, which is the kernel of the method, can be performed with one of two techniques: the Shadow Clustering (SC) algorithm, which performs a heuristic search in the implicant space, and the Switch Programming (SP) method, which is based on the solution of a suitable integer linear programming problem. Preliminary results show that on the HME data SP achieves better accuracy than SC; it will therefore be used to obtain a consistent SNN. An interesting feature of the SNN model is that it can easily be transformed into a set of intelligible rules involving the input variables. Each rule can be characterized by a quantitative value measuring its relevance in the description of the training set. Moreover, a measure of relevance can be provided for each input variable, allowing the identification of fundamental and redundant attributes for the problem at hand. Two different analyses have been performed on the HME data. First, the examples have been analyzed through ten-fold cross-validation in order to verify the generalization ability of the resulting classifier. The average number of rules for the ten SNNs was 43.4 and the average accuracy was 82.07%; this is a very good result that supports the consistency of the proposed classification. Then, the whole dataset has been used to train a final SNN, whose set of rules made it possible to evaluate the statistical relevance of each input variable. In this way, the SNN identified the following three quantities as the most relevant: the limitation of hip extra-rotation, the Madelung deformity and the valgism of the ankle.
The reduction of hip extra-rotation is a very common limitation and is useful for defining class III; the same holds for the Madelung deformity, which is the most common deformity and defines class II. Valgism of the ankle is very frequent in all classes and is an important point to account for in the overall classification. Sex and familiarity are among the top ten variables, confirming literature data. It is interesting to note that the variables considered significant during clinical evaluation and those obtained by DHPLC and direct sequencing analysis are the same ones that the SNN method ranks highest.
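The clinical classification itself (Classes I to III with their A/B subclasses) can be written as a small rule function, which is the kind of target an intelligible rule set is expected to reproduce. This is a sketch under stated assumptions: the function and argument names are hypothetical, subclass IIIB is read as "more than one site with functional limitations" (the printed "≥1" overlaps IIIA), and a patient with functional limitations but no deformities is treated as outside the scheme.

```python
def hme_class(exostosis_sites, deformity_sites, limitation_sites):
    """Assign an HME clinical class from counts of affected sites.

    exostosis_sites:  number of sites with exostoses
    deformity_sites:  number of sites with deformities
    limitation_sites: number of sites with functional limitations
    """
    if limitation_sites > 0:
        if deformity_sites == 0:
            # Class III is defined as deformities plus functional limitations.
            raise ValueError("combination not covered by the classification")
        # Assumption: IIIA means exactly one site, IIIB more than one.
        return "IIIA" if limitation_sites == 1 else "IIIB"
    if deformity_sites > 0:
        return "IIA" if deformity_sites <= 5 else "IIB"
    return "IA" if exostosis_sites <= 5 else "IB"
```

For example, a patient with eight exostosis sites, two deformity sites and no functional limitation falls in class IIA.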
Keywords: Genotype-Phenotype correlation, Multiple Exostoses, Switching Neural Networks
Poster J16
Optimization of BRC peptide for the inhibition of the human Rad51 protein filament formation
Nomme J. (1,2), Renodon-Corniere A. (1,2), Tran V. (1,2), Takahashi M. (1,2)
(1) CNRS UNR 6204, (2) Université de Nantes
Our work focuses on the design of anticancer peptide drug substances that specifically target the activity of the human Rad51 protein, a protein involved in homologous recombination. Our strategy has been to design peptides that inhibit the polymerization of Rad51, which is essential for its activity. In designing these new molecules we have combined bioinformatic techniques, such as molecular modeling calculations (molecular mechanics, minimizations and graphics), with biochemical methods and in-vitro/in-vivo activity tests.
Our work focuses on the design of peptidic anticancer drugs directed against the Rad51 protein. Human Rad51 (HsRad51) plays a crucial role in homologous recombination by catalysing strand exchange between two DNA molecules of identical sequence. HsRad51 is thus involved in DNA repair and in DNA segregation during cell division, and is related to the proliferation of cancer cells and their resistance to radio- and chemotherapies. HsRad51 is therefore an interesting target for anti-cancer treatment. Pellegrini and colleagues performed crystallographic analyses of HsRad51 in complex with one of the 8 BRC motifs (BRC4) of the BRCA2 tumor suppressor and proposed that the peptide could interfere with filament formation by HsRad51 (Pellegrini et al. 2002). The peptide is, therefore, a potential inhibitor of HsRad51. We have shown experimentally that a 28-amino-acid peptide derived from the BRC4 motif interacts with the subunit-subunit interface of HsRad51 and prevents its filament formation on DNA, the first step of the strand exchange reaction. To improve its efficiency, we first used molecular modeling calculations to search for the residues in the peptide that are important for inhibition and to determine the minimum peptide size. By shortening the peptide and replacing some residues with alanine, we found that the phenylalanine residues at positions 1524 and 1546 are important for inhibition, and that the minimum native size is 23 amino acids. We then searched for the best amino acid sequence for HsRad51 inhibition. For this purpose we built a model of each BRC motif of BRCA2 in complex with HsRad51, based on the crystallographic structure of the HsRad51-BRC4 complex, and computed the binding energy of each residue in each motif. We then compared the results for all motifs and, at each position, chose the amino acid providing the best binding energy as the best amino acid at that position. The study proposed the substitution of 3 amino acids in the BRC4 peptide.
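The per-position selection step can be sketched as follows. This is only an illustration of the idea, not the modeling protocol: the motif names, residues and energy values are made up, the motifs are assumed to be already aligned position by position, and a lower (more negative) binding energy is assumed to be better.

```python
def consensus_by_energy(motifs):
    """Pick, at every aligned position, the residue with the lowest binding energy.

    motifs: dict mapping a motif name to a list of (residue, binding_energy)
            tuples, one tuple per aligned position.
    """
    lengths = {len(positions) for positions in motifs.values()}
    if len(lengths) != 1:
        raise ValueError("motifs must be aligned to the same length")
    best = []
    for column in zip(*motifs.values()):
        residue, _energy = min(column, key=lambda pair: pair[1])
        best.append(residue)
    return "".join(best)


def substitutions(reference, optimized):
    """List (position, old_residue, new_residue) where the two sequences differ."""
    return [(i + 1, old, new)
            for i, (old, new) in enumerate(zip(reference, optimized))
            if old != new]


# Hypothetical per-residue energies for two aligned three-residue motifs.
toy_motifs = {
    "motif_A": [("F", -3.1), ("T", -0.4), ("A", -1.0)],
    "motif_B": [("L", -1.2), ("S", -0.9), ("A", -1.1)],
}
optimized = consensus_by_energy(toy_motifs)
changes = substitutions("FTA", optimized)
```

Here `changes` lists the positions at which the optimized sequence departs from the reference motif, which corresponds to the set of proposed substitutions.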
The experimental analysis supported the model building: the modification improves the efficiency about 10-fold.
References
Nomme J., Takizawa Y., Martinez S., Renodon-Corniere A., Fleury F., Weigel P., Yamamoto K.-i., Kurumizaka H. & Takahashi M. 2007. Inhibition of filament formation of human Rad51 protein by a small peptide derived from the BRC motif of the BRCA2 protein. Genes Cells (in press).
Pellegrini L., Yu D.S., Lo T., Anand S., Lee M., Blundell T.L. & Venkitaraman A.R. 2002. Insights into DNA recombination from the structure of a RAD51-BRCA2 complex. Nature 420(6913):287-293.
Corresponding author: julian.nomme@etu.univ-nantes.fr
Keywords: BRCA2, BRC motif, Filament formation, Rad51 protein, Recombinational repair, Peptide optimization
Poster J17
Lung cancer DSA: A platform for discovery of biomarkers in Lung cancer
Gavin R. Oliver (1), Austin Tanney (1), Vadim Farztdinov (1), Richard D. Kennedy (1), Jude M. Mulligan (1), Ciaran E. Fulton (1), Susan M. Farragher (1), John K. Field (2), Patrick G. Johnston (3), D. Paul Harkin (1), Vitali Proutski (1), Karl A. Mulligan (1)
(1) Almac Diagnostics, (2) Roy Castle Lung Cancer Research Programme, The University of Liverpool Cancer Research Centre, (3) Centre for Cancer Research and Cell Biology, Queen's University of Belfast
Non-small cell lung cancer (NSCLC) is the leading cause of cancer mortality. It is a subject of extensive research; however, the genomics tools used in this research lack disease focus and are thus likely to miss potentially vital information contained in patients’ tissue samples. We have characterised the transcriptome of NSCLC and used this information to create a unique disease-focused microarray, the Lung Cancer DSA research tool. The tool allows interrogation of ~60,000 transcripts relevant to lung cancer, tens of thousands of which are unavailable on leading microarrays.
Non-small cell lung cancer (NSCLC) is the leading cause of cancer mortality worldwide, with poor differential diagnosis of the disease and low response rates to standard chemotherapy. It is therefore a subject of extensive research focused on the identification of reliable genomic biomarkers to aid in accurate classification of the disease and in predicting its progression and patients’ response to both available therapies and those in development. The powerful genomics tools used in this research, however, lack disease focus and are thus likely to miss potentially vital information contained in patients’ tissue samples. Through a combination of large-scale in-house sequencing, gene expression profiling, and mining of public sequence and gene expression data, we have characterised the transcriptome of NSCLC and used this information to create a unique disease-focused microarray, the Lung Cancer DSA research tool. Built on the Affymetrix GeneChip platform, the tool allows interrogation of ~60,000 transcripts relevant to lung cancer, tens of thousands of which are unavailable on leading commercial microarrays. Presented here are the array design process and the results of experiments carried out to demonstrate the array’s utility in biomarker discovery projects using NSCLC and normal samples.
Keywords: lung cancer, biomarker, microarray, transcript
Poster J18
Multiple instance learning allows MHC class II epitope predictions for alleles without experimental data
Nico Pfeifer, Oliver Kohlbacher
Division for Simulation of Biological Systems, Center for Bioinformatics Tuebingen, Eberhard Karls University Tuebingen, 72076 Tuebingen, Germany
The binding of external peptides to MHC class II is one of the key steps of the adaptive immune system. Reliable methods for predicting peptide binding are available for less than 7% of all known MHC class II alleles because of the lack of sufficient experimental data. We are able to build predictors for about two thirds of all MHC class II alleles without requiring binding data of the target allele. Predictions on 14 test alleles on a comparative benchmark dataset show very good performance compared to other predictors.
Motivation: In humans, the immune system has two arms: the innate immune system and the adaptive immune system. The adaptive immune system uses several signals to direct immune responses against foreign organisms. The major histocompatibility complex class II (MHCII) presents, on the membrane of antigen-presenting cells, foreign peptides that have been internalized from outside the cell. The recognition of these peptides by T-cells triggers a specific immune response, which helps to eradicate the foreign organism. Since there are many different MHCII alleles and every allele binds only a particular subset of all possible peptides, it is important to know whether a peptide binds to a particular MHCII allele or not. As little experimental data exists, there is a need for reliable prediction methods for peptide binding. Since different MHCII alleles are common in different ethnic groups, it is important to be able to predict MHCII-binding peptides for as many alleles as possible to reach the goal of personalized vaccine design. Current prediction models for peptide-MHCII binding are available for less than 7% of the known alleles. Methods: The binding groove of MHC class II is open, which is one of the main differences between MHC class I and MHC class II. This is why the length of peptides binding to MHC class II varies significantly (from 8 to more than 30 amino acids). Experimental studies showed [1] that there is a binding core of nine amino acids that binds specifically to MHC class II. Unfortunately, the binding core of most binding peptides is unknown, which complicates the prediction of peptide binding for MHC class II. Instead of committing to a single core, as all other methods in the field do, we use all possible putative binding cores and build a bag of binding cores for every peptide. These bags are then used in multiple instance regression [2] to train a nu-Support Vector Machine.
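Building a bag of putative binding cores amounts to listing every window of nine consecutive residues in a peptide. A minimal sketch follows; the function name and the example peptide are made up for illustration, and peptides shorter than the core length are assumed to yield an empty bag.

```python
def binding_core_bag(peptide, core_length=9):
    """Return every contiguous window of core_length residues in the peptide."""
    if len(peptide) < core_length:
        return []  # assumption: too short to contain a full core
    return [peptide[start:start + core_length]
            for start in range(len(peptide) - core_length + 1)]


# An 11-residue toy peptide contains three candidate 9-mer cores.
bag = binding_core_bag("ACDEFGHIKLM")
```

A peptide of length n yields n - 8 candidate cores, so longer peptides produce larger bags; the learning method then works on bags rather than on a single chosen core.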
For alleles with sufficient data, we use the normalized set kernel of Gaertner et al. [3] with an RBF kernel as the inner kernel function. Furthermore, we introduce a new kernel function that weights different positions by a certain factor. This enables us to weight the positions of the binding core differently, according to their importance. Results: The normalized set kernel together with this new kernel function enables us to build predictors for about two thirds of all MHC class II alleles. Evaluations on a benchmark dataset of 14 alleles for MHC class II [1] show that our method performs as well as or better than all other methods in the field if there is sufficient data for the target allele. Furthermore, our new method, which is not trained on any data of the target allele, performs comparably to the best methods in the field.
1. Wang, P., Sidney, J., Dow, C., Mothe, B., Sette, A., Peters, B.: A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput Biol 4(4) (Apr 2008) e1000048
2. Ray, S., Page, D.: Multiple instance regression. In: ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (2001) 425-432
3. Gaertner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In Sammut, C., Hoffmann, A.G., eds.: ICML, Morgan Kaufmann (2002) 179-186
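The set kernel with an RBF inner kernel, as named in the Methods, can be sketched as below. Several choices are assumptions rather than details given here: cores are one-hot encoded over the 20 standard amino acids, `gamma` is an arbitrary value, and the normalization divides by the square root of the two self-similarities (one of the normalizations used for set kernels). The position-weighted kernel is not sketched because its form is not specified in the text.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(core):
    """Encode a core as a flat 0/1 vector (assumed encoding)."""
    vector = []
    for residue in core:
        vector.extend(1.0 if residue == ref else 0.0 for ref in AMINO_ACIDS)
    return vector


def rbf(x, y, gamma=0.1):
    """Gaussian RBF kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def set_kernel(bag_x, bag_y, gamma=0.1):
    """Sum of inner-kernel values over all pairs of cores from the two bags."""
    return sum(rbf(one_hot(x), one_hot(y), gamma) for x in bag_x for y in bag_y)


def normalized_set_kernel(bag_x, bag_y, gamma=0.1):
    """Set kernel scaled so that every bag has similarity 1 with itself."""
    norm = math.sqrt(set_kernel(bag_x, bag_x, gamma) * set_kernel(bag_y, bag_y, gamma))
    return set_kernel(bag_x, bag_y, gamma) / norm


# Two toy bags of 9-mer cores (made-up sequences).
bag_a = ["ACDEFGHIK", "CDEFGHIKL"]
bag_b = ["ACDEFGHIK"]
similarity = normalized_set_kernel(bag_a, bag_b)
```

The resulting kernel matrix over all peptide bags could be passed to any support vector regression implementation that accepts precomputed kernels.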
Keywords: MHC class II, Immunoinformatics, Machine Learning, Kernel
Poster J19
Expression profiling of medulloblastoma Cancer Stem Cells (CSCs) in comparison with normal neural stem cells (NSCs)
Pillai R. (1), Pala M. (1), Scintu F. (1), Caria S. (1), Corno D. (2), Galli R. (2), Bulfone A. (1)
(1) bio))flag Srl, Scientific Park of Sardinia, Cagliari (Pula), Italy, (2) Stem Cell Research Institute (SCRI)- San Raffaele Scientific Institute, Milan, Italy
In this study we have characterized different lines of CSCs in a Ptch+/- mouse model resembling desmoplastic medulloblastoma, in comparison with NSCs isolated from neurogenic brain/cerebellum areas (SVZ and EGL). Expression profiling was performed using Affymetrix chips. The probe sets were analyzed using Bioconductor in the R environment, and the differentially expressed genes were represented by heat maps. We have also applied a bioinformatic system (Flag-trap) to identify putative secreted proteins and a computational method (GSEA) to discover significantly correlated pathways.
Medulloblastoma (MB) is the most common malignant brain tumor of childhood. It is thought to result from the malignant transformation of neural progenitors in the developing cerebellum, but little is known about its molecular pathogenesis. In particular, the desmoplastic variant seems to derive from the oncogenic alteration of external granule layer (EGL) precursors, whereas the classic variant appears to originate from IV ventricle progenitors. In this study we focused on the characterization of cancer stem cells (CSCs) in a mouse model (heterozygous mutants for the Ptch gene) resembling the desmoplastic variant of MB, compared with different lines of neural stem cells (NSCs). In particular, we isolated CSC lines from three different murine MBs cultured under standard conditions (exposure to the mitogens EGF and FGF2), and NSCs from neurogenic areas of the cerebellum (EGL) and from the subventricular zone (SVZ) at different time points after birth. Expression profiling experiments were performed using Affymetrix chips. The probe sets were ranked by the P value of a t-test, adjusted using the false discovery rate (Benjamini and Hochberg, 1995), and genes that were significantly up-regulated or down-regulated (P value < 0.005) were represented by heat maps generated using the complete linkage clustering method. We have also applied a proprietary bioinformatic system (Flag-trap), which performs high-throughput processing and analysis of biological data to identify the conserved functional domains of putative surface and secreted proteins, and a computational method (GSEA: gene set enrichment analysis), which determines whether a set of genes shows statistically significant differences between two biological states. The analysis of the differential gene expression patterns has provided clues to the origin of these tumors and may lead to the identification of new gene products or pathways to be targeted by future therapies.
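The false discovery rate adjustment mentioned above (Benjamini and Hochberg, 1995) can be sketched in a few lines. The analysis itself was run with Bioconductor in R; this Python version only illustrates the step-up procedure, and the probe-set names and P values in the example are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw t-test P values for five probe sets.
raw = {"probe_1": 0.001, "probe_2": 0.008, "probe_3": 0.039,
       "probe_4": 0.041, "probe_5": 0.042}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
significant = [name for name, q in adjusted.items() if q < 0.01]
```

The 0.01 cutoff in the last line is arbitrary and chosen only so that the toy example selects one probe set; the study applied a threshold of 0.005.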
Keywords: CSCs, NSCs, MB, SVZ, EGL, gene expression
Poster J20
Prediction of Human Disease Genes by Analysis of Conserved Coexpression
Ugo Ala (1), Rosario M. Piro (1), Elena Grassi (1), Christian Damasco (1), Lorenzo Silengo (1), Martin Oti (2), Paolo Provero (1), Ferdinando Di Cunto (1)
(1) Molecular Biotechnology Center, Department of Genetics, Biology and Biochemistry, University of Turin, Italy, (2) Department of Human Genetics and Centre for Molecular and Biomolecular Informatics, University Medical Centre Nijmegen, Nijmegen, The Netherlands
The identification of disease genes within orphan loci is a demanding task because they often contain hundreds of positional candidates. We present a method that identifies high-probability candidates among the positional candidates for a given disease as those that show significant evolutionarily conserved coexpression with genes already known to be involved in similar phenotypes. Our results demonstrate that conserved coexpression (between human and mouse) represents a very strong criterion to predict human disease genes. We propose candidates for 81 OMIM loci of unknown molecular basis.
The identification of disease genes within disease-associated loci is a very demanding task even in the post-genomic era, because orphan loci typically contain hundreds of positional candidates. Most published methods for disease gene prediction or prioritization rely on accurate gene annotation (e.g. Gene Ontology) or mine PubMed/MEDLINE abstracts to infer relations between genes and phenotypes; they are therefore strongly biased towards well-characterized genes and tend to overlook genes about which little is known. We present a method that exploits microarray gene expression data and a quantitative measure of similarity between human phenotypes to identify the best candidates among the positional candidates for a given disease as those that show significant coexpression with genes already known to be involved in similar phenotypes. Since the method uses a notion of similarity among phenotypes, it can also be applied to phenotypes of so far unknown molecular basis (as long as they are similar to other phenotypes of known molecular basis). Moreover, because it avoids information on gene annotation and previous research, the method is potentially much less biased towards consolidated knowledge, although current microarray platforms still have limitations and do not allow the evaluation of all positional candidates. Since microarray data can be very noisy, we focus on coexpression that is evolutionarily conserved between human and mouse and is therefore more likely to be biologically meaningful. For this purpose we construct a human-mouse conserved coexpression network and verify its biological meaning and its applicability to disease gene prediction by analyzing the prevalence of Gene Ontology terms, known interactions between human proteins and similar OMIM phenotypes within the network's coexpression clusters.
Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes. We propose high-probability candidates for 81 OMIM loci characterized by unknown molecular basis.
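The idea of conserved coexpression can be illustrated with a small sketch. It is not the published pipeline: the Pearson correlation cutoff of 0.8, the toy expression values, the gene names and the scoring of candidates by the number of links to known disease genes are all assumptions, and human and mouse genes are assumed to be already mapped to shared ortholog identifiers.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(expression, threshold):
    """Gene pairs whose expression profiles correlate at or above the threshold."""
    return {(a, b) for a, b in combinations(sorted(expression), 2)
            if pearson(expression[a], expression[b]) >= threshold}


def conserved_network(human, mouse, threshold=0.8):
    """Keep only the links that are present in both species."""
    return coexpressed_pairs(human, threshold) & coexpressed_pairs(mouse, threshold)


def score_candidates(network, candidates, known_disease_genes):
    """Count conserved links from each positional candidate to known disease genes."""
    scores = {gene: 0 for gene in candidates}
    for a, b in network:
        if a in scores and b in known_disease_genes:
            scores[a] += 1
        if b in scores and a in known_disease_genes:
            scores[b] += 1
    return scores


# Toy profiles: G1 and G2 are coexpressed in both species, G3 only in mouse.
human = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8.5], "G3": [4, 1, 3, 2]}
mouse = {"G1": [1, 3, 2, 5], "G2": [2, 6, 4, 10], "G3": [1, 3, 2, 5.1]}
network = conserved_network(human, mouse)
scores = score_candidates(network, ["G2", "G3"], known_disease_genes={"G1"})
```

In this toy case the mouse-only link to G3 is discarded, so only G2 is supported as a candidate; requiring conservation is what filters out such species-specific or noisy correlations.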
Keywords: disease genes, conserved coexpression
Poster J21
The Italian Network for Oncology Bioinformatics
Romano P. (1), Crescenzi M. (2)
(1) National Cancer Research Institute, Largo Benzi 10, Genova, Italy, (2) Italian National Institute of Health, Viale Regina Elena 299, Roma, Italy
Oncology research and clinics depend on data analysis and integration and on bioinformatics tools and expertise. Italian Cancer Comprehensive Centers did not yet develop properly resources, expertise and skills in Bioinformatics. The Italian Network for Oncology Bioinformatics has been funded with the objective of setting up an effective coordination of bioinformatics activities lead by partners. Concrete objectives refer to service provision, staff training, inter-institutional collaboration and identification of new ideas and projects, both for basic and for translational research.
Introduction It is well known that biomedical research will more and more depend on the analysis of all available information. Genomics and proteomics depend nowadays on high-throughput technologies and already heavily rely on automatic data analysis. New research areas are also emerging, like Clinical Bioinformatics, a discipline that tries to connect molecular and clinical information with the aim of orienting diagnosis and treatments towards personalized medicine. Clearly, oncology research and clinics also strictly depend on data analysis and integration and on bioinformatics tools and expertise. As a consequence, bioinformatics will become one of the most important research instrument for improving and making biological data analysis efficient. Italian Cancer Comprehensive Centers (CCCs, Istituti di Ricovero e Cura a Carattere Scientifico Oncologici, IRCCS) did not develop up to now a clear strategy in this field. Resources, expertise and skills on Bioinformatics are still lacking, but for some good exceptions. Instead, many International institutes for research on cancer already developed adequate infrastructures. This is the case, e.g., of the DKFZ, Heidelberg and of the CNIO, Madrid, not to mention the Center for Bioinformatics of NCI, Bethesda. Italian CCCs must therefore raise their bioinformatics skills and expertise up to a level that is adequate to biomedical research of next years. The alliance of Italian CCCs “Alleanza contro il Cancro” (ACC) includes about 10 Institutes none of which has the sufficient critical mass that all centers together can, instead, easily reach, For this reason, a coordination and cooperation network is the best option for achieving a thorough dialogue between bioinformaticians, biologists and medical doctors, an effective knowledge transfer between Institutes, the exploitation of tools and results of single Institutes and the capacity of facing next years issues. 
The Italian Network for Oncology Bioinformatics – RNBIO has therefore been funded for two years, during which it is planned that the basis for an effective coordination will be laid. In this poster, we present the objectives of the network, the methodology for their achievement, current activities and some initial results. Objectives The main objective of the Italian Network for Oncology Bioinformatics – RNBIO is the setting up of an effective coordination of bioinformatics activities lead by partner Institutes with the aim of integrating and improving current expertise. The concrete objectives of the network refer to service provision, staff training, inter-institutional collaboration and identification of new ideas and projects, both for basic and for translational research. The following goals are pursued: - promotion of innovative design and development methodologies and computer technologies, - coordination of research and development activities on topics of interest, - exploitation and promotion of tools developed by partners - pursuing of collaborations with Oncology Research Center of excellence and research networks and infrastructures, - pursuing of collaborations with Grid networks and High Performance Computing providers. Methods RNBIO will carry out many activities, including: - training courses on bioinformatics tools and ICT technologies - scientific seminars and workshops - working groups - collaborations with HPC infrastructures and Grid networks Training will include courses on the use of purpose tools and on bioinformatics programming skills. This twofold approach aims at improving skills both at the users’ level and at the developers’ level. The web site will aim at supporting both the promotion of the network and the project’s activities (coordination, collaborative development). Working groups are meant to be an important component of the network. 
Their main role is to bring together the expertise and skills of researchers and clinicians with various backgrounds, including bioinformaticians, biologists and medical doctors. They will provide a place for discussion, comparison of views and brainstorming between people interested in the same issues in a multidisciplinary frame. Their activity will also be carried out by using collaborative development tools.
Current status RNBIO started in October 2007 and devoted the first months of the project to its internal organization and to the reference web site, which is available at http://www.rnbio.it/. The working group topics were defined and the first training and scientific events were designed in these first months as well. The RNBIO web site was implemented using the Plone open source content management system and also includes a ZWiki implementation; all network partners have an account and can both contribute to the site and modify existing contributions. The web site currently includes a public and a private area. The public area is devoted to the presentation of network members, documents produced by the network, information on training activities and software developed by members. Two ‘smart folders’ list announcements included in the site. A folder is devoted to collaborative development of texts by means of the Wiki approach. The private area includes folders for the collaborative development of documents related to the activity of the working groups. Partners are defining the aims and objectives of the following working groups:
- Automation of in-silico data analysis processes
- Statistical methods for the analysis of molecular profiles
- Comparative sequence analysis
- Oncogenomics
- Oncoproteomics
- Structural bioinformatics (linked to ESFRI INSTRUCT initiative)
The first planned course is an introduction to the R statistical language.
The NETTAB 2008 workshop will include a session devoted to Oncology Bioinformatics where members will present their expertise and recent research to a wider scientific audience of bioinformaticians, biologists and computer scientists.
Keywords: oncology bioinformatics, education, automation of processes, research network
Poster J22
Microarray Meta-Analysis Highlights Neuro-Immune Signaling in Parkinson's Disease Patients
Lilach Soreq (1), Zvi Israel (2), Hagai Bergman (1,3), Hermona Soreq (4,3)
(1) Department of Physiology, The Hebrew University-Hadassah Medical School, Jerusalem, Israel, (2) Department of Neurosurgery, Hadassah University Hospital, Jerusalem, Israel, (3) Inter-disciplinary Center of Computational Neuroscience, (4) Department of Biological Chemistry, The Institute of Life Sciences, The Hebrew University of Jerusalem
No blood test is yet available to detect Parkinson's disease (PD); identification of blood PD biomarkers is therefore highly important. We combined statistical, mathematical and gene ontology analyses to re-examine microarray data from nucleated blood cells of early PD patients and matched healthy and neurological disease controls. Outliers and a scan-date batch effect were detected and corrected, thus improving the discriminative power for PD. Our findings point to neuro-immune signaling-related transcripts as distinctly expressed in early PD and call for exploiting microarray tests for follow-up of PD treatment.
Laboratory tests for Parkinson's disease (PD) were recently extended to microarray analyses of nucleated blood cells. No blood test is yet available to detect early PD, and the nature of the peripheral changes involved is not yet clear. Therefore, identification of biomarkers for early PD in the blood is highly important. Here we report the combined application of statistical, mathematical and gene ontology analyses to re-examine microarray data from 105 early PD patients and matched controls. Both outlier arrays and a scan-date batch effect were detected. Distribution plots and PCA mapping enabled correction of these errors, which improved the discriminative power for PD blood cells compared to healthy and neurological disease controls. Combined with gene ontology tests, our findings point to neuro-immune signaling-related transcripts as distinctly expressed early in PD progression and call for exploiting microarray tests also for follow-up of PD treatment efficacy.
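As an illustration of what a scan-date batch correction can look like, the sketch below subtracts from each gene the mean of its scan-date batch. This is a stand-in technique: the abstract states only that distribution plots and PCA mapping were used, not how the correction was computed. The function name, the data layout and the toy values are hypothetical, and per-batch centering is only appropriate when disease status is not confounded with scan date.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(expression, batches):
    """Subtract, gene by gene, the mean of each batch from its arrays.

    expression: list of arrays, each a list of per-gene values
                (hypothetical layout, one inner list per microarray)
    batches:    one batch label (e.g. a scan date) per array
    """
    # Group array indices by their batch label.
    groups = defaultdict(list)
    for index, label in enumerate(batches):
        groups[label].append(index)

    n_genes = len(expression[0])
    corrected = [list(row) for row in expression]
    for indices in groups.values():
        for gene in range(n_genes):
            batch_mean = mean(expression[i][gene] for i in indices)
            for i in indices:
                corrected[i][gene] = expression[i][gene] - batch_mean
    return corrected


# Toy data (invented): two genes, four arrays, the second scan date
# shifted upwards by one unit on every gene.
expression = [
    [5.0, 7.0],
    [5.2, 7.4],
    [6.0, 8.0],
    [6.2, 8.4],
]
batches = ["day1", "day1", "day2", "day2"]
corrected = center_by_batch(expression, batches)
```

After centering, arrays that differed only by the scan-date shift carry the same values, so any remaining separation between groups is no longer driven by the batch offset.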
Keywords: microarray Parkinson's disease blood gene ontology batch effect
Poster J23
Computational analysis of in vitro screening data highlights an atypical cytostatic mechanism of a cytosine derivative
Fran Supek, Marijeta Kralj, Biserka Zinic, Tomislav Smuc
Rudjer Boskovic Institute
An interdisciplinary approach, combining screening for cytostatic activity on a human tumor cell line panel, flow cytometry and machine learning methods (self-organizing maps and the Random Forest classifier) indicates that the pyrimidine derivative 1-tosylcytosine may have a biological mechanism of action atypical for nucleobase derivatives.
Previously, pyrimidine nucleic base derivatives with a sulfonamide pharmacophore have been indicated as potential antitumor agents. We have shown that N-1-sulfonylpyrimidine derivatives have strong antiproliferative activity on human tumor cell lines; in particular, 1-(p-toluenesulfonyl)cytosine (TsC) was shown to have a selective effect with regard to normal cells and was easily synthesized on a large scale. Past experiments using radio-assays of enzyme activity have indicated that TsC induces a general shutdown of cellular DNA, RNA and protein biosynthesis. In the present work we have used an interdisciplinary approach to further elucidate the compound’s mechanistic class. First, we have employed an augmented number of cell lines (eleven), of which eight overlapped with the DTP-NCI screening panel, and one non-transformed human fibroblast cell line. This has allowed us to computationally search for compounds with similar activity profiles and/or mechanistic class by integrating our data with the comprehensive DTP-NCI database; a permutation testing procedure was used to estimate the statistical significance of the matches. We have applied a supervised data mining methodology (a Random Forest classifier), which yields a prediction of the mechanism of action along with estimates of predictive reliability. When using only a subset of the full DTP-NCI 60 cell line panel, this approach may complement the information obtained from self-organizing maps (SOM), a method commonly used in examinations of cytostatic activity profiles. Finally, we have performed cell cycle perturbation and apoptosis analysis of the most sensitive cell line (MCF-7), which has shown marked G1 phase arrest accompanied by a reduction in the number of cells in S phase. As expected, TsC did not alter the cell cycle of normal cells.
Our results point to an unusual mechanism of cytostatic action, possibly a combination of nucleic acid antimetabolite activity and a novel molecular mechanism. We hypothesize that the novel mechanism might be similar to the activity of benzothiazoles, previously described as involving the aryl hydrocarbon receptor, activation of CYP1A1 and CYP1B1 genes and a subsequent DNA damage response. Our work has been published in the “Investigational New Drugs” journal, Volume 26 (April 2008), pages 97-110, available from http://dx.doi.org/10.1007/s10637-007-9084-1
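A permutation test of the kind mentioned for the activity-profile matches can be sketched as follows. The similarity measure (Pearson correlation), the number of permutations and the profile values are assumptions made for illustration; the abstract does not specify them. The sketch shuffles the query profile across the shared cell lines and asks how often a random assignment correlates with the reference profile at least as well as the observed one.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def permutation_p_value(query, reference, n_perm=2000, seed=0):
    """One-sided permutation p-value for the observed correlation."""
    rng = random.Random(seed)
    observed = pearson(query, reference)
    shuffled = list(query)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(shuffled, reference) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so the
    # estimate can never be exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Invented log-scale activity values on eight shared cell lines.
query = [-5.1, -4.8, -6.0, -5.5, -4.2, -5.9, -4.6, -5.3]
reference = [-5.0, -4.9, -6.1, -5.4, -4.3, -5.8, -4.7, -5.2]
observed, p_value = permutation_p_value(query, reference)
```

With only eight shared cell lines there are 40320 possible orderings, so sampling a few thousand permutations already gives a usable estimate; screening many database compounds would additionally call for a multiple-testing correction.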
Keywords: nucleobase, antitumor compound, cell line screen, random forest
Poster J24
In Silico Study of wild-type and Raltegravir-selected Mutants of HIV-1: Structural and DNA Recognition Properties
Tchertanov L., Mouscadet J.-F.
LBPA, CNRS, Ecole Normale Supérieure de Cachan, 61 av. Président Wilson, 94235 Cachan, France
In silico study of the wild-type HIV-1 IN and its mutants was performed in order to determine the molecular effects triggered by the mutations leading to resistance to raltegravir. The most important difference between the WT and mutated INs is related to their recognition by DNA bases. We propose that raltegravir is a bio-isostere of adenine which acts by competing with DNA for residues N155 and/or Q148. In order to thwart this inhibitory effect, the virus may have to select mutations that maintain the integrity of IN structure while allowing alternative modes for DNA recognition.
The HIV-1 integrase (IN) catalyses the integration of the viral cDNA into the host cell chromosome and therefore has great potential as a target for anti-HIV drugs. Raltegravir is the first effective antiretroviral agent belonging to the novel class of HIV-1 IN inhibitors. Raltegravir resistance has been associated with two genetic pathways, defined by the mutations N155H or Q148H/R/K, which reduce susceptibility by 10- and 25-fold, respectively. Both in vitro and in vivo, the evolution of additional mutations resulted in high-level resistance (Q148H/R/K plus E138K, G140S/A). While certain secondary mutations appear to counteract replication defects associated with specific N155 or Q148 mutations, all secondary changes studied resulted in increased resistance. These results suggest that a single mutation may not be sufficient to confer full resistance. Amino acid substitution is considered the key to HIV-1 integrase drug resistance because it significantly lowers the binding affinity of the integrase inhibitor for its nucleoprotein target. Consequently, comparing the recognition properties and affinity of IN inhibitors for native and mutated INs is crucial for understanding the resistance mechanism and, subsequently, for developing novel inhibitor candidates. Our study focused on the identification of the molecular effects induced by the mutations leading to resistance to raltegravir. We performed an in silico study of the structural features and recognition properties of both wild-type and mutated INs. We performed 2D prediction and 3D molecular modelling: (i) to establish the folding of the catalytic loop of the IN; (ii) to probe the influence of the Mg2+ co-factor and the mode of its binding on the catalytic core structure; and (iii) to study the structural effects of the drug-induced mutations. 2D prediction was made using neural network and nearest-neighbor approaches.
3D models were generated from the crystal structures by homology modeling or by amino acid replacement (SYBYL Molecular Modelling Software). Recognition of the DNA bases by the wild-type and mutated residues was analyzed using the PDB and IsoStar databases, and the results are represented as 3D maps of the side-chain distribution around the DNA base pairs AT and GC. We observed that the native and mutated INs share the same overall enzyme fold. The N155H mutation does not affect the catalytic site structure, triggering only local conformational reorganization. We found that the 140-149 catalytic loop is characterised by a striking conservation of a structural element involving residues 144-148, the omega-shaped hairpin, stabilised by multiple H-bonds spanning the loop. Folding of the hairpin originates in the strong conformational preferences of N, P, Q, S and G to form such turns. We showed that this hairpin is topologically invariant with respect to the Mg2+ population, the mode of Mg2+ binding in the catalytic site and the raltegravir-induced mutations. The hairpin is displaced from 16 to 4.5 Å towards the active site as a rigid body in a gate-like manner. Our modeling trials of the wild-type and resistant enzymes allowed us to evaluate the structural effects induced by the key raltegravir-resistance mutations. We conclude that the N155H mutation (where His is the N epsilon-2H tautomer) allows a strict conservation of the active site structure. In sharp contrast with the important structural changes of the catalytic loop induced by the engineered mutagenesis modifications G140A/G149A, the raltegravir-selected mutations G140S and Q148R/H/K conserve all structural features of the wild-type catalytic loop.
Such results could indicate that it is important for the catalytic loop to conserve its high degree of flexibility, and that the omega-shaped hairpin on the catalytic loop has a crucial functional role as a structural element; the virus selects mutations that maintain all the functions of the catalytic loop. The most important difference between the wild-type and mutated IN is related to their recognition by DNA bases. The native N155 and Q148 show a clear preference for binding adenine, interacting through a pair of strong H-bonds with two binding sites at the major groove. The mutated R, K and H strongly favor pyrimidines. Furthermore, we observed that the secondary mutation G140S, which is readily observed following the selection of the primary Q148 mutation, modified the mobility of the catalytic loop, thereby favoring the change of base specificity induced by the Q148R/H/K mutation. Molecular recognition of the DNA bases by the wild-type and mutated residues sheds light on the specificity of IN-DNA interactions. Integrase must be able not only to recognize the DNA but also to discriminate between the DNA bases. Our study suggests that raltegravir is a bio-isostere of adenine which acts by competing with DNA for residues N155 and/or Q148. In order to thwart this inhibitory effect, the virus may have to select mutations that maintain the integrity of IN structure while allowing alternative modes of DNA recognition. An obvious application of our models is their use in docking to guide the placement of the DNA into its receptor site on IN. The spatial orientation of the side chains around the DNA bases can be correlated with the mapping of different frames from a molecular-dynamics run. The data obtained might offer opportunities for designing novel HIV-1 IN inhibitors that would retain anti-viral activity against the emerging HIV-1 mutants.
Keywords: HIV IN, mutant 3D, hairpin, DNA recognition
Poster J25
Gene prioritization through genomic data fusion: algorithm and applications.
Leon-Charles Tranchevent (1), Stein Aerts (2), Bernard Thienpont (3), Peter Van Loo (4), Shi Yu (1), Bert Coessens (1), Roland Barriot (1), Steven Van Vooren (1), Bassem Hassam (2), Yves Moreau (1)
(1) Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven (Belgium), (2) Department of Human Genetics, Katholieke Universiteit School of Medicine, Leuven (Belgium); Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven (Belgium), (3) Center for Human Genetics, Katholieke Universiteit Leuven (Belgium), (4) Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven (Belgium); Human Genome Laboratory, Department of Molecular and Developmental Genetics, VIB Leuven (Belgium); Department of Human Genetics, Katholieke Universiteit School of Med
Genome-wide experimental methods to identify disease genes (such as linkage analysis and association studies) often generate large lists of candidate genes, of which only a few are interesting. Endeavour (http://www.esat.kuleuven.be/endeavour), a web resource for the prioritization of genes, indicates which genes are the most promising. Our approach relies on gene similarity; it is based on evidence suggesting that similar phenotypes are caused by genes with similar functions.
Genome-wide experimental methods to identify disease genes (such as linkage analysis and association studies) often generate large lists of candidate genes, of which only a few are interesting. Endeavour (http://www.esat.kuleuven.be/endeavour), a web resource for the prioritization of genes, indicates which genes are the most promising. Our approach relies on gene similarity; it is based on evidence suggesting that similar phenotypes are caused by genes with similar functions. Our algorithm consists of (i) inferring several models (based on various genomic data sources) from a training set of genes, (ii) applying each model to the candidate genes to rank them and (iii) merging the several rankings into a final ranking of the candidate genes. Recently, we have extended Endeavour to make it a multiple-species tool. The tool currently supports Homo sapiens, Drosophila melanogaster, Mus musculus, Rattus norvegicus and Caenorhabditis elegans. As a first validation, we collected several gene-disease associations recently reported in the literature (in Nature Genetics 2008). Then, we used Endeavour to prioritize the regions that contain the disease genes and their 99 closest neighbours. Results show that, using data from 6 months prior to publication, Endeavour ranks the correct disease genes in the top 15%; 5 out of 7 are even in the top 5%. As a functional validation, Endeavour was used to optimize a genetic screen performed in Drosophila melanogaster. The aim was to find genes that interact physically with atonal, a proneural transcription factor involved in development. The regions reported as positive by the genetic screen contain hundreds of genes, of which only one or a few are the real genetic interactors. Twelve regions were prioritized and the most promising candidates were validated. Results showed that Endeavour ranks the true interactors in the top 15% of the regions. We next applied this concept to heart disorders.
Starting from patients with heart defects for which the causal gene is unknown, regions of interest were defined using array-CGH technology. Then, we prioritized them to find the most promising candidates for further experiments. The first validations show that Endeavour ranks the best candidates at the top, thus decreasing the cost of validation. In conclusion, we present Endeavour, an experimentally validated framework that can prioritize selected candidate genes or whole genomes in five major organisms.
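The three steps of the algorithm, (i) model building from training genes, (ii) per-source ranking of candidates and (iii) merging of the rankings, can be illustrated with a minimal sketch. Everything specific in it is an assumption: the data sources are reduced to sets of annotation terms, the similarity is a Jaccard index, and the rankings are merged by averaging rank ratios. The abstract does not state which models or which merging rule Endeavour uses, and the gene names and annotations are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_ratios(scores):
    """Map each gene to rank / n, so the best-scoring gene gets 1 / n."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {gene: (position + 1) / n for position, gene in enumerate(ordered)}


def prioritize(training, candidates, sources):
    """Rank candidates by similarity to the training genes.

    sources: {source name: {gene: set of annotation terms}}
    """
    per_source = []
    for annotations in sources.values():
        # (i) Model: pooled annotations of the training genes.
        model = set().union(*(annotations.get(g, set()) for g in training))
        # (ii) Score each candidate against the model and rank.
        scores = {g: jaccard(annotations.get(g, set()), model)
                  for g in candidates}
        per_source.append(rank_ratios(scores))
    # (iii) Merge: average rank ratio over all sources (lower is better).
    fused = {g: sum(r[g] for r in per_source) / len(per_source)
             for g in candidates}
    return sorted(candidates, key=fused.get)


# Invented annotations for two training genes and three candidates.
sources = {
    "ontology": {
        "T1": {"heart", "development"}, "T2": {"heart"},
        "C1": {"heart", "development"}, "C2": {"immune"},
        "C3": {"development"},
    },
    "pathway": {
        "T1": {"wnt"}, "T2": {"wnt", "notch"},
        "C1": {"notch"}, "C2": {"mapk"}, "C3": {"wnt", "mapk"},
    },
}
ranking = prioritize(["T1", "T2"], ["C1", "C2", "C3"], sources)
print(ranking)  # ['C1', 'C3', 'C2']
```

Working with rank ratios rather than raw scores keeps heterogeneous data sources comparable, since each source contributes only an ordering of the candidates.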
Keywords: gene prioritization, gene disease association
Poster J26
INNOVATION in bioinformatics education: tailor-made course material from high school to PhD
Van Gelder C. (1,2), Joosten R. (1), Guillard M. (1,3), Venselaar H. (1), Vriend G. (1)
(1) Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, PO Box 9101, 6500 HB Nijmegen, the Netherlands, (2) Netherlands Bioinformatics Centre, PO Box 9101, 6500 HB Nijmegen, the Netherlands, (3) Laboratory of Pediatrics and Neurology, Radboud University Nijmegen Medical Centre, PO Box 9101, 6500 HB Nijmegen, the Netherlands
Bioinformatics educators teach students whose backgrounds range from biology to mathematics. The CMBI develops bioinformatics education methods to reach all of them. The poster will show: * A bioinformatics education content management system, including bioinformatics-specific features and personalization options. * Courses for many different student groups. * Bioinformatics@school (www.bioinformatics-at-school.eu), part of an education innovation award-winning project, which brings bioinformatics practicals to thousands of high school pupils and their teachers. All material is freely available.
The target audience for bioinformatics education is very heterogeneous Bioinformatics is interdisciplinary in nature. Target groups for bioinformatics education therefore widely differ both in scientific background (e.g. biology, informatics, medicine, etc.) and level (BSc, MSc, PhD, high school, or even the general public). Bioinformatics is a discipline that seems well-suited for internet-based education (e-learning). The course material requirements for e-learning may differ greatly when compared to material used for in-the-class-room courses. Education efforts For more than ten years, we have been developing course material and have been giving courses to very different audiences, ranging from the general public to PhD students. Our efforts are directed towards making educational material available in a modular, flexible, and reusable manner. All our material, including teacher notes, is freely available, either as a complete course or as individual modules. Bioinformatics-specific content management system We have experimented with the implementation of course material in two commercial content management systems, with personalization of course material depending on the background of the students, and with multi-media (audio and video material of lectures) options. Problems encountered were e.g. the inability to make last minute changes (which, unfortunately, but unavoidably, is crucial in education), non-atomicity of the underlying material which precludes re-usage, and non-functioning export options (which are vital if you want to exchange course material between systems). We decided not to stick to one of the existing systems but to develop a flexible and easy-to-use education content management system, CMbiS (Content Management BioInformatics System). We implemented several courses in a very modular fashion (1). 
The level of the courses differs but most of them are part of the BSc and/or MSc curricula of bioinformatics, chemistry, molecular life sciences, and biology at the Radboud University Nijmegen. CMbiS contains the following features: * modularization of teaching material in educational units. * personalization for different target groups. * password protection of (the answers of) all individual parts of the material, allowing full flexibility towards the audience. * easy-to-handle for the teacher. * print-out possibilities (PDF). * export of the material in XML (in progress). * audio and video fragments of lectures, complicated concepts, and test-exam questions. * a Wiki (2) for bioinformatics educational support, which can also be used fully stand-alone and can easily be incorporated in other systems. This Wiki is also used extensively to make results of collaborative projects with medical professionals more easily accessible to them. Bioinformatics@school We are working on the use of the internet for education innovation. Examples include Mol4D (3) (web-based organic chemistry, European Academic Software Award in 2004) and SAMSAM (Students Access Molecular Structures And Modelling). More recently the project Bioinformatics@school (4) was initiated together with the Netherlands Bioinformatics Centre (5). The project is part of a national project “DNA labs on the road” (6,7) which in 2008 received a national education innovation award from the Netherlands Biotechnological Society. All the material is freely available through the website (4). The goal of bioinformatics@school is to bring bioinformatics to high school students and their teachers. Since 2006 our bioinformatics practical is going on tour to Dutch high schools to demonstrate that it is fun (!) to work with DNA, genes, and proteins, and that genomics and bioinformatics research have a link to themes in every-day life. 
Over 5000 pupils (200 classes) have already done this practical in their biology or chemistry classes. The practical and theoretical teaching material of bioinformatics@school contains several modules of about 45-60 minutes, which can be used together or separately (depending on the teacher’s needs): Bioinformatics theory modules: 1. The basics: Background material to freshen up your knowledge about DNA and proteins. 2. Bioinformatics: An introduction to this field of science. Bioinformatics in practice modules: 1. Murder at the airport: Become a Crime Scene Investigator and investigate the cause of death of an American tourist in Amsterdam (techniques used: MRS-BLAST(8), database searching). 2. 3D drug design: Use 3-dimensional protein models to design an antidote against a deadly poison (techniques used: Drug design, Yasara(9) 3D visualization). 3. Retinitis pigmentosa: See how a mutation can affect your vision (10). 4. EEC in 3D: Explore the protein structure of the p63 protein in the EEC disease (10) (technique used: Yasara). Bioinformatics “close to home” modules (10): 1. Bioinformatics shop: Reflection about products developed with the use of bioinformatics. 2. Tips & tricks for making an essay or science project about bioinformatics. References: 1. http://swift.cmbi.ru.nl/teach/courses/. 2. http://wiki.cmbi.ru.nl. 3. http://www.cmbi.ru.nl/wetche/organic/. 4. http://www.bioinformatica-in-de-klas.nl (Dutch version) and http://www.bioinformatics-at-school.eu (English version). 5. http://www.nbic.nl. 6. http://www.dnalabs.nl and http://www.dnalabs.eu. 7. http://scienceinschool.org/2007/issue6/dnalabs/. 8. MRS: A fast and compact retrieval system for biological data. Hekkelman M.L., Vriend G., Nucleic Acids Research 2005 33(Web Server issue):W766-W769, http://mrs.cmbi.ru.nl. 9. Yasara (Yet Another Artificial Reality Application), http://www.yasara.org. 10. 
Module available in Dutch through www.bioinformatica-in-de-klas.nl, english version under construction on www.bioinformatics-at-school.eu.
Keywords: Bioinformatics education, courses, CMS
Poster J27
Consensus Filtering of Narrow DNA Aberrations in SNP Array Data: Proof of Principle
Gerard Wong (1,2), Christopher Leckie (1,2), Ian Campbell (2,3), Kylie Gorringe (3), Izhak Haviv (3), Adam Kowalczyk (1,2)
(1) NICTA, Victoria Research Laboratory, Parkville, Victoria, Australia., (2) The University of Melbourne, Victoria, Australia., (3) Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia.
Copy number aberration and loss of heterozygosity are forms of DNA aberration commonly observed in cancer. We propose a novel statistical approach to assess consistency across multiple samples to reveal these changes. We utilize rank calibration across the whole genome, within each chromosome and chromosome arm, as well as bipolar calibration to mitigate the influence of noise. The results of our approach on lung adenocarcinoma data highlight statistically significant narrow regions of aberration as well as differences across phenotype subgroups such as gender, tumour grade and stage.
Motivation: The ability to examine the human genome at high resolution has been enhanced with the introduction of microarray technology with inter-probe distances of less than 1 kilobase in recent releases of the Affymetrix SNP arrays. Mutations in the genome that lead to uncontrolled cell growth and replication are hallmarks of cancer. The ability to identify these mutations will assist our understanding of the pathogenesis of cancer, particularly if they are consistent across multiple tumour samples. Narrow regions of change in the human genome often go undetected as algorithms tend to regard individual outlying points as noise and exclude them from the analysis. We address the presence of noise at the sample level with various calibration techniques and compute a set of independent statistics to elucidate consensus change across the genome down to the resolution of a single probe. Results: Applying our methodology to the Tumour Sequencing Project dataset on lung adenocarcinoma (LA dataset, Weir et al.), we are able to detect many hundreds of narrow consensus peaks sitting above the Bonferroni-correction threshold. Many identified peaks reside in regions of widely-implicated oncogenes and tumour suppressor genes. We have also identified novel regions of aberration prompting the need for further biological verification. Our results also show examples of differential peaks between phenotypes, most notably between gender, which agrees with known clinico-pathological gender differences in lung cancer. Differences between tumour grades and stages in lung adenocarcinoma were also highlighted. Outline: A number of methods have been proposed to detect regions of significant copy number aberration and loss of heterozygosity (LOH), including GLAD (Hupe et al.), GISTIC (Beroukhim et al.) and Hidden Markov Model-based methods among others. 
To our knowledge, these methods focus predominantly on the detection of a small number of copy number change points in individual samples before progressing to analysis across multiple samples. In order to remove noise in the data, the small regions of copy number change spanning a few probes that are inconsistent with the readings for the wider neighborhood are usually “smoothed out” or omitted from further analysis. For instance, GLAD- and GISTIC-based analysis of the LA dataset explicitly removed segments less than eight SNPs in length (Weir et al.). As a result, only a small number of regions of copy number change are accepted (in the order of tens only). In contrast, our approach concentrates on detection of micro-regions of aberration, down to the resolution of a single SNP probe. Obviously, we cannot rely on consistency within a single sample for combating noise, so instead we utilise consistency across multiple samples as the basis of our filtering technique. This approach involves the computation of a set of independent statistics across multiple samples to elucidate concordant regions of copy number change and loss of heterozygosity with estimated p-values significantly smaller than the Bonferroni correction threshold, a conservative correction for multiple testing. The initial validation of our method reported here focuses on the LA dataset. We have found DNA regions of significant aberration that overlap with the results in (Weir et al.) and also some novel regions of significant aberration. Our results show that we are able to detect regions of statistically significant micro-deletion and micro-amplification, down to the resolution of a single probe. In particular, our approach has identified significant differential peaks between phenotypes such as tumour grade, tumour stage and gender for LA samples.
Several positively identified SNPs with paralogs on chromosome Y provided us with natural positive controls demonstrating the sensitivity of our approach for the detection of copy number change in narrow, single-probe regions. A number of identified LOH consensus peaks in our analysis were found upstream of genes, which could be indicative of loss of heterozygosity in promoter regions. These results need to be experimentally validated. The significance of our methodology was independently corroborated by analysis of synthetic data and through the use of an expanded ovarian cancer dataset from (Gorringe, et al.). Conclusion: While other approaches such as GISTIC are driven by the amplitudes and frequency of a limited percentage of samples, we take a complementary approach to provide a consensus analysis across all samples in identifying significant narrow regions that are consistently amplified or deleted in the sample space. Our results are largely orthogonal and complementary to all methods for copy number analysis known to us, which often regard micro-regions of change as outliers and exclude them from their analysis. However, we argue that statistically significant micro-regions can be identified from analysis across multiple samples. These micro-regions of aberration could be indicative of concealed biology which may not have otherwise surfaced through the application of other techniques. References R. Beroukhim, et al., Proc Natl Acad Sci U S A, 104(50):20007–12, 2007. K. L. Gorringe, et al., Clin Cancer Res, 13(16):4731–9, 2007. P. Hupe, et al., Bioinformatics, 20(18):3413–22, 2004. B. A. Weir, et al., Nature, 450(7171):893–8, 2007.
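The idea of testing each probe for consistency across samples against a Bonferroni threshold can be illustrated with a deliberately simple stand-in: a one-sided sign test per probe. This is not the set of statistics used in the work, which the abstract does not specify. The rule that a gain is any positive log2 ratio, the null probability of 0.5 and the toy data are assumptions made for this sketch only.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def consensus_gains(log_ratios, alpha=0.05):
    """Return (probe index, p-value) for probes gained consistently.

    log_ratios: list of samples, each a list of per-probe log2 ratios
                (hypothetical layout)
    """
    n_samples = len(log_ratios)
    n_probes = len(log_ratios[0])
    threshold = alpha / n_probes  # Bonferroni: one test per probe
    significant = []
    for probe in range(n_probes):
        # Count the samples in which this single probe shows a gain.
        gains = sum(1 for sample in log_ratios if sample[probe] > 0)
        p_value = binomial_tail(gains, n_samples)
        if p_value < threshold:
            significant.append((probe, p_value))
    return significant


# Invented data: 12 samples, 4 probes. Probe 2 is gained in every
# sample; the other probes alternate in sign from sample to sample.
samples = [
    [(-1) ** s * 0.1, (-1) ** (s + 1) * 0.2, 0.8, (-1) ** s * 0.3]
    for s in range(12)
]
significant = consensus_gains(samples)
print(significant)  # [(2, 0.000244140625)]
```

A single probe can reach significance here because the evidence comes from agreement across samples rather than from neighbouring probes, which is why no within-sample smoothing is needed.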
Keywords: Copy Number, LOH, Aberration, Statistics