Systems Biology of Drug Perturbations

since 2010, carried out at EMBL


By combining and integrating different types of data (e.g., molecular in vitro data, data from cell-based assays, and complex phenotypic data such as side effects), we aim to gain a better understanding of drug mechanisms of action. Ideally, this will aid the interpretation of drug action in many contexts, from the molecular level (for instance, drug-target interaction) to the whole organism (e.g., establishing links between side effects and cellular pathways: Brouwers et al. PLoS One, 2011; see also Iskar et al. Curr. Opin. Biotechnol., 2011). To this end, we develop methods to predict protein targets, response pathways, or side effects of drugs from complex cell-based assays or organism-scale data, and we analyze these predictive models to reveal the underlying biological mechanisms. We see these methods as rational approaches to drug repositioning and in silico drug safety assessment.

Currently, we focus on cellular assays of expression changes upon chemical perturbations: The Connectivity Map records gene expression (for >10,000 genes) in several cell lines for hundreds of small molecules. Analysis of such complex multi-parametric read-outs requires intelligent data reduction and feature selection techniques in order to derive predictive molecular signatures of drug action. Ideally, these molecular signatures will facilitate interpretation in a cellular context and help establish links to organismal phenotypes.
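To make the notion of a molecular signature concrete, the following minimal Python sketch ranks genes by a t-like score of their log expression ratios across replicate treatments and keeps the strongest ones. It is an illustrative assumption, not the Connectivity Map procedure or our actual method: the gene names, the values, and the `derive_signature` helper are all hypothetical.

```python
from statistics import mean, stdev

def derive_signature(log_ratios, top_k=2):
    """Rank genes by a t-like score (mean / standard error) and keep the top_k.

    log_ratios maps a gene name to its log expression ratios (treated vs.
    control) across replicate experiments. Hypothetical helper for illustration.
    """
    scores = {}
    for gene, values in log_ratios.items():
        spread = stdev(values) or 1e-9  # guard against zero spread
        scores[gene] = mean(values) / (spread / len(values) ** 0.5)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return [(g, "up" if scores[g] > 0 else "down") for g in ranked[:top_k]]

# Toy data (made up): two genes respond consistently, two do not.
log_ratios = {
    "GENE_A": [2.1, 1.9, 2.0],
    "GENE_B": [-1.5, -1.7, -1.6],
    "GENE_C": [0.1, -0.2, 0.1],
    "GENE_D": [0.9, -0.8, 0.2],
}
signature = derive_signature(log_ratios, top_k=2)
print(signature)
```

Genes with a consistent response across replicates dominate the ranking, whereas genes with a large but erratic response (such as `GENE_D` here) are penalized by their spread.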

Metagenomic Markers of Host States

since 2010, carried out at EMBL


Metagenomics projects targeting the microbiota living in and on the human body seek to reveal interactions between states of the host (such as health status, dietary habits, age, body mass index, etc.) and its commensal microbes. The composition of gut microbiota can differ markedly between individuals, but there appear to be only a few equilibrium states ("enterotypes"), which can be revealed by clustering (Arumugam et al. Nature, 2011). However, the biological role of these enterotypes and how they relate to host properties is less clear. Together with many colleagues from Peer Bork's group and the MetaHIT consortium, I work to derive signatures from the abundances of certain microbes or from functions encoded in their metagenome that are predictive of host states. To this end, I am experimenting with regression and classification techniques that are well-suited for feature selection and interpretation.
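One family of techniques that lends itself to feature selection is L1-regularized (sparse) classification, where uninformative features receive a weight of exactly zero. The sketch below fits such a logistic regression by proximal gradient descent in plain Python. Whether this matches the models used in the project is an assumption; the data are made up (centered, scaled abundances of three hypothetical taxa), and `fit_sparse_logistic` is an illustrative helper, not a library function.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_sparse_logistic(X, y, l1_penalty=0.1, learning_rate=0.1, n_iter=500):
    """L1-regularized logistic regression via proximal gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            score = bias + sum(w * x for w, x in zip(weights, row))
            residual = 1.0 / (1.0 + math.exp(-score)) - label
            grad_b += residual / n_samples
            for j, x in enumerate(row):
                grad_w[j] += residual * x / n_samples
        bias -= learning_rate * grad_b
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        weights = [
            soft_threshold(w - learning_rate * g, learning_rate * l1_penalty)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias

# Toy data (made up): only taxon 0 differs between the two host states.
X = [
    [1, 1, 1], [1, -1, -1], [1, 1, -1], [1, -1, 1],      # host state 1
    [-1, 1, 1], [-1, -1, -1], [-1, 1, -1], [-1, -1, 1],  # host state 0
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_sparse_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(selected, weights)
```

The non-zero weights form the signature: here only taxon 0 is retained, and the sign of its weight indicates the direction of the association with the host state.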

Gene Finding

since 2006, carried out at FML


In Gunnar Rätsch's group, we developed a novel and very accurate gene finding system called mGene, which uses the latest advances in machine learning, namely a discriminative structure prediction technique called hidden semi-Markov SVMs (Schweikert et al., Genome Res., 2009). mGene outperformed other gene finders in the nGASP competition on nematode genome annotation (on average over all evaluation criteria) (Coghlan et al., BMC Bioinf., 2008). Recently, it was extended to additionally take advantage of direct transcriptome measurements from tiling arrays or RNA-seq alignments, leading to improved prediction performance. mGene.web makes this system available as a Galaxy web service (Schweikert et al., NAR, 2009).

For the initial version of mGene, with which we participated in nGASP, I contributed to the development of sensors recognizing the polyadenylation site, as well as sensors that distinguish between different segments (for example, exons and introns) based on DNA sequence content. Moreover, I developed sensors discriminating in-frame nucleotide composition from sequences (frames) that are not translated (Schweikert et al., Genome Res., 2009).

Currently, we are investigating how discriminative structure prediction algorithms can be applied to gene prediction in prokaryotes, and how techniques from multi-task learning can help to transfer gene prediction models from one species to another or to derive models that generalize well across large branches of the tree of life (Görnitz et al., NIPS, 2011).

Tiling Array-Based Transcriptomics

2006 - 2011, carried out at FML / MPI


Using whole genome tiling arrays, we studied the transcriptomes of model organisms and their dynamics. The comprehensive representation of these genomes on Affymetrix tiling arrays allowed us to monitor the expression of known genes as well as to identify new, previously uncharacterized (non-coding) genes and alternative transcript isoforms. We leveraged the tiling array technology in conjunction with machine learning methods for data analysis to i) create an inventory of differential expression of known and new genes for various Arabidopsis thaliana organs and several developmental stages (Laubinger et al., Genome Biol., 2008), as well as under changing environmental conditions (Zeller et al., Plant J., 2009); ii) obtain a detailed expression map for Caenorhabditis elegans tissues and organs at cellular resolution (Spencer et al., Genome Res., 2011) as part of the modENCODE project, which aims to gain a detailed understanding of the functional elements encoded in the worm genome (Gerstein et al., Science, 2010); and iii) characterize on a global scale the transcriptome changes resulting from deficiencies in regulators of mRNA capping, splicing, and biogenesis of small RNAs (Laubinger et al., PNAS, 2008; Laubinger et al., PNAS, 2010).

I developed machine learning-based methods for normalization and segmentation of tiling array data with the twofold goal of reducing probe sequence effects on hybridization intensity and accurately segmenting transcribed exons out of the intergenic/intronic background (Zeller et al., Pac. Symp. Biocomput., 2008). These normalization and segmentation methods allowed for very accurate de novo transcript identification (Laubinger et al., Genome Biol., 2008) and were crucial tools for completing the above-mentioned projects.
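The segmentation task itself can be illustrated with a deliberately simplified stand-in: a two-state hidden Markov model with Gaussian emissions, decoded with the Viterbi algorithm. This is not the published method, which is a discriminatively trained machine learning approach; the emission means, the transition probability, and the probe intensities below are assumptions chosen only for illustration.

```python
import math

def viterbi_segment(intensities, means=(1.0, 5.0), sigma=1.0, stay_prob=0.9):
    """Label each probe 0 (background) or 1 (transcribed) with a two-state HMM."""
    log_stay, log_switch = math.log(stay_prob), math.log(1.0 - stay_prob)

    def log_emit(x, state):
        # Gaussian log-likelihood up to an additive constant.
        return -0.5 * ((x - means[state]) / sigma) ** 2

    scores = [log_emit(intensities[0], s) for s in (0, 1)]
    backpointers = []
    for x in intensities[1:]:
        new_scores, pointers = [], []
        for s in (0, 1):
            options = [
                scores[p] + (log_stay if p == s else log_switch) for p in (0, 1)
            ]
            best_prev = 0 if options[0] >= options[1] else 1
            pointers.append(best_prev)
            new_scores.append(options[best_prev] + log_emit(x, s))
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def transcribed_segments(path):
    """Convert a 0/1 label path into half-open (start, end) probe index ranges."""
    segments, start = [], None
    for i, label in enumerate(path):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments

# Toy normalized log-intensities along the genome (made up).
intensities = [1.2, 0.8, 1.1, 4.9, 5.3, 5.0, 0.9, 1.0]
path = viterbi_segment(intensities)
print(path, transcribed_segments(path))
```

The high probability of staying in the same state discourages isolated label flips, so a single noisy probe is unlikely to split an exon or create a spurious one.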

A further subproject identified alternative splicing events, primarily intron retentions, from tiling array data. We approached this task as a two-step classification problem: First, a classifier was trained to discriminate between exons and introns. Subsequently, introns with a large inclusion probability in some tissue were identified as candidates for alternative splicing by a second classifier (Eichner et al., BMC Bioinf., 2011).
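The two-step logic can be sketched as follows, under strong simplifying assumptions: step one is a one-dimensional Gaussian classifier on probe intensity that yields an exon-like ("inclusion") probability, and step two is a plain cutoff on the maximum of that probability across tissues, standing in for the second classifier. The intensities, tissue names, cutoff, and function names are hypothetical and do not reproduce the published method.

```python
import math
from statistics import mean, stdev

def fit_gaussian(values):
    """Estimate the mean and standard deviation of one class."""
    return mean(values), stdev(values)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def exon_probability(x, exon_model, intron_model):
    """Step 1: posterior probability that an intensity is exon-like (equal priors)."""
    p_exon = gaussian_pdf(x, *exon_model)
    p_intron = gaussian_pdf(x, *intron_model)
    return p_exon / (p_exon + p_intron)

def retention_candidates(intron_intensities, exon_model, intron_model, cutoff=0.9):
    """Step 2 (simplified): flag introns that look exon-like in at least one tissue."""
    candidates = {}
    for intron, by_tissue in intron_intensities.items():
        probs = {
            tissue: exon_probability(x, exon_model, intron_model)
            for tissue, x in by_tissue.items()
        }
        tissue, best = max(probs.items(), key=lambda item: item[1])
        if best >= cutoff:
            candidates[intron] = tissue
    return candidates

# Toy training intensities for annotated exons and introns (made up).
exon_model = fit_gaussian([9.0, 10.0, 11.0])
intron_model = fit_gaussian([3.0, 4.0, 5.0])

# Per-tissue intensities of two hypothetical introns.
intron_intensities = {
    "intron_1": {"root": 4.2, "leaf": 3.8},
    "intron_2": {"root": 4.0, "flower": 9.5},
}
candidates = retention_candidates(intron_intensities, exon_model, intron_model)
print(candidates)
```

An intron that is expressed at exon-like levels in only one tissue, such as `intron_2` in the flower sample, is exactly the pattern expected for a tissue-specific intron retention.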

Array-Based Polymorphism Discovery

2005 - 2008, carried out at FML / MPI


Using resequencing array technology, this project aimed at characterizing common sequence polymorphisms in 20 diverse strains of the plant model organism Arabidopsis thaliana (Clark et al., Science, 2007). Based on this work, a 250k SNP chip was developed for cost-effective genotyping, which enables high-resolution genome-wide association studies in Arabidopsis thaliana (e.g., Atwell et al., Nature, 2010). A similar project catalogued and analyzed genetic variation between rice varieties (McNally et al., PNAS, 2009). My focus has been on detecting highly polymorphic regions with a machine learning technique called Hidden Markov Support Vector Machines (Zeller et al., Genome Res., 2008). Polymorphic regions typically correspond to clusters of small polymorphisms (SNPs and indels), but also include large deletions. As these types of polymorphisms are impossible or very difficult to identify with SNP calling methods, polymorphic region predictions ideally complement SNP data by adding an inventory of variations expected to have pronounced phenotypic effects when affecting genes and other functional genomic elements (Zeller et al., Genome Res., 2008).