No abstract available.
Monday, April 11th, 2016
Biological networks of an organism show how different biochemical entities, such as enzymes or genes, interact with each other to perform vital functions for that organism. In this talk, we will discuss the computational challenges centered on uncertainty in the topology of biological networks. We will discuss our new mathematical model, which represents probabilistic networks as collections of polynomials. We show that this is a powerful model that enables solving seemingly very tough computational problems on probabilistic networks efficiently and precisely. We will demonstrate the expressive power of this model on the signal reachability problem, which determines whether an extracellular signal travels from a membrane receptor to a reporter gene.
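To make the signal reachability problem concrete, the following is a minimal sketch that computes the probability that a receptor reaches a reporter gene in a tiny probabilistic network. It assumes each edge exists independently with a given probability and simply enumerates every edge subset; it is a brute-force baseline for illustration, not the polynomial model described in the talk. The node names and probabilities are hypothetical.

```python
from itertools import product


def reachable(edges, source, target):
    """Depth-first search over a list of directed (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def reachability_probability(prob_edges, source, target):
    """Sum the probability of every edge subset in which target is reachable.

    prob_edges maps a directed edge (u, v) to its existence probability.
    Edges are assumed independent. The loop is exponential in the number
    of edges, so this only works for toy networks.
    """
    edges = list(prob_edges)
    total = 0.0
    for present in product([False, True], repeat=len(edges)):
        weight = 1.0
        kept = []
        for edge, is_present in zip(edges, present):
            p = prob_edges[edge]
            weight *= p if is_present else 1.0 - p
            if is_present:
                kept.append(edge)
        if weight > 0 and reachable(kept, source, target):
            total += weight
    return total


# Hypothetical three-edge network: one direct route and one two-step route.
toy_network = {
    ("receptor", "kinase"): 0.9,
    ("kinase", "reporter"): 0.8,
    ("receptor", "reporter"): 0.5,
}
signal_probability = reachability_probability(toy_network, "receptor", "reporter")
```

In this toy case the two routes are edge-disjoint, so the result can be checked by hand as 1 - (1 - 0.5)(1 - 0.9 × 0.8) = 0.86.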
Large-scale biological networks map functional relationships between most genes in the genome and can potentially uncover high level organizing principles governing cellular functions. Despite the availability of an incredible wealth of network data, our current understanding of their functional organization is very limited and essentially opaque to biologists. To facilitate the discovery of functional structure and advance its biological interpretation, we developed a systematic quantitative approach to determine which functions are represented in a network, which parts of the network they are associated with and how they are related to one another. Our method, named Spatial Analysis of Functional Enrichment (SAFE), detects network regions that are statistically overrepresented for a functional group or a quantitative phenotype of interest, and provides an intuitive visual representation of their relative positioning within the network. Using SAFE, we examined the most recent genetic interaction network from budding yeast Saccharomyces cerevisiae, which was derived from the quantitative growth analysis of over 20 million double mutants. By annotating the genetic interaction network with GO biological process, protein localization and protein complex membership data, SAFE showed that the network is structured hierarchically and reflects the functional organization of the yeast cell at many different levels of resolution. In addition, we analyzed the network using a large-scale chemical genomics dataset and generated a global view of the yeast cellular response to chemical treatment. This view recapitulated the known modes-of-action of chemical compounds and identified a potentially novel mechanism of resistance to the anti-cancer drug bortezomib. Our results demonstrate that SAFE is a powerful tool for annotating biological networks and a unique framework for understanding the global wiring diagram of the cell.
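The idea of testing network regions for functional overrepresentation can be sketched as follows. This is an illustrative simplification, not the SAFE implementation: it assumes an unweighted network, defines a node's neighborhood as all nodes within a fixed number of hops, and scores enrichment with a one-sided hypergeometric test. The function names and the toy network are hypothetical.

```python
from collections import deque
from math import comb


def neighborhood(adjacency, start, radius):
    """Return all nodes within `radius` hops of `start` (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


def enrichment_p_value(population, annotated_total, drawn, annotated_drawn):
    """P(X >= annotated_drawn) under the hypergeometric distribution."""
    upper = min(annotated_total, drawn)
    hits = sum(
        comb(annotated_total, i) * comb(population - annotated_total, drawn - i)
        for i in range(annotated_drawn, upper + 1)
    )
    return hits / comb(population, drawn)


def neighborhood_enrichment(adjacency, annotated, radius=1):
    """Map each node to the enrichment p-value of its local neighborhood."""
    population = len(adjacency)
    p_values = {}
    for node in adjacency:
        region = neighborhood(adjacency, node, radius)
        overlap = len(region & annotated)
        p_values[node] = enrichment_p_value(
            population, len(annotated), len(region), overlap
        )
    return p_values


# Hypothetical path network a-b-c-d-e-f; genes a, b, c share an annotation.
toy_adjacency = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
}
p_values = neighborhood_enrichment(toy_adjacency, annotated={"a", "b", "c"})
```

Nodes whose neighborhoods are dominated by the annotated genes receive small p-values, so plotting these values over a network layout gives a rough picture of where a function is concentrated.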
Cellular processes are largely controlled by the protein-protein and protein-DNA interactions that define them. While conservation of common protein domains can indicate which proteins are likely to engage in these interactions, how they determine what partners to interact with is a much more complicated question. Many experimental techniques have been developed to answer this question; however, many are biased towards high-affinity interactions, are labor intensive, or require specialized equipment or expertise. To address these limitations, we are expanding the application of a simple bacterial hybrid assay that employs multiple reporters simultaneously. By normalizing the output of a test reporter to the presence of a secondary reporter, we are able to return outputs that are strongly correlated with the affinity of the test interaction. We have applied this approach to measure both protein-DNA and protein-protein interactions, recovering signal above background for known, low-affinity interactions that are often missed by common methodology. We hope that continued development of this platform will allow us to harness the 10^9 transformation efficiency of bacteria and screen large libraries to capture the low end of affinity while providing affinity-informed specificities.
We are interested in the causes of bronchopulmonary dysplasia (BPD), a respiratory complication of preterm birth whose etiology is the subject of ongoing debate. Molecular causes of this disorder, and their potential relationship with lifelong respiratory health, are relatively unexplored. We consider the problem of identifying molecular pathways implicated in BPD and two pulmonary disorders affecting patients at later life stages (asthma and COPD). In this talk, we will define the notion of "pathway centrality" in a molecular network and demonstrate how this concept can be used to find pathways potentially mediating observed expression changes in pulmonary disorders. Our observations identify common molecular pathways and processes between all three disorders, generate novel hypotheses, and highlight developmental delays that may contribute to BPD. A temporal modeling technique based on outlier detection methods lends additional support to the developmental delay hypothesis.
Tuesday, April 12th, 2016
Rapid advances in high-throughput technologies, including next-generation sequencing, proteomics, and metabolomics, are providing exceptionally detailed descriptions of the molecular changes that occur in diseases. However, it is difficult to use these data to reveal new therapeutic insights for several reasons. Despite their power, each of these methods still only captures a small fraction of the cellular response. Moreover, when different assays are applied to the same problem, they provide apparently conflicting answers. I will show how specific network modeling approaches reveal the underlying consistency of the data by identifying small, functionally coherent pathways linking the disparate observations. These patient-specific networks may provide critical insights for targeted therapies.
Identification and prioritization of molecular alterations that potentially act as drivers of cancer remains a crucial challenge in cancer genomics and a bottleneck in therapeutic development. The problem is particularly complicated by the extensive mutational heterogeneity observed in cancer (sub)types, which yields a long-tailed distribution of mutated genes across patients, possibly implying the existence of many private drivers. To address this problem, we have developed HIT’nDRIVE, a combinatorial algorithm that integrates genomic and transcriptomic (expression) data to identify patient-specific gene alterations that can collectively influence the dysregulated transcriptome of the patient. HIT’nDRIVE aims to solve the “random-walk facility location” (RWFL) problem on a gene/protein interaction network. RWFL differs from the standard facility location problem in its distance measure: it uses “hitting time”, the expected minimum number of hops in a random walk originating from any sequence-altered gene (i.e., a potential driver) to reach an expression-altered gene. Interestingly, when hitting time is used as the distance measure, the distance between multiple facilities and a “target” is not the minimum of the individual distances. HIT’nDRIVE reduces RWFL (with multi-hitting time as the distance) to a weighted multi-set cover problem, which it solves as an integer linear program (ILP). Applying HIT’nDRIVE to 2200 (TCGA) tumors from four major cancer types has revealed many potentially druggable driver genes, several of which happen to be private. It is also possible to perform accurate phenotype prediction for these samples by using only HIT’nDRIVE-implied driver genes and their “network modules of influence” (subnetworks involving each driver gene where the aggregate expression profile correlates well with the cancer phenotype) as features, providing additional evidence that these genes may be driving the cancer phenotype. Further analysis of these modules reveals patterns of mutual exclusivity among multiple driver genes modulating oncogenic or metabolic networks.
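As a small illustration of hitting time as a distance measure, the sketch below computes the expected number of steps a simple random walk needs to first reach a target node from every other node. It assumes an undirected, connected, unweighted network and uses a plain fixed-point iteration; it covers only single-source hitting time, not the multi-hitting time or the ILP formulation used by HIT’nDRIVE. The function name and the toy network are hypothetical.

```python
def hitting_times(adjacency, target, tol=1e-12, max_iter=100000):
    """Expected number of random-walk steps to first reach `target`.

    Solves h(target) = 0 and h(u) = 1 + mean(h(v) for v adjacent to u)
    by repeated in-place updates. Assumes every node has at least one
    neighbor and that the target is reachable from every node.
    """
    h = {node: 0.0 for node in adjacency}
    for _ in range(max_iter):
        largest_change = 0.0
        for node, neighbors in adjacency.items():
            if node == target:
                continue
            updated = 1.0 + sum(h[v] for v in neighbors) / len(neighbors)
            largest_change = max(largest_change, abs(updated - h[node]))
            h[node] = updated
        if largest_change < tol:
            break
    return h


# Hypothetical path network: driver_a - driver_b - target_gene.
toy_adjacency = {
    "driver_a": ["driver_b"],
    "driver_b": ["driver_a", "target_gene"],
    "target_gene": ["driver_b"],
}
times = hitting_times(toy_adjacency, "target_gene")
```

On this path the values can be verified by hand: the middle node needs 3 expected steps and the far end needs 4, even though their shortest-path distances are 1 and 2. This shows how hitting time penalizes routes along which the walk can wander away from the target.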
Wednesday, April 13th, 2016
The ENCODE project, via generation of unprecedented transcriptomic and epigenomic profiles, has revealed a complex layer of transcriptional regulation mediated by distal regulatory enhancers distributed throughout the human genome. These data open up more questions than they answer. We will discuss the nuclear organization of these enhancers, how they can be used for better genotype-phenotype associations, and their potential emergent properties by virtue of their spatial proximity.
Non-coding variants implicated in genome-wide association studies (GWAS) are enriched in enhancer elements active in disease-relevant cellular contexts. Identifying context-specific target genes and downstream pathways affected by enhancers harboring regulatory variants remains a challenge. We develop novel learning algorithms that leverage the modular dynamics of gene expression and enhancer-associated chromatin marks across a vast collection of diverse human cell types and tissues from the ENCODE and Roadmap Epigenomics Projects to infer highly connected, context-specific enhancer-gene networks. Chromatin conformation maps and expression QTLs validate the superior accuracy and tissue-specificity of our predicted networks compared to existing approaches. We find that a significant proportion of enhancers do not associate with their nearest genes, indicating pervasive distal regulation potentially mediated by long-range chromatin contacts. Linked enhancers significantly improve tissue-specific regression models of gene expression. Distal co-association of regulatory sequence motifs suggests synergistic regulation of genes by multiple enhancers, with a key role for protein-protein interactions between lineage-specific transcription factors in mediating enhancer-promoter interactions. Networks of cooperating enhancers with shared motif composition and target genes are depleted of disease-associated variants, suggesting regulatory buffering mechanisms. We demonstrate the utility of our context-specific enhancer-gene links to predict putative target genes, biological processes and pathways of non-coding variants associated with diverse traits and diseases.
In network biology, a cell is commonly described as a gene regulatory network, and a cell type is accordingly modeled as a state-dependent system over that network. Hence, understanding the topological structure of gene regulatory networks plays a crucial role in uncovering the biology of cell types. The talk will cover our recent work on the topological structures and dynamics of cell-specific regulatory networks.
Understanding and predicting the phenotypic effects of gene copy number variations is important for understanding the way in which cells buffer expression changes and for disease studies. Genetic alterations propagate through the molecular system, disrupting biological activities within cells. In particular, it has become clear that deviations from normal gene dosage are associated with multiple disorders in a range of species, including humans. Genome-wide expression profiling of Drosophila melanogaster deficiency heterozygotes reveals diverse genomic responses. We have systematically examined deficiencies on the left arm of chromosome 2 and (i) characterized gene-by-gene dosage responses/compensations, (ii) assessed their impact on the gene network, (iii) assessed their impact on expression noise, and (iv) developed methods to utilize these data to study TF-gene regulation. We show that, surprisingly, expression noise was increased by gene dosage compensation, a property of gene deletions that could contribute to the phenotypic heterogeneity of diseases associated with haploinsufficiency. Additionally, we show that both expression changes and expression variations associated with a reduced dose of transcription factors propagate through the gene interaction network, impacting a large number of downstream genes. Finally, we utilized our data to learn new regulatory interactions via a new iterative algorithm called Rewire Network Component Analysis (Rewire_NCA) that we developed for this purpose.
The fission yeast Schizosaccharomyces pombe has more metazoan-like features than the budding yeast Saccharomyces cerevisiae with similarly facile genetics. Yet, it is significantly under-studied with little functional genomic information available. Here, we screened the whole fission yeast proteome three times (>75 million protein pairs) to generate the first high-coverage high-quality binary interactome network for S. pombe, FissionNet, comprising ~2300 interactions among ~1300 proteins. ~50% of these interactions were previously not reported in any species. FissionNet unravels previously unreported interactions implicated in processes such as gene silencing and pre-mRNA splicing. We developed a rigorous network comparison framework that accounts for assay sensitivity and specificity, revealing extensive species-specific network rewiring between fission yeast, budding yeast, and human. Surprisingly, although genes are better conserved between the yeasts, S. pombe interactions are significantly better conserved in human than in S. cerevisiae. Our framework also reveals that different modes of gene duplication influence the extent to which paralogous proteins are functionally repurposed. Finally, cross-species interactome mapping demonstrates that coevolution of interacting proteins is remarkably prevalent, a result with important implications for studying human disease in model organisms. Overall, FissionNet is a valuable resource for understanding protein functions and their evolution.
In this work we shift the focus of two common biological network problems: the global network alignment problem and the problem of differential analysis. We do so by moving away from identifying local structural similarities or differences and instead embedding the networks into a continuous metric space based on function. We introduce a new solution, CANDL (Coarsely Aligning Networks with Diffusion and Landmarks). Unlike previous methods that seek to conserve local motifs, this technique focuses on finding coherent, functionally related groups of genes across species. In the second part of the talk, we show that this functional embedding allows for comparisons across networks that are concerned with differences, not just similarities.
Thursday, April 14th, 2016
In January of this year, the number of publicly available gene expression assays topped 1.9 million. Near the time of this workshop, there will be 2 million samples available. Our lab is developing algorithms to integrate these data into models of the underlying biological systems that can be used to discover the pathways and processes that play roles in cells' responses to their environment. One of the methods that we've developed, ADAGE, adapts techniques from deep learning to perform unsupervised extraction of co-regulated modules from noisy publicly available data. Once trained, the ADAGE model can be applied to newly generated data to reveal the pathways altered by a newly performed experiment. This analysis, the output of which resembles a pathway analysis from commonly used software, is unsupervised and entirely data-driven. This means that the technique can be applied to systems for which gene expression data exist but no curated knowledge bases are available. Subsampling analysis suggests that there are currently about 150 organisms for which enough data exist to construct an ADAGE model, and for many of these, curated knowledge bases are unavailable or limited to homology-transferred annotations. In addition to continuing methodological developments, we are also developing the software infrastructure to provide data-driven pathway analysis for this set of organisms.
Protein networks are increasingly used to enrich our knowledge about disease by integrating diverse information sources such as sequence and expression data into one computational framework. In this talk I will describe two recent works that use network propagation to associate novel genes and modules with disease. I will demonstrate how the propagation methodology allows processing raw mutation and expression signals to infer disease components that cannot be readily revealed from the measured molecular data.
This is joint work with the labs of Mehmet Koyuturk and Erich Wanker.
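Network propagation is commonly formulated as a random walk with restart, and the following sketch shows that formulation on a toy network. It is a generic illustration under stated assumptions (undirected, unweighted network; degree-normalized spreading; restart probability `1 - alpha`), not the specific methods presented in the talk. The function name, parameters, and gene labels are hypothetical.

```python
def propagate(adjacency, seed_scores, alpha=0.5, iterations=200):
    """Random walk with restart on an undirected, unweighted network.

    Each step, a node passes a fraction `alpha` of its score to its
    neighbors in equal shares, and a fraction `1 - alpha` of the total
    score is returned to the seed nodes. Assumes every node has at
    least one neighbor and that seed_scores sums to 1.
    """
    prior = {node: seed_scores.get(node, 0.0) for node in adjacency}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            incoming = sum(scores[v] / len(adjacency[v]) for v in neighbors)
            updated[node] = (1.0 - alpha) * prior[node] + alpha * incoming
        scores = updated
    return scores


# Hypothetical path network with a single mutated seed gene.
toy_adjacency = {
    "gene_a": ["gene_b"],
    "gene_b": ["gene_a", "gene_c"],
    "gene_c": ["gene_b"],
}
scores = propagate(toy_adjacency, {"gene_a": 1.0})
```

Genes that carry no direct signal, such as `gene_c` here, still receive a nonzero score through their network neighbors, which is what allows propagation to implicate genes that the raw measurements alone would miss.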
The majority of the current methods for analyzing systems-level PPI networks deal with their static representations, due to limitations of biotechnologies for PPI collection, even though cellular functioning is dynamic. For this reason, and because different data types can give complementary biological insights, we integrate current static PPI network data with aging-related gene expression data to computationally infer dynamic, age-specific PPI networks. Then, we apply a series of sensitive measures of network topology to the dynamic PPI network data to study cellular changes with age. For example, we apply a graphlet-based measure of local network position (or centrality) of a node; graphlets are small connected induced subgraphs. By doing so, we find that while global PPI network topologies do not significantly change with age, local topologies (i.e., network centralities) of a number of genes do. We predict such genes to be key players in the processes of aging [1]. We demonstrate the credibility of our predictions by: 1) observing significant overlap between our predicted aging-related genes and known "ground truth" aging-related genes; 2) observing significant overlap between functions and diseases that are enriched in our aging-related predictions and those that are enriched in the "ground truth" data; 3) providing evidence that diseases which are enriched in our aging-related predictions are linked to human aging; and 4) validating our high-scoring novel predictions in the literature.
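A minimal sketch of the two ingredients described above, under simplifying assumptions: an age-specific network is taken to be the subnetwork induced by the genes expressed at that age (the abstract does not specify the inference rule), and a node's local position is summarized by its counts in the three node orbits of 3-node graphlets. The gene names and expression sets are hypothetical.

```python
from itertools import combinations


def age_specific_network(adjacency, expressed):
    """Keep only interactions whose two genes are both expressed (assumed rule)."""
    return {
        gene: adjacency[gene] & expressed
        for gene in adjacency
        if gene in expressed
    }


def three_node_orbits(adjacency):
    """Map each node to (path_end, path_middle, triangle) graphlet counts.

    Counts are over induced 3-node subgraphs: the node at the end of a
    2-edge path, in the middle of a 2-edge path, or in a triangle.
    """
    orbits = {}
    for node, neighbors in adjacency.items():
        path_end = sum(
            1
            for v in neighbors
            for w in adjacency[v]
            if w != node and w not in neighbors
        )
        path_middle = 0
        triangle = 0
        for v, w in combinations(sorted(neighbors), 2):
            if w in adjacency[v]:
                triangle += 1
            else:
                path_middle += 1
        orbits[node] = (path_end, path_middle, triangle)
    return orbits


# Hypothetical static network: triangle a-b-c with a pendant gene d on c.
static_network = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
}
young = three_node_orbits(age_specific_network(static_network, {"a", "b", "c", "d"}))
old = three_node_orbits(age_specific_network(static_network, {"a", "c", "d"}))
```

Comparing `young["c"]` with `old["c"]` shows how the local position of a gene can change between ages even when the underlying static interaction data are the same.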
In systems biology, the solution space for a broad range of problems is composed of sets of functionally associated biomolecules. Since connectivity in molecular interaction networks is an indicator of functional association, such sets can be identified from connected induced subgraphs of molecular interaction networks. Applications typically quantify the relevance (e.g., modularity, conservation, disease association) of connected subnetworks using an objective function and use a search algorithm to identify sets of subnetworks that maximize this objective function. Efficient enumeration of connected subgraphs of a large graph is therefore useful for these applications, and many existing search algorithms can be used for this purpose. However, there is a lack of non-heuristic algorithms that minimize the total number of subgraphs evaluated during the search for subgraphs that maximize the objective function. In this talk, we describe and evaluate an algorithm that reduces the computations necessary to enumerate subgraphs that maximize an objective function given a monotonically decreasing bounding function.
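The role of a bounding function in pruning the search can be sketched with a simple branch-and-bound enumeration. This is an illustrative baseline, not the algorithm evaluated in the talk: the objective is assumed to be the sum of node weights, the bound adds all positive weights outside the current subgraph (so it never increases as the subgraph grows), and duplicates are avoided by remembering visited node sets. All names and weights are hypothetical.

```python
def best_connected_subgraph(adjacency, weights, max_size):
    """Find a maximum-weight connected induced subgraph with at most max_size nodes."""
    positive_total = sum(w for w in weights.values() if w > 0)
    best = {"score": float("-inf"), "nodes": frozenset()}
    seen = set()
    evaluated = 0

    def bound(nodes):
        # Upper bound on the score of any superset of `nodes`: the current
        # score plus every positive weight that is still outside the set.
        inside = sum(weights[n] for n in nodes)
        inside_positive = sum(weights[n] for n in nodes if weights[n] > 0)
        return inside + positive_total - inside_positive

    def extend(nodes):
        nonlocal evaluated
        if nodes in seen:
            return
        seen.add(nodes)
        evaluated += 1
        score = sum(weights[n] for n in nodes)
        if score > best["score"]:
            best["score"], best["nodes"] = score, nodes
        # Prune: no superset of `nodes` can beat the best score found so far.
        if len(nodes) == max_size or bound(nodes) <= best["score"]:
            return
        frontier = {v for n in nodes for v in adjacency[n]} - nodes
        for v in sorted(frontier):
            extend(nodes | {v})

    for node in sorted(adjacency):
        extend(frozenset([node]))
    return best["nodes"], best["score"], evaluated


# Hypothetical path network a-b-c-d with signed relevance weights.
toy_adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
toy_weights = {"a": 2, "b": -1, "c": 3, "d": -5}
nodes, score, evaluated = best_connected_subgraph(toy_adjacency, toy_weights, max_size=3)
```

Because the bound only decreases as nodes are added, a branch that cannot beat the current best can be discarded without enumerating any of its supersets, which is the source of the savings over exhaustive enumeration.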
Friday, April 15th, 2016