No abstract available.
Monday, March 7th, 2016
We are currently working on developing and applying spectral learning algorithms to epigenetics data. Recently, international consortia such as ENCODE and Roadmap Epigenomics have released massive epigenetics data sets from hundreds of human cell types with the aim of interpreting Genome-wide Association Studies for many human diseases. To analyze this data, we have implemented and extensively tested spectral algorithms for HMMs in our Spectacle software and found that they have significantly improved run time and biological interpretability compared to the EM algorithm. This is particularly important when the underlying classes are highly imbalanced, a pervasive issue in biology. To model multiple cell types, we developed novel spectral algorithms for tree-structured HMMs and show that the tree model further improves our prediction of functional elements in the genome.
We introduce Salmon, a method for quantifying transcript abundance from RNA-seq reads that is extremely fast and supports rich, experiment-specific models to reduce the effects of biases of the RNA-seq protocol. Salmon does this by combining a novel technique for mapping reads to transcripts with a dual-phase stochastic inference algorithm and a feature-rich probabilistic model. These innovations allow Salmon to obtain very accurate estimates of transcript abundance, while improving on the speed of already-fast techniques such as Sailfish.
This is joint work with Rob Patro and Geet Duggal.
Biological processes, including those involved in immune response and disease progression, are often dynamic. To model the regulatory and signaling networks that are activated as part of these systems we are developing methods to combine the abundant static regulatory, proteomic and epigenetic data with time series gene and miRNA expression data. The reconstructed networks characterize the pathways involved in the response, their time of activation, and the affected genes. I will present methods based on probabilistic graphical models and on combinatorial search algorithms for reconstructing these networks and will discuss application of the methods to study response to flu, HIV progression and to the analysis of single cell data.
SELEX-seq and HT-SELEX are sequencing-based methods for elucidating the intrinsic DNA binding specificity of transcription factor (TF) complexes at high resolution. While the amount of raw information that modern SELEX provides is unprecedented, the computational methods for building DNA recognition models (“motifs”) from these data are still far from mature. The standard is to tabulate the relative enrichment of each oligomer of a given length, for which we have developed efficient software. Unfortunately, having to use oligomer tables as an intermediate step for feature-based analysis has two key disadvantages: (i) a limited range over which readout can be analyzed, as counts decrease exponentially with footprint size; and (ii) the requirement for prior ad hoc sequence-based alignment of different oligomers. We present a new and versatile framework for motif discovery from SELEX data that overcomes these limitations. It uses a hierarchical maximum likelihood approach to fit a feature-based biophysically motivated protein-DNA recognition model directly to the raw SELEX data. This allows us to consider base and shape readout in more detail and over a larger footprint than was possible before, which we illustrate using data for the steroid hormone receptors AR and GR. We can now for the first time analyze shape readout for TFs with low binding specificity, which we demonstrate using Hox monomer data.
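The oligomer-table baseline described in this abstract (tabulating the relative enrichment of each oligomer of a given length) can be sketched roughly as follows. This is a minimal illustration, not the authors' software: the function names, the pseudocount, and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_enrichment(selected_reads, library_reads, k, pseudocount=1.0):
    """Ratio of k-mer frequencies in a selected round versus the input library.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite for k-mers that are missing from one of the two samples.
    """
    selected = count_kmers(selected_reads, k)
    library = count_kmers(library_reads, k)
    kmers = set(selected) | set(library)
    n_selected = sum(selected.values()) + pseudocount * len(kmers)
    n_library = sum(library.values()) + pseudocount * len(kmers)
    return {
        kmer: ((selected[kmer] + pseudocount) / n_selected)
        / ((library[kmer] + pseudocount) / n_library)
        for kmer in kmers
    }


# Toy reads, invented for illustration only.
library_reads = ["ACGTAC", "TTGGCA", "GGCATC"]
selected_reads = ["TGACAT", "ATGACG", "TGACCG"]
enrichment = relative_enrichment(selected_reads, library_reads, k=4)
top_kmer = max(enrichment, key=enrichment.get)
print(top_kmer, round(enrichment[top_kmer], 3))
```

The sketch also makes limitation (i) concrete: the number of possible k-mers grows as 4^k, so for a fixed number of reads the count per k-mer shrinks quickly as the footprint length k increases.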
Single cell RNA-sequencing reveals the differences in gene and exon expression levels across individual cells. In particular, recent studies showed considerable differences in the distributions of reads from different cells for the same gene. This variation of isoform usage across single cells was not observed from bulk RNA-seq data. We seek to quantify this variation, understand the sources of the variation, and identify the patterns of the differences in isoform usage. To quantify the variation, we have developed a profile-variation (PV) score for each gene while accounting for various confounding factors in the data, and this score allows us to extract genes with highly variable read density profiles across cells.
Based on the PV score we can study the sources of the transcript variation. Gene Ontology analysis of genes with high PV reveals two levels in the isoform variation in terms of gene functions. As we analyzed data sets from different cell types, we found that the first level of functions is common to all cell types, whereas the second level of functions is cell type specific, for example, immunology related functions in activated T helper cells. We further studied the patterns of the isoform usage across cells. Although we found genes which switch isoforms between cell types, they do not switch in a correlated manner, showing high stochasticity in isoform generation in single cells. Finally, we show that applying our PV score on single cell RNA-seq data finds genes which are not detected on bulk RNA-seq data with traditional methods to be differentially spliced, and these genes potentially represent the gradual change from one cell type to another.
Tuesday, March 8th, 2016
Although deoxyribonuclease I (DNase) was used to probe the structure of the nucleosome in the 1960s and 70s, in the current high-throughput sequencing era, DNase has mainly been used to study genomic regions where nucleosomes are absent. Here, we show that DNase can be used to precisely map the (translational) positions of in vivo nucleosomes genome-wide. Specifically, exploiting a distinctive DNase cleavage profile within nucleosome-associated DNA, we develop a Bayes-factor–based method that can be used to map nucleosome positions along the genome. Compared to methods that require genetically-modified histones, our DNase-based approach is easily applied in any organism, which we demonstrate by producing maps in yeast and human. Compared to MNase-based methods that map nucleosomes based on cuts in linker regions, we utilize DNase cuts both outside and within nucleosomal DNA; the oscillatory nature of the DNase I cleavage profile within nucleosomal DNA enables us to identify translational positioning details not apparent in MNase digestion of linker DNA. Because the oscillatory pattern corresponds to nucleosome rotational positioning, it also reveals the rotational context of transcription factor (TF) binding sites. We show that potential binding sites within nucleosome-associated DNA are often centered preferentially on an exposed major or minor groove. This preferential localization may modulate TF interaction with nucleosome-associated DNA as TFs search for binding sites.
High-throughput sequencing technologies are now allowing us to interrogate intermediate layers of gene expression from nascent transcription to translation. At the same time, new sequencing protocols can help to determine where RNA-binding proteins interact with target transcripts and control these different layers of post-transcriptional gene regulation. These new protocols require and motivate dedicated computational approaches to analyze the resulting noisy data. I present our recent and ongoing projects to identify and analyze interactions of RNA-binding proteins and ribosomes on coding and non-coding transcripts.
Our current knowledge of genome function is the result of sequence-based data in the form of one-dimensional strings of letters. However, DNA-binding proteins recognize the double helix as a three-dimensional object. Therefore, an understanding of transcription factor (TF) binding specificity must ultimately include DNA shape. The sequence-structure relationship in DNA is highly degenerate, and different nucleotide sequences can give rise to the same structure, while single nucleotide sequence variants sometimes change DNA shape over a region of several base pairs. To explore these effects on a genomic scale, we developed a method for the high-throughput prediction of DNA shape features. We used these structural features to augment nucleotide sequence in binding specificity models derived from statistical machine learning approaches. Based on data derived from high-throughput binding assays for many TFs from diverse protein families, we demonstrated that shape-augmented models are generally more efficient than existing sequence models in terms of accuracy, number of features, and computation time. Our models provide information on the importance of specific DNA sequence and shape features and thus reveal TF family-specific readout mechanisms and better explain why a given TF binds in vivo to a specific genomic target site.
We present novel deep learning frameworks capable of learning jointly from raw DNA sequence and diverse functional genomic profiling experiments to learn fundamental predictive relationships between regulatory sequence, chromatin architecture, chromatin state and transcription factor binding. Recently, the ATAC-seq assay was developed to simultaneously profile chromatin accessibility and architecture of regulatory elements from low input samples based on direct in vitro transposition of sequencing adaptors into native chromatin. We train multi-task, multi-modal deep convolutional neural networks (CNNs) on a novel 2D representation of ATAC-seq data that leverages subtle patterns in insert-size distributions to simultaneously predict multiple histone modifications, combinatorial chromatin state and binding sites of a key insulator protein (CTCF) with high accuracy. Models trained on related assays such as DNase-seq and MNase-seq data also achieve high performance genome-wide and across cell-types supporting a fundamental predictive mapping between local chromatin architecture and chromatin state. We develop novel feature importance scores and visualization methods to extract biologically meaningful predictive patterns from deep neural networks. We further present new deep hybrid architectures consisting of convolutional and recurrent layers to predict in-vivo transcription factor binding events and learn regulatory sequence grammars from raw DNA sequence and chromatin accessibility profiles across cell types and tissues. Our methods potentially enable detailed characterization of context-specific regulatory landscapes from low input samples of rare cell types using a single assay.
The Chromosome Conformation Capture technique (Hi-C) provides comprehensive information about frequencies of spatial interactions between genomic loci. Inferring the 3D organization of chromosomes from these data is a challenging biophysical problem. We develop a top-down approach to biophysical modeling of chromosomes. Starting with a minimal set of biologically motivated interactions, we build ensembles of polymer conformations that can reproduce major features observed in Hi-C experiments. I will present our work on modeling the organization of human metaphase and interphase chromosomes. Our work suggests that active processes of loop extrusion can be a universal mechanism responsible for formation of domains in interphase and chromosome compaction in metaphase.
Mutations in enhancers can lead to a wide range of phenotypes, including Mendelian disease; however, we are currently limited in predicting the phenotypic impact of these mutations. With whole-genome datasets becoming commonly available, we need to obtain a better understanding of the functional consequences of nucleotide variants in enhancer sequences. Here, we will present the SHH limb enhancer, also termed the zone of polarizing activity (ZPA) regulatory sequence (ZRS), as a case study. Several labs including ours have detected mutations in this enhancer that can lead to various limb malformations. Point mutations in this enhancer usually cause polydactyly and triphalangeal thumb, but there are other specific single nucleotide changes in the ZRS that cause a more severe limb phenotype and nucleotide variants that don’t lead to an observable phenotype. Our current tools, both computational and functional, are limited in their ability to predict the phenotypic impact of a novel mutation in this enhancer. Using massively parallel reporter assays (MPRAs) combined with computational tools, we are attempting to address this problem. By designing MPRAs to learn regulatory grammar or to carry out saturation mutagenesis of every possible nucleotide change in the ZRS and other disease-causing enhancers, we are increasing our understanding of the phenotypic consequences of enhancer mutations.
Wednesday, March 9th, 2016
Many eukaryotic transcription factor (TF) families comprise multiple members with highly similar DNA-binding specificity. A fundamental problem in modeling eukaryotic gene regulatory networks is identifying and modeling factor-specific differences of TF homologs. High-throughput (HT) biochemical approaches for measuring protein-DNA binding provide the rich datasets needed to identify TF-specific preferences. I will present work using protein-binding microarrays (PBMs) to characterize DNA-binding preferences of TF homologs. I will discuss computational approaches and challenges to identify TF-specific binding preferences from PBM datasets. Finally, I will discuss approaches we are using to understand and model homolog-specificity in gene regulatory networks.
No abstract available.
No abstract available.
Telomeres protect the chromosome ends and play important roles in aging and cancer development. We have systematically screened libraries of the yeast Saccharomyces cerevisiae for mutants with altered telomere length. Our work uncovered ~400 TLM (telomere length maintenance) genes responsible for a strict telomere length homeostasis. These genes, most of which are evolutionarily conserved, span a broad range of functional categories and different cellular compartments. Further work followed both “vertical” (Molecular Biology) and “horizontal” (Systems Biology) approaches. The “vertical” approach aims to explore the role of individual genes in telomere length maintenance using genetic, molecular biology and biochemical methodologies. In the “horizontal” approach a bird’s eye view of the system is obtained by combining molecular and systems biological methods. We have started to chart the cellular network underlying telomere length, revealing a complex set of genetic interactions responsible for the very tight length homeostasis. In addition, we have found that environmental cues can affect telomere length and we have started to investigate the interface between this intricate genetic network and environmental signals that affect telomere length. Thus, for the first time, it is possible to study the interface between genome and environment (nature and nurture) in a system in which almost all the genetic “players” are known, and the environment affects them.
Thursday, March 10th, 2016
In the study of gene regulation, the majority of computational models developed historically ignore the spatial position of genes, as we had almost no information on chromosome structure in living cells at the resolution required for models of gene expression regulation. In recent years, several different experimental procedures based on chromosome conformation capture have been developed to probe the contacts between chromosome regions and, combined with next-generation sequencing, they have given us unprecedented insight into the relative distances between various parts of chromosomes in different cell types. From the computational point of view, the chromosome contact matrices pose multiple challenges and interesting problems: from data normalization and statistical testing of contact significance to the more complicated questions regarding the modular structure of regulatory domains. I will talk about two computational approaches, SHERPA (Simple HEuRistic Pearson Aggregation) and OPPA (Optimal PCA-like Pearson Aggregation), that aim at finding the optimal division of chromosomes into a hierarchical domain structure, and I will give examples where such an approach gives better results than the classical division into a flat topological domain structure.
The Encyclopedia of DNA Elements (ENCODE) Consortium has generated tens of thousands of high-throughput genomic datasets with the goal of cataloging all of the functional elements of the human genome. Now, our goal is to integrate these complex data types to annotate regulatory elements such as enhancers and create an encyclopedia of elements for the human and mouse research communities.
We began by analyzing enhancer prediction methods. We tested many different models incorporating data such as DNase-seq, histone mark ChIP-seq, and DNA methylation. We evaluated our methods using experimentally validated enhancer regions from the VISTA enhancer database on four embryonic mouse tissues: limb, hindbrain, midbrain, and neural tube. Overall, the best performing method was centering predictions on DNase peaks and ranking these peaks by the average rank of DNase and H3K27ac signal. We then applied this method to all mouse and human cell types in ENCODE.
After identifying candidate enhancers, we next sought to identify the target genes of these regions. In order to evaluate different methods, we created training/validation/test datasets from promoter capture Hi-C datasets in GM12878. We began by analyzing correlation-based methods, where enhancer-gene links are predicted by high correlation of DNase or H3K27ac signal across multiple cell types. While these methods have previously been used in the literature, we found that they performed poorly (AUROC=0.6, AUPR=0.06). We then decided to use a Random Forest based approach which would incorporate additional data such as distance between the gene and enhancer, average DNase and H3K27ac signals, as well as correlation. Though this model had a substantial increase in performance (AUROC=0.78, AUPR=0.16), there is still a great deal of improvement that can be made. We hope to add additional features to our model as well as find the best performing model with limited features that can be applied across many different ENCODE cell types.
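The enhancer ranking rule reported in this abstract (rank DNase peaks by the average of their DNase rank and their H3K27ac rank) can be illustrated with a short sketch. The peak names, signal values, and tie handling below are assumptions for the example, not ENCODE data or the authors' pipeline.

```python
def ranks_descending(values):
    """Rank values so the largest gets rank 1; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def rank_peaks(peaks):
    """Sort peaks by the average of their DNase and H3K27ac ranks (best first)."""
    dnase_ranks = ranks_descending([p["dnase"] for p in peaks])
    h3k27ac_ranks = ranks_descending([p["h3k27ac"] for p in peaks])
    scored = [
        (p["name"], (d + h) / 2)
        for p, d, h in zip(peaks, dnase_ranks, h3k27ac_ranks)
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical signal values at four DNase peaks.
peaks = [
    {"name": "peak_a", "dnase": 12.0, "h3k27ac": 3.0},
    {"name": "peak_b", "dnase": 30.0, "h3k27ac": 14.0},
    {"name": "peak_c", "dnase": 18.0, "h3k27ac": 9.5},
    {"name": "peak_d", "dnase": 5.0, "h3k27ac": 11.0},
]
ranked = rank_peaks(peaks)
for name, average_rank in ranked:
    print(name, average_rank)
```

Working with ranks rather than raw signal means the two assays contribute equally regardless of their scales, which is one plausible reason such a simple rule can be competitive.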
Recent technological advancements allow the measurement of protein binding to thousands of DNA or RNA probes on a single microarray. Since the space on the array is limited, the challenge is how to efficiently generate a minimum-size set of sequences that together cover all k-mers. In this talk, we will first introduce de Bruijn graphs and their applications in efficient coverage of DNA k-mers. Then, we will describe a generalization of the problem, in which the sequences are required to have certain properties (e.g., unstructured RNAs). We will prove that in this formalization the problem is NP-hard and give an (infeasible) approximation algorithm. We will present a heuristic based on random walks in de Bruijn graphs which works well in practice. If time allows, we shall discuss questions arising in the analysis of novel high-throughput in vitro methods for motif discovery.
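For the unconstrained version of the problem, a de Bruijn sequence covers every k-mer exactly once. The sketch below builds one from an Eulerian cycle in the de Bruijn graph using Hierholzer's algorithm. It illustrates only this classical construction, not the constrained generalization or the random-walk heuristic from the talk, and the function names are hypothetical.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Return a cyclic sequence of length len(alphabet)**k covering each k-mer once.

    Nodes of the de Bruijn graph are (k-1)-mers; each edge appends one letter
    and corresponds to one k-mer. An Eulerian cycle uses every edge once.
    """
    if k == 1:
        return "".join(alphabet)
    # Unused outgoing edge labels for every (k-1)-mer node.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    stack = [alphabet[0] * (k - 1)]
    cycle = []
    # Hierholzer's algorithm, iterative form.
    while stack:
        node = stack[-1]
        if unused[node]:
            letter = unused[node].pop()
            stack.append(node[1:] + letter)
        else:
            cycle.append(stack.pop())
    cycle.reverse()
    # Each edge contributes the last letter of the node it leads to.
    return "".join(node[-1] for node in cycle[1:])


def covering_string(alphabet, k):
    """Linearize the cyclic sequence by repeating its first k-1 letters."""
    cyclic = de_bruijn_sequence(alphabet, k)
    return cyclic + cyclic[: k - 1]


probe = covering_string("ACGT", 3)
kmers = {probe[i:i + 3] for i in range(len(probe) - 2)}
print(len(probe), len(kmers))
```

The linear string has length 4^k + k - 1, which is the shortest possible for a single sequence containing all DNA k-mers; splitting it into array-sized probes requires overlapping the pieces by k - 1 letters.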
Gaussian processes provide a convenient and flexible class of non-parametric model for temporal and spatial data. We are applying Gaussian processes in a range of biological applications involving high-throughput time course data, e.g. modeling the elongation dynamics of polymerase, uncovering mRNA production delays, inferring regulatory networks and most recently identifying perturbations and bifurcations from high-throughput expression data. I will provide an overview of Gaussian process inference and describe some of our recent work in modeling gene expression dynamics.
The position weight matrix (PWM) model of binding site motifs of transcription factors specifies a multinomial distribution of sequences that has only one dominating seed sequence. To make the model more accurate, one can use several seeds and also utilize the fact that the transcription factors not only bind to DNA but also to each other, forming dimeric and higher order regulatory complexes. Moreover, the internal dependencies possibly present within the motif should be represented by the model. The talk will describe developments in modeling and predicting binding motifs, using multiseed models that include mixtures of monomeric and dimeric PWMs, and are learned from large sequence sets.
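To make the single-seed limitation concrete, the sketch below scores a site under one PWM (a product of per-position probabilities) and under a two-component mixture of PWMs. The matrices, weights, and function names are invented for illustration; dimeric components and within-motif dependencies, which the talk also covers, are not modeled here.

```python
from itertools import product

BASES = "ACGT"


def make_column(dominant, p=0.7):
    """One PWM column: probability p for the dominant base, the rest shared equally."""
    rest = (1.0 - p) / 3
    return {base: (p if base == dominant else rest) for base in BASES}


def pwm_probability(pwm, site):
    """Probability of a site under a PWM: positions are treated as independent."""
    prob = 1.0
    for column, base in zip(pwm, site):
        prob *= column[base]
    return prob


def consensus(pwm):
    """The single dominating seed sequence of a PWM."""
    return "".join(max(column, key=column.get) for column in pwm)


def mixture_probability(components, site):
    """Probability under a multiseed model: a weighted sum over component PWMs."""
    return sum(weight * pwm_probability(pwm, site) for weight, pwm in components)


# Two hypothetical length-3 motifs with different seeds.
pwm_a = [make_column(base) for base in "ACG"]
pwm_b = [make_column(base) for base in "TTA"]
components = [(0.6, pwm_a), (0.4, pwm_b)]

for site in ("ACG", "TTA"):
    print(site, pwm_probability(pwm_a, site), mixture_probability(components, site))
```

Under `pwm_a` alone the second seed is very unlikely, while the mixture assigns substantial probability to both seeds and still sums to one over all sites.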
No abstract available.