Each cell type in a solid tissue has a characteristic transcriptome and spatial arrangement, both of which are observable using modern spatial omics assays. Surprisingly, however, spatial information is frequently ignored when clustering cells to identify cell types and states. In fact, spatial location is typically considered only when solving the related, but distinct, problem of demarcating tissue domains (which could include multiple cell types). We present BANKSY, an algorithm that unifies cell type clustering and domain segmentation by constructing a product space of cell and neighborhood transcriptomes, representing cell state and microenvironment, respectively. BANKSY's spatial kernel-based feature augmentation strategy improves performance on both tasks when tested on diverse FISH- and sequencing-based spatial omics datasets. Uniquely, BANKSY identified hitherto undetected niche-dependent cell states in mouse brain. We also show that quality control of spatial omics data can be formulated as a domain identification problem and solved using BANKSY. Lastly, BANKSY is orders of magnitude faster and more scalable than existing spatial clustering methods, and thus capable of processing the large datasets generated by emerging spatial technologies. In summary, BANKSY represents an accurate, biologically motivated, scalable, and versatile framework for analyzing spatial omics data.
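The idea of a product space of cell and neighborhood transcriptomes can be illustrated with a minimal sketch. This is not the BANKSY implementation: it assumes a plain average over the k nearest spatial neighbors in place of a spatial kernel, and the function name, the mixing parameter `lam`, and the square-root weighting are all assumptions made for illustration.

```python
import math


def augment_with_neighborhood(expr, coords, k=2, lam=0.2):
    """Append a neighborhood-average expression block to each cell's own expression.

    expr   : list of per-cell expression vectors (lists of floats, equal length)
    coords : list of per-cell spatial coordinates (tuples of floats)
    k      : number of nearest spatial neighbors to average (assumed, illustrative)
    lam    : hypothetical mixing weight in [0, 1]; 0 ignores the neighborhood
    """
    n = len(expr)
    if n < 2:
        raise ValueError("need at least two cells to define a neighborhood")
    own_w = math.sqrt(1.0 - lam)
    nbr_w = math.sqrt(lam)
    augmented = []
    for i in range(n):
        # Rank all other cells by Euclidean distance to cell i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        nearest = others[:k]
        # Unweighted mean expression of the nearest neighbors, gene by gene.
        nbr_mean = [
            sum(expr[j][g] for j in nearest) / len(nearest)
            for g in range(len(expr[i]))
        ]
        # Concatenate the weighted own and neighborhood blocks.
        augmented.append(
            [own_w * v for v in expr[i]] + [nbr_w * v for v in nbr_mean]
        )
    return augmented
```

The augmented vectors could then be passed to any standard clustering routine; in this sketch a small `lam` would emphasize each cell's own state, and a larger one its microenvironment.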
Tuesday, July 5th, 2022
The integration of single-cell and spatial transcriptomics provides a new approach to profile human disease pathology in situ. Here, I will introduce our work on dissecting lung alveolar damage in severe COVID-19 using a new single-cell atlas and transcriptome-wide spatial profiling of post-mortem lung tissue. First, we generated a comprehensive single-cell lung cell atlas through integration of multiple healthy and COVID-19 datasets. Second, we generated a spatially resolved transcriptomic dataset of diffuse alveolar damage (DAD) across different stages of pathology using the NanoString WTA technology. To resolve changes in cell type abundance across progressive pathology, we integrated our single-cell and spatial transcriptomic datasets. We identified dynamic sets of immune and stromal cells and tissue microenvironments that distinguish early (exudative) and late (organised) alveolar damage. Finally, we could re-map pathological phenotypes in our single-cell transcriptomic reference using pathology biomarkers identified from spatial data. Our work identifies candidate molecular and cellular targets of novel therapies for COVID-19 in the respiratory system.
Understanding the regulatory landscape of the human genome has been a long-standing goal of modern biology. Contemporary approaches identify regulatory elements using biochemical signals including epigenetic marks and transcription factor (TF) occupancy; evolutionary constraint of the resulting elements varies. The Zoonomia consortium’s 241 genomes are sufficient to achieve single-base resolution of evolutionary constraint in placental mammals. We used Zoonomia’s reference-free genome alignment and conservation score to characterize the human regulatory landscape, examining roughly one million candidate cis-regulatory elements (cCREs), 21 thousand core-promoters, and 15.6 million sites bound by 367 TFs (TFBSs). We identified a group of cCREs (439,461, occupying 4% of the human genome) and TFBSs (2,024,062; 0.8% of the genome) under mammalian constraint. Genes near constrained elements function in fundamental cellular processes like metabolism and development, and these elements yield high heritability enrichment for a panel of 69 diverse human traits. Unconstrained elements lie near genes that allow mammals to negotiate their environment (odor perception, immune response, and transposon repression), and 132 TFs are enriched in binding to genomic repeats. Our annotated elements should help interpret the regulatory landscape of the human genome.
Wednesday, July 6th, 2022
Big Data analytical techniques and AI have the potential to transform drug discovery, as they are reshaping other areas of science and technology, but we need to blend biology and chemistry in a format that is amenable to modern machine learning. In this talk, I will present the Chemical Checker (CC), a resource that provides processed, harmonized and integrated bioactivity data on small molecules. The CC divides data into five levels of increasing complexity, ranging from the chemical properties of compounds to their clinical outcomes. In between, it considers targets, off-targets, perturbed biological networks and several cell-based assays such as gene expression, growth inhibition and morphological profiles. In the CC, bioactivity data are expressed in a vector format, which naturally extends the notion of chemical similarity between compounds to similarities between bioactivity signatures of different kinds. We show how CC signatures can boost the performance of drug discovery tasks that typically capitalize on chemical descriptors, including compound library optimization, target identification and anticipation of failures in clinical trials. Moreover, we demonstrate and experimentally validate that CC signatures can be used to reverse and mimic biological signatures of disease models and genetic perturbations, options that are otherwise impossible using chemical information alone. Indeed, using bioactivity signatures we have identified small molecules able to revert transcriptional signatures related to Alzheimer's disease in vitro and in vivo, as well as compounds against Snail1, a transcription factor with an essential role in the epithelial-to-mesenchymal transition, showing that our approach might offer a new perspective to find small molecules able to modulate the activity of undruggable proteins.
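Because signatures are vectors, comparing compounds reduces to comparing vectors. The following is a minimal sketch of that idea only, not the Chemical Checker API: it assumes cosine similarity as the comparison measure, and the function names and toy compound names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length signature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, signatures):
    """Return signature names ordered from most to least similar to `query`.

    signatures : dict mapping a (hypothetical) compound name to its vector
    """
    return sorted(
        signatures,
        key=lambda name: cosine_similarity(query, signatures[name]),
        reverse=True,
    )


# Toy, made-up signatures used purely for illustration.
toy_signatures = {
    "cmpd_a": [1.0, 0.0, 1.0],
    "cmpd_b": [-1.0, 0.0, -1.0],
    "cmpd_c": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.0, 1.0], toy_signatures)
```

Under this assumed measure, the head of the ranking holds candidates that mimic the query signature and the tail holds candidates with an opposing signature, which is one simple way to read the "mimic" and "reverse" use cases described above.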
Metastasis is the primary cause of mortality in cancer. In our studies, we show that breast cancer metastasis can be prevented by limiting cell movement using small RNA treatment. We compare experimental models that reveal the pathways involved in cancer aggressiveness. We pinpoint potential diagnostic markers that dictate the course of the disease. Overall, large data analysis and experimental studies assist in better understanding cancer genomics.
We develop artificial intelligence (AI) methods for extracting new biomedical knowledge from the wiring patterns of systems-level, heterogeneous omics data. Our graphlet-based and other methods uncover the patterns in molecular (omics) networks and in the multi-scale organization of these networks indicative of biological function, translating the information hidden in the network topology into domain-specific knowledge. Also, we introduce a versatile data fusion (integration) machine learning (ML) framework to address key challenges in precision medicine from the wiring patterns of omics network data: better stratification of patients, prediction of driver genes in cancer, and re-purposing of approved drugs to particular patients and patient groups, including COVID-19 patients. Our new methods stem from novel network science algorithms coupled with graph-regularized non-negative matrix tri-factorization (NMTF), a machine learning technique for dimensionality reduction, inference and co-clustering of heterogeneous datasets. We utilize our new framework to develop methodologies for understanding the molecular organization of the omics data embedding space.
The chromosomes of the human genome are organized in three dimensions, compartmentalizing the cell nucleus, and different genomic loci also interact with each other. However, the principles underlying such 3D genome organization and its functional impact remain poorly understood. In this talk, I will introduce some of our recent work in developing representation learning methods to study single-cell 3D genome organization. Our methods reveal the single-cell chromatin interactome patterns in different cellular conditions and at different scales. We hope that these algorithms will provide new insights into the structure and function of nuclear organization in health and disease.
I will discuss a unifying statistical formulation for many fundamental problems in genome science and develop a reference-free, highly efficient algorithm that solves it. This formulation allows us to construct an algorithm that performs inference on raw reads, avoiding references completely. We illustrate the power of our approach for new data-driven biological discovery with examples of novel single-cell resolved, cell-type-specific isoform expression, including splicing, expression in the major histocompatibility complex, and de novo prediction of viral protein adaptation including in SARS-CoV-2.
Thursday, July 7th, 2022
To understand how the microbiome and viral infections contribute to non-communicable human diseases, it is important to understand the network perturbations effected by these microbial agents. At the Institute of Network Biology (INET) we aim to understand the principles of protein interaction network function, define patterns of network perturbation by disease genetics, and determine how microbial and viral perturbations interact with host genetics to modulate genetic disease risk. We use an integrated approach consisting of high-throughput network mapping, bioinformatic and deep-learning analyses, and targeted validation of specific hypotheses. I will present unpublished data on systematic interactome maps of coronaviral and microbiome-encoded proteins in the human host network, their relation to human genetics, and extensive functional validation data.
Complex traits are established through the joint influences of multiple genetic and environmental perturbations. There is a shortage of generalizable principles explaining how molecular networks integrate genetic and environmental effects, ultimately leading to complex cellular and organismal traits. In particular, it is poorly understood when and how genetic perturbations lead to molecular changes that are confined to small parts of a network versus when they lead to large-scale adaptations of global network states. Here, we present a concept classifying genetic effects as local, regional or global depending on what fraction of a molecular network they affect. We exemplify this notion using transcriptome, proteome and phospho-proteome profiling of genetically heterogeneous populations of yeast strains, which we integrate with an array of cellular traits. Our analysis identified a central gauge of the yeast molecular network that is related to PKA and TOR (PT) signaling. The resulting ‘PT state’ could be summarized in a single value that explained large parts of the molecular configuration of the strains. This PT state was associated with a specific balance between cellular processes spanning energy and amino acid metabolism, transcription, translation, cell cycle control and cellular stress response. Carbon source quality, oxidative stress, and gene-environment interactions caused monotonic shifts of the molecular network state along the same axis. We further show that complex traits like heat stress resistance and longevity (stationary phase viability) result from the synthesis of genetic effects modulating this PT state with global network effects, plus much more trait-specific effects modulating only small parts of the network. Our work provides a rationale for the conditions under which genetic effects propagate through molecular networks with pleiotropic consequences.
Complementary methods are required to fully characterize multiprotein complexes in vitro and in vivo. Affinity purification coupled to mass spectrometry (MS) can identify the composition of protein complexes at scale. However, information on direct contacts between subunits is often lacking. In contrast, solving the 3D structure of protein complexes by X-ray diffraction or cryo-electron microscopy can provide this information, but is not yet scalable for proteome-wide efforts. We have developed quantitative bioluminescence-based methods that facilitate binary interaction mapping in mammalian cells with high sensitivity and specificity. We have applied these technologies to study the associations of huntingtin (HTT), a protein of unknown function at the root of Huntington’s disease. We found that HTT controls the abundance of its partner HAP40 in mammalian cells, suggesting that it functions as a scaffold preventing the degradation of partner proteins. In another systematic screen, we identified high-confidence binary interactions for proteins of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which were subsequently used in an in silico compound screen. We discovered a new chemical compound that directly targets the interaction between NSP10 and NSP16, which is critical for virus replication. Finally, we defined partners for the AAA ATPase p97, which interacts with many proteins and plays a functional role in various subcellular processes. We found that p97 associates with splicing regulators in an ASPL-dependent manner, suggesting a functional link between the p97:ASPL complex and mRNA processing. Overall, systematic mapping of direct interactions between proteins in higher-order protein assemblies facilitates a better understanding of cellular and disease processes. Also, high-confidence binary interactions are important drug targets with a high potential for innovation in therapy development.
TBD
Cancer genomes accumulate many somatic mutations resulting from imperfect DNA processing during the normal cell cycle, as well as from carcinogenic exposures or cancer-related aberrations of the DNA maintenance machinery. These processes often lead to distinctive patterns of mutations, called mutational signatures. Considering these signatures as quantitative traits, we can leverage them for studies of the interactions between mutagenic processes, other cellular processes, and the environment. Untangling these interactions is critical for understanding the processes underlying mutational signatures and their impact on the organism. I will discuss several computational approaches, including a method for the deconvolution of the contributions of DNA damage and repair to the mutational landscape of cancer.
A powerful method to study the genotype-to-phenotype relationship is the systematic assessment of mutant phenotypes using genetically accessible model systems. We have developed and applied methods for quantitative analysis of genetic interactions in double mutants using yeast colony size as a proxy for cell fitness. Our global digenic interaction network reveals a hierarchy of functional modules, including pathways and complexes, bioprocesses and cell compartments. We have also expanded our systematic genetics pipeline to include single-cell image-based readouts and arrays of yeast strains expressing GFP-tagged proteins for exploration of proteome dynamics and the effects of genetic perturbations on subcellular compartment morphology. Recently, we have leveraged the principles about genetic networks that we discovered in yeast to map genetic interactions in human HAP1 cells using genome-wide CRISPR/Cas9 screens. Our yeast work guided our selection of query genes to screen and provided a road-map for extraction of functional information from the resulting data. The interactions screened to date include more than 85% of the genes in the human genome that are expressed in HAP1 cells, and as was observed in yeast, interaction profile similarity is highly predictive of gene function. I will describe our results in the context of our ongoing efforts to discover the principles of genetic networks in yeast and apply what we learn to understand the functional organization of the human genome.
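A quantitative genetic interaction can be illustrated with a short sketch. It assumes the commonly used multiplicative model, in which the expected double-mutant fitness is the product of the two single-mutant fitnesses; the abstract does not state which model is used, and the function names and the classification threshold below are hypothetical.

```python
def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Observed double-mutant fitness minus the multiplicative expectation.

    All fitness values are relative to wild type (wild type = 1.0), e.g. as
    estimated from normalized colony size.
    """
    return fitness_ab - fitness_a * fitness_b


def classify_interaction(score, threshold=0.08):
    """Label a score as negative, positive or neutral.

    The threshold is an illustrative assumption, not a value from the abstract.
    """
    if score <= -threshold:
        return "negative"  # double mutant is sicker than expected
    if score >= threshold:
        return "positive"  # double mutant is fitter than expected
    return "neutral"
```

For example, with single-mutant fitnesses of 0.8 and 0.5 the expected double-mutant fitness is 0.4, so an observed value of 0.1 would score as a negative interaction in this sketch.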
Friday, July 8th, 2022
Changes in gene regulation were a major driver of the divergence of archaic hominins (AHs; Neanderthals and Denisovans) and modern humans (MHs). The three-dimensional (3D) folding of the genome is critical for regulating gene expression; however, its role in recent human evolution has not been explored because the degradation of ancient samples does not permit experimental determination of AH 3D genome folding. To fill this gap, we apply deep learning methods for inferring 3D genome organization from DNA sequence to Neanderthal, Denisovan, and diverse MH genomes. Using the resulting 3D contact maps across the genome, we identify 167 distinct regions with diverged 3D genome organization between AHs and MHs. We show that these 3D-diverged loci are enriched for genes related to the function and morphology of the eye, supra-orbital ridges, hair, lungs, immune response, and cognition. Despite these specific diverged loci, the 3D genome of AHs and MHs is more similar than expected based on sequence divergence, suggesting that the pressure to maintain 3D genome organization constrained hominin sequence evolution. We also find that 3D genome organization constrained the landscape of AH ancestry in MHs today: regions more tolerant of 3D variation are enriched for introgression in modern Eurasians. Finally, we identify loci where modern Eurasians have inherited novel 3D genome folding from AH ancestors, which provides a putative molecular mechanism for phenotypes associated with these introgressed haplotypes. In summary, our application of deep learning to predict archaic 3D genome organization illustrates the potential of inferring molecular phenotypes from ancient DNA to reveal previously unobservable biological differences.
Approaches for the identification of disease-causal mutations are widely applied in research and clinical settings, but interpretation and ranking of the resulting variants remain challenging. Combined Annotation Dependent Depletion (CADD, https://cadd-sv.bihealth.org/) integrates annotations by contrasting variants that survived purifying selection along the human lineage with simulated mutations to score short sequence variants (SNVs, InDels, multi-allelic substitutions). Since its publication (Kircher, Witten et al. Nat Genet. 2014), CADD has been widely adopted by the community, and minor adjustments and fixes have been released, including native support for both the GRCh37 and GRCh38 assemblies (Rentzsch et al. NAR 2019). Recently, we assessed existing deep neural network (DNN) models for splice effects with the Multiplexed Functional Assay of Splicing using Sort-seq dataset (MFASS, Cheung et al. Mol Cell. 2019). For integration into CADD, we selected the two best-performing DNN models based only on genomic sequence, MMSplice and SpliceAI (Rentzsch et al. Genome Med. 2021). The DNN scores boosted CADD's predictions for splice effects, and we noted that while the DNN scores have superior performance on splice variants, they fail to account for nonsense and missense effects of the same variants. This suggests that variant prioritization will improve with more domain-specific information and underlines the importance of identifying additional such features, e.g. for regulatory sequences. With rapid advances in the identification of structural variants (SVs), we decided to apply the general concept of CADD to score them (CADD-SV, https://cadd-sv.bihealth.org/). While methods utilizing individual mechanistic principles like the deletion of coding sequence or 3D architecture disruptions were available, a comprehensive tool that uses the broad spectrum of available SV annotations was missing.
We show that CADD-SV scores are predictive of pathogenicity and population frequency and that CADD-SV's ability to prioritize pathogenic variants exceeds that of existing methods like SVScore and AnnotSV (Kleinert & Kircher, Genome Res. 2022). Our results highlight advantages of the CADD approach, like profiting from a large training data set covering diverse and rare feature annotations without major ascertainment effects from historic and on-going variant collections.
How the splicing machinery defines exons or introns as the spliced unit has remained a puzzle for 30 years. Here, we demonstrate that peripheral and central regions of the nucleus harbor genes with two distinct exon-intron GC content architectures that differ in the splicing outcome. Genes with low GC content exons, flanked by long introns with lower GC content, are localized in the periphery, and the exons are defined as the spliced unit. Alternative splicing of these genes results in exon skipping. In contrast, the nuclear center contains genes with a high GC content in the exons and short flanking introns. Most splicing of these genes occurs via intron definition, and aberrant splicing leads to intron retention. We demonstrate that the nuclear periphery and center generate different environments for the regulation of alternative splicing and that two sets of splicing factors form discrete regulatory subnetworks for the two gene architectures. Our study connects 3D genome organization and splicing, thus demonstrating that exon and intron definition modes of splicing occur in different nuclear regions.
I will describe two projects that aim to better dissect the causal chain from functional genetic variant through molecular intermediates and finally to organismal trait or disease risk. In the first, we are using pooled profiling of the binding of RNA-binding proteins (RBPs, including splice factors) across individuals to measure and then computationally model genetic effects on both binding and RNA splicing. In the second, we have developed a causal network inference method that scales to hundreds of nodes by leveraging convex optimization.
The host defense against invading pathogens consists of pathogen elimination (‘resistance’) and the limitation of tissue damage resulting from host-pathogen interactions (‘disease tolerance’). As disease tolerance is a critical component of the host defense, it has become of particular interest in the treatment of infectious diseases, such as influenza virus and SARS-CoV-2 infections. However, the distinct molecular programs underpinning disease tolerance and resistance have remained obscure. The lack of such molecular understanding has been a barrier to developing therapies that specifically target the disease tolerance (or resistance) components of the host defense. In this talk I will present our identification of two distinct gene programs that are co-activated during in vivo influenza A virus (IAV) infection in lungs. We show that one program is specific to disease-tolerance phenotypes while the other program is specific to resistance phenotypes. We developed and validated the programs using in vivo IAV infection across 33 mouse strains that differ in their physiological ability to resist and tolerate infection, and by integrating transcription profiles of isolated cell types across several human cohorts. The identified decoupling between disease tolerance and resistance allowed us to reveal novel organizational principles, markers and regulators of the host defense.