In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used to aid the assembly of subsequent genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and indels. The genomes of subsequent samples are reconstructed as paths through the graph using an efficient hidden Markov Model structure in which short read data is efficiently summarised through a de Bruijn graph. By applying the method to the extended HLA MHC region, combining eight assembled haplotypes, sequences of known classical HLA alleles, and 87,640 variants from the 1000 Genomes Project, we show, using SNP genotyping, short-read and long-read data, how the method improves the accuracy of individual genome assembly. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, making the case for continued development of reference-quality genome sequences.
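As a rough illustration of how short-read data can be summarised in a de Bruijn graph and then used to score a candidate path, here is a minimal sketch. It is not the method described in the abstract: the function names `build_de_bruijn` and `path_support`, the toy reads, and the choice of "minimum k-mer count" as a support score are all assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer -> {next (k-1)-mer: k-mer count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def path_support(graph, sequence, k):
    """Hypothetical score: the minimum read count over the k-mers of a
    candidate sequence (0 if any k-mer was never observed)."""
    counts = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts.append(graph.get(kmer[:-1], {}).get(kmer[1:], 0))
    return min(counts) if counts else 0


# Toy reads (invented for the example).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
supported = path_support(graph, "ACGTA", k=3)
unsupported = path_support(graph, "ACGGA", k=3)
```

In this toy case a candidate haplotype whose k-mers all appear in the reads receives a positive score, while one containing an unseen k-mer scores zero.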
Tuesday, February 18th, 2014
Genetic variants that impact gene expression play a central role in the genetics of complex traits and in evolution. Yet the precise links between genetic variation and changes in gene regulation are poorly understood and it remains very difficult to predict which variants have regulatory effects in any given cell type. In this talk I will describe work we have done on identifying genetic variants that impact gene expression and understanding the primary mechanisms by which such variants act.
We describe a general method for finding the probability distribution of neutral genealogies, which allows for migration between demes, splitting of demes (as in the isolation-with-migration (IM) model), and recombination between linked loci. These processes are described by a set of linear recursions for the generating function of branch lengths. Under the infinite-sites model, the probability of any configuration of mutations can then be found by differentiating this generating function. Such calculations are feasible for small numbers of sampled genomes, and can readily be automated. We show how the method extends to continuous genomes, giving the joint distribution of coalescence times and recombination break-points. This allows an assessment of the accuracy of the sequential Markov coalescent, and of methods that assume non-recombining blocks.
Joint work with Konrad Lohse.
Approximating the coalescence process with recombination as a Markov model along sequences—an approach called the Sequential Markov Coalescent or SMC—greatly reduces the complexity of modelling sequences. Rather than modelling the full joint probability of all nucleotides it suffices to model the probability of pairs of neighbouring nucleotides. Combined with hidden Markov models, SMC has been used to develop a number of different inference models in recent years, capable of drawing inference from full genomic sequence alignments.
I will give a short overview of the different methods based on SMC, then talk about a number of models we have developed in Aarhus and how we have used these methods to analyse the genetics of ancestral species, especially the great apes.
I will describe likelihood-based methods that relate traits to phylogenetic tree shapes. The phylogenetic tree of a group of species contains information about character transitions and about diversification: higher speciation rates, for example, give rise to shorter branch lengths. The likelihood methods that we have developed use the information contained in a phylogeny and integrate over all possible evolutionary histories to infer the speciation and extinction rates for species with different character states. These methods also allow inference of the mode of trait change, be it cladogenetic or anagenetic. These likelihood methods can be used to provide more detailed information than previous methods, allowing us to disentangle whether a particular character state is rare because species in that state are prone to extinction, are unlikely to speciate, or tend to move out of that state faster than they move in. Related applications to within-species phylogenies and to traits that influence the rate of molecular evolution will also be discussed.
Wednesday, February 19th, 2014
I will review some of the population genetic analyses motivated by the sequencing of ancient DNA extracted from Neanderthal and Denisova fossils. The models presented are directed to (1) estimating rates of admixture between extinct archaic populations and the ancestors of some modern human populations, (2) distinguishing between admixture and ancestral population subdivision as an explanation for the greater similarity of Neanderthals and modern non-African populations, (3) detecting evidence of natural selection affecting nucleotide substitutions fixed for the derived state in modern humans and fixed for the ancestral state in Neanderthals and Denisovans, and (4) detecting inbreeding in archaic genomes.
We have recently developed whole-genome in-solution capture (WISC), a fast, flexible, and inexpensive whole-genome capture approach which uses biotinylated RNA baits for capturing genomic DNA from genomes of interest (Carpenter et al., 2013, AJHG). Previously, we demonstrated the power of WISC on Illumina libraries created from four Iron Age and Bronze Age human teeth from Bulgaria, as well as bone samples from seven Peruvian mummies and a Bronze Age hair sample from Denmark. Prior to capture, shotgun sequencing of these libraries yielded an average of 1.2% of reads mapping to the human genome (including duplicates). After capture, this fraction increased dramatically, with up to 59% of reads mapped to human and fold enrichment ranging from 6X to 159X. In this talk, we will discuss our extension of WISC to three pressing problems: (1) extending WISC to non-human model systems, (2) improving metagenomic sequencing via human genome subtraction, and (3) development of “gold standards” for validation in ancient DNA (aDNA). In the first project, we capture the genomes of five ancient dogs and wolves, including one wolf from the 19th century, one wolf from the Late Pleistocene, two pre-Columbian dogs (one from Mexico and one from Peru), and a Mesolithic dog from Yugoslavia. We demonstrate the adjustments that must be made to adapt the protocol to different species. In the second project, we developed a “negative” capture or subtraction approach where we enrich for genomic DNA not bound to the RNA baits. In the application to metagenomic sequencing, we saw a depletion of human DNA in saliva from 49% pre-capture (~4.8M Illumina MiSeq reads per sample) to 3% (~275K reads) without loss of complexity in the remaining library. Finally, we will discuss development of “gold standards” for aDNA sequencing via orthogonal technology validation.
We have resequenced a subset of the original 12 samples used in the WISC study on the Ion Torrent Proton system and found that orthogonal validation reduces the false positive rate of PCR-induced library variants.
Samples of ancient DNA can be used to estimate the frequencies of a given allele at a range of times in the past. One would like to use these frequencies to make inferences about historic population sizes, selection strengths, etc. The likelihood of the frequency data involves the transition densities of a diffusion process for which there are no closed-form expressions, and computing these transition densities numerically requires, in principle, solving a partial differential equation for every choice of the underlying parameters. I will describe recent work with Josh Schraiber (UC Berkeley Integrative Biology) in which we "interpolate" the paths of the allele frequency diffusion process and use rejection sampling ideas as part of a Bayesian procedure for inferring demographic and genetic parameters from allele frequency time series.
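To make the notion of an allele frequency time series concrete, the following sketch simulates one trajectory under a discrete Wright-Fisher model with genic selection, the kind of model that an allele frequency diffusion approximates. It is not the path-interpolation or rejection sampling procedure described in the abstract; the function name `wright_fisher_path`, the fitness scheme (1 + s versus 1), and all parameter values are assumptions chosen for illustration.

```python
import random


def wright_fisher_path(p0, n_diploid, s, generations, seed=0):
    """Simulate allele frequencies under Wright-Fisher drift with genic
    selection (assumed relative fitness 1 + s for the focal allele)."""
    rng = random.Random(seed)
    two_n = 2 * n_diploid
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Deterministic change in frequency due to selection.
        p_sel = p * (1 + s) / (1 + s * p)
        # Drift: binomial sampling of 2N gene copies for the next generation.
        copies = sum(rng.random() < p_sel for _ in range(two_n))
        p = copies / two_n
        freqs.append(p)
    return freqs


# Illustrative parameter values only.
path = wright_fisher_path(p0=0.1, n_diploid=100, s=0.05, generations=50)
```

Sampling such a trajectory at a handful of time points, with binomial sampling noise on top, gives data of the shape the abstract describes.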
We investigate the interplay between natural selection and genetic drift in a model for evolution in a spatial continuum. In particular, we ask the very basic question, how strong must selection be if we are to have any hope of seeing its signal in variation at linked loci?
This is joint work with Nic Freeman (Bristol) and Daniel Straulino (Oxford).
Thursday, February 20th, 2014
I will discuss the amount of information needed to reconstruct gene trees from species trees and population histories from coalescence times.
Based on joint work with Sebastien Roch and with Junhyong Kim, Miki Racz and Nathan Ross.
We present a flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic data sets. This composite-likelihood approach allows the study of arbitrarily complex evolutionary models. We compare our approach to dadi, the current reference in the field, and show that our approach has better convergence properties for complex models. We present an application of our methodology to non-coding genomic SNP data from four human populations and further show the versatility of our framework by extending it to the inference of demographic parameters from SNP chips with known ascertainment, such as that recently released by Affymetrix to study human origins. We discuss potential extensions of our approach, which appears generally well suited to study complex scenarios from large genomic data sets.
Joint work with Vitor C. Sousa.
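For readers unfamiliar with the site frequency spectrum (SFS) on which this framework operates, here is a minimal sketch of how an unfolded SFS can be tabulated from phased 0/1 haplotypes. The function name `site_frequency_spectrum` and the toy data are assumptions for illustration; the sketch does not reproduce the inference method of the talk.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS from a list of equal-length 0/1 haplotypes
    (1 = derived allele). Entry i counts sites with i derived copies."""
    n = len(haplotypes)
    if n == 0:
        return []
    n_sites = len(haplotypes[0])
    sfs = [0] * (n + 1)
    for site in range(n_sites):
        derived = sum(h[site] for h in haplotypes)
        sfs[derived] += 1
    return sfs


# Toy data: three haplotypes, four sites (invented for the example).
haplotypes = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
sfs = site_frequency_spectrum(haplotypes)
```

Entries 0 and n count monomorphic sites; many analyses use only the entries in between.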
We discuss new theory that allows the calculation of joint allelic spectra from a phylogeny defined by an admixture graph. In contrast to other methods, which use numerical PDE techniques, our methods are algebraic. We show some interesting examples and describe the significant challenges to applications.
Chromosomal segments that are identical by descent (IBD) were recently shown to convey information about population-level features such as demography, natural selection and heritability of common traits. In recent work [1], we developed analytical models for the relationship between haplotype sharing and demography, and showed that IBD sharing provides an effective way of reconstructing demographic events of recent millennia, where classical methods are typically underpowered. We now extend these models to accommodate the simultaneous analysis of multiple demes, providing insight into recent migration rates as well as population size fluctuations. Using this approach we analyzed sequencing data for 498 unrelated individuals from 11 Dutch provinces (The Genome of Netherlands Project). Pairs of individuals from all the analyzed provinces are found to share several IBD segments of length greater than 1 centimorgan (cM), suggesting recent common ancestry of these groups. We observe a north-to-south gradient of declining IBD sharing frequency. While the chance of sharing long (>7 cM), extremely recent IBD segments correlates with modern-day geographic distance, shorter segments are more frequently shared with individuals currently residing in the north of the country, regardless of the individuals’ modern location. Using the developed analytical methods, we reconstruct coalescent distributions and migration rates across the analyzed provinces. In all cases we find evidence for recent exponential growth at different rates for different provinces, with substantial recent gene flow between these demes. Using the retrieved model, we estimate that the average pair of haploid genomes from Dutch individuals in the studied dataset finds a common ancestor ~1600 years before present, with earlier common ancestors typically found in northern provinces and variation that depends on modern geographic location.
In recent years many methods for ancestry inference have been proposed, particularly model based methods such as STRUCTURE in which a generative probabilistic model is explicitly described, and classical methods such as principal component analysis that are designed for spatial ancestry inference. We have recently developed a couple of methods (SPA and LOCO-LD) that combine the best of both worlds - these methods incorporate a probabilistic model, but they provide spatial ancestry inference with high accuracy. I will describe these methods, the evaluation of their performance, as well as some insights about the mathematical relations between the methods.
Friday, February 21st, 2014
Modern genetic datasets are revealing features of our genetic history, and how this has shaped our genomes, in unprecedented detail. We will discuss a novel method to jointly estimate the time of divergence of a pair of populations and their variable sizes, assuming a piecewise constant population size through time. The method uses thousands of regions of the genome with very low recombination rate. For each region, we use an importance sampler to build a large number of possible genealogies, and from those we estimate the likelihood function of parameters of interest. We show via simulation studies that the method performs well in many situations, even where a modest amount of recombination or migration has occurred, giving useful estimates down to times as recent as a few thousand years in the past. We apply the method to five populations from the 1000 Genomes Project, obtaining estimates of split times between European groups and among Europe, Africa and Asia. We simultaneously infer shared and non-shared bottlenecks between out-of-Africa groups, expansions following population separations, and ancestral population sizes further back in time. We will discuss ways in which our results agree, and disagree, with those of previous approaches.
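The piecewise constant population size assumption can be illustrated by sampling the coalescence time of a single pair of lineages. This is only a sketch of the demographic model, not the importance sampler over genealogies described in the abstract; the function name `sample_pairwise_coalescence_time`, the epoch representation, and the population sizes are assumptions for illustration, and the per-generation coalescence rate 1/(2N) assumes a diploid population of size N.

```python
import random


def sample_pairwise_coalescence_time(epochs, rng):
    """Sample the coalescence time (in generations) of two lineages.

    epochs: list of (start_time, diploid_size), sorted by start_time,
    with the first epoch starting at 0 and the last extending to infinity.
    """
    for idx, (start, size) in enumerate(epochs):
        end = epochs[idx + 1][0] if idx + 1 < len(epochs) else float("inf")
        # Within an epoch the waiting time is exponential with rate 1/(2N).
        wait = rng.expovariate(1.0 / (2 * size))
        if start + wait < end:
            return start + wait
        # Otherwise no coalescence in this epoch; by memorylessness we
        # restart the clock at the beginning of the next epoch.
    raise ValueError("epochs must be a non-empty list")


rng = random.Random(1)
# Hypothetical history: size 10,000 recently, 1,000 before generation 500.
epochs = [(0, 10_000), (500, 1_000)]
times = [sample_pairwise_coalescence_time(epochs, rng) for _ in range(1000)]
```

Under a single constant epoch of size N, the sampled times should average close to 2N generations, which is a convenient sanity check.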
Variation in human DNA sequences accounts for a significant share of the genetic risk for common diseases such as hypertension, diabetes, Alzheimer's disease, and cancer. Identifying the human sequence variation that makes up the genetic basis of common disease will have a tremendous impact on medicine in many ways. Recent efforts to identify these genetic factors through large-scale association studies, which compare information on variation between a set of healthy and diseased individuals, have been remarkably successful. However, despite the success of these initial studies, many challenges and open questions remain on how to design association studies and analyze their results. As several recent studies have demonstrated, confounding factors such as population structure and measurement errors can complicate genetic analysis by causing many spurious associations. Yet little is understood about how these confounding factors affect analyses and how to correct for them. In this talk I will discuss several recently developed methods based on linear mixed models for correcting for both known and unknown confounding factors in genetic studies.
Recombination is a fundamental biological process that influences patterns of variation, the efficacy of natural selection and proper segregation of chromosomes during meiosis. In most species, recombination rates are not constant across the genome. Instead, most recombination events tend to occur in narrow regions (generally < 2 kilobases in size), termed 'recombination hot spots'. We review current methods for identifying recombination hot spots from patterns of genetic variation, and show that the commonly used LDhot has a false discovery rate >> 50%. This has important implications for how to interpret previous studies, and for our understanding of how fine-scale recombination rates evolve over time.
Human populations have undergone dramatic changes in population size in the past 100,000 years, including a severe bottleneck of non-African populations and recent explosive population growth. There is currently great interest in how these demographic events may have affected the burden of deleterious mutations in individuals and the allele frequency spectrum of disease mutations in populations. Here we use population genetic models to show that—contrary to previous conjectures—recent human demography has likely had very little impact on the average burden of deleterious mutations carried by individuals. This prediction is supported by exome sequence data showing that African American and European American individuals carry very similar burdens of damaging mutations. We next consider whether recent population growth has increased the importance of very rare mutations in complex traits. Our analysis predicts that for most classes of disease variants, rare alleles are unlikely to contribute a large fraction of the total genetic variance, and that the impact of recent growth is likely to be modest. However, for diseases that have a direct impact on fitness, strongly deleterious rare mutations likely do play important roles, and the impact of very rare mutations will be far greater as a result of recent growth. In summary, demographic history has dramatically impacted patterns of variation in different human populations, but these changes have likely had little impact on either genetic load or on the importance of rare variants for most complex traits.
Identical by descent (IBD) is the term used to describe segments of DNA that descend from a single ancestral haploid genome (a gamete) to current individuals. Since DNA that is IBD has a high probability of being of the same allelic type, IBD provides a basis for modeling similarities among relatives at both the population and pedigree level. Traditionally, IBD has been considered in the pedigree context, but modern genetic data provide opportunities to infer and use IBD among individuals in a population in which pedigree relationships are unknown. Model-based inference can provide appropriate measures of uncertainty, but population-based probability models and methods for inferring IBD are required. Initial analyses have focused on pairs of individuals or haploid gametes, and in that context it has been shown that IBD can be used in the presence of allelic heterogeneity to detect causal genome regions in a case-control study. We present models and methods that extend this pairwise approach to inference of the IBD graph: a dynamic specification of the IBD partitions among multiple gametes across a chromosome. However, implementation of these methods remains challenging. Given a set of realizations of IBD graphs, the next challenge is to use these effectively in the analysis of trait data on members of a population; some preliminary examples will be given.
Parts of this work are joint with Hoyt Koepke, Chaozhi Zheng, Chris Glazner, and Sharon Browning.
Over the past 6 years, over a half-million people have joined 23andMe to learn about their genetics. Most of these people in turn contribute phenotype data to our research efforts. This dataset includes tens of thousands of trios as well as thousands of phenotypes. I'll survey some of the things we've learned, including how to improve short (2-4 cM) IBD segment detection, the admixed ancestry of individuals in the US, detection of local ancestry, the heritability of a wide range of traits, and hundreds of new genetic associations for some interesting conditions (infectious diseases, sleep habits, food preferences, morphology, and more).