# UCSF-Simons Seminar Series

Gregory Valiant (Stanford University)

Calvin Lab Room 116

## Accurate Inferences Beyond the Empirical Distribution

We consider the problem of making accurate inferences about a complex distribution, in the regime in which the amount of data (i.e the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution. Our first result is that given n samples drawn from an arbitrary distribution over a discrete support, one can accurately recover the unlabelled vector of probabilities of all domain elements whose true probability is greater than 1/(n log n). Stated differently, one can learn--up to relabelling--the portion of the distribution consisting of elements with probability greater than 1/(n log n). This result has several curious implications, including leading to an optimal algorithm for "de-noising" the empirical distribution of the samples, and implying that one can accurately estimate the number of new domain elements that would be seen given a new larger sample, of size up to n* log n. (Extrapolation beyond this sample size is provable information theoretically impossible, without additional assumptions on the distribution.) While these results are applicable generally, we highlight an adaptation of this general approach to some problems in genomics (e.g. quantifying the number of unobserved protein coding variants), and discuss some future directions.

This talk is based on two joint works, one with Paul Valiant, and one with Paul Valiant and James Zou.