Learning from Untrusted Data
We consider the problems of estimation, learning, and optimization over a large dataset of which a subset consists of points drawn from the distribution of interest, and we make no assumptions on the remaining points---they could be well-behaved, extremely biased, of adversarially generated. We investigate this question via two models for studying robust estimation, learning, and optimization. One of these models, which we term the "semi-verified" model, assumes access to a second much smaller (typically constant-sized) set of "verified" or trusted data points that have been drawn from the distribution of interest. The key question in this model is how a tiny, but trusted dataset can allow for the accurate extraction of the information contained in the large, but untrusted dataset. The second model, "list-decodable learning", considers the task of returning a small list of proposed answers. Underlying this model is the question of whether the structure of "good" datapoints can be overwhelmed by the remaining data--surprisingly, the answer is often ``no''. We present several strong algorithmic results for these models, for a large class of mathematically clean and practically relevant robust estimation and learning tasks.
The talk is based on several joint works with Jacob Steinhardt and Moses Charikar, and with Michela Meister.