Labeling a Data Set Using Sublinearly Many Queries
Sanjoy Dasgupta (UC San Diego)
We consider algorithms that take an unlabeled data set and label it in its entirety, given the ability to interact with a human expert. The goal is to minimize the amount of interaction while producing a labeling that satisfies an (epsilon, delta) guarantee: with probability at least 1-delta over the randomness in the algorithm, at most an epsilon fraction of the labels are incorrect. Scenario 1: The algorithm asks the expert for labels of specific points. This is the standard problem of active learning, except that the final product is a labeled data set rather than a classifier. Scenario 2: The expert also provides "weak rules" or helpful features. We will summarize the state of the art on these problems, in terms of promising algorithms and statistical guarantees, and identify key challenges and open problems.