Computation-Statistics Tradeoffs in Unsupervised Learning via Data Summarization

Thursday, March 19th, 2015, 12:00 pm–12:25 pm

Add to Calendar

Event:

Information Theory, Learning and Big Data

Speaker:

Mesrob Ohannessian, UC San Diego

Location:

Calvin Lab Auditorium

Faced with massive data, is it possible to trade off statistical risk and computational time? This challenge lies at the heart of large-scale machine learning. I will show in this talk that we can indeed achieve such risk-time tradeoffs by strategically summarizing the data, in the unsupervised learning problem of probabilistic k-means, i.e. vector quantization. In particular, there exist levels of summarization for which as the data size increases, the running time decreases, while a given risk is maintained. Furthermore, there exists a constructive algorithm that provably finds such tradeoff levels. The summarization in question is based on coreset constructions from computational geometry. I will also show that these tradeoffs exist and may be harnessed for a wide range of real data. This adds data summarization to the list of methods, including stochastic optimization, that allow us to perceive data as a resource rather than an impediment.

Download Video [1.4GB .mp4]