Streaming, Sketching and Sufficient Statistics

Lecture 1: Streaming, Sketching and Sufficient Statistics I
Lecture 2: Streaming, Sketching and Sufficient Statistics II

This series of talks was part of the Big Data Boot Camp. Videos for each talk are available through the links above.

Speaker: Graham Cormode, University of Warwick

One natural way to deal with the challenge of Big Data is to make the data smaller: that is, to seek a compact (sublinear) representation of the data in which certain properties are (approximately) preserved. Such representations can be thought of as a generalization of sufficient statistics for properties of the data. The area of "streaming algorithms" seeks algorithms that can build such a summary as information is processed incrementally. An important class of streaming algorithms is sketches: carefully designed random projections of the input data that can be computed efficiently under the constraints of the streaming model. Sketches have the attractive property that they can easily be computed in parallel over partitions of the input. Their design aims to optimize a variety of properties: the size of the summary, the time required to compute it, the number of 'true' random bits required, and the accuracy guarantees that result.
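To make the idea of a sketch that can be computed in parallel over partitions concrete, here is a minimal illustration using a Count-Min sketch, one standard linear sketch of a frequency vector. The abstract does not name a specific sketch, so this choice, the class name `CountMinSketch`, and the width, depth and seed parameters are all assumptions made for the example. Keys are assumed to be non-negative integers.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min sketch over non-negative integer keys."""

    PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than any key

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One hash per row: h(x) = ((a*x + b) mod PRIME) mod width.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row, (a, b) in enumerate(self.hashes):
            yield row, ((a * key + b) % self.PRIME) % self.width

    def update(self, key, count=1):
        # Streaming update: touch one counter per row.
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # With non-negative counts, this never underestimates the true count.
        return min(self.table[row][col] for row, col in self._cells(key))

    def merge(self, other):
        # Merging is only meaningful if both sketches use the same hashes.
        if self.width != other.width or self.hashes != other.hashes:
            raise ValueError("sketches must share width, depth and seed")
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]


# Each key 0..49 appears exactly 200 times in this toy stream.
stream = [i % 50 for i in range(10_000)]

# Sketch two partitions independently, then merge.
left = CountMinSketch(256, 4, seed=1)
right = CountMinSketch(256, 4, seed=1)
for x in stream[:5000]:
    left.update(x)
for x in stream[5000:]:
    right.update(x)
left.merge(right)

# Sketch the whole stream in one pass for comparison.
whole = CountMinSketch(256, 4, seed=1)
for x in stream:
    whole.update(x)

print(left.table == whole.table)  # merged sketch equals the single-pass sketch
print(left.estimate(7))           # at least the true count of 200
```

Because the sketch is a linear function of the input frequencies, adding the counter tables of two partitions gives exactly the table that a single pass over the whole stream would produce, provided both sketches were built with the same hash functions.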

This tutorial will present some of the powerful results that have emerged from these efforts:

sketches for sparse recovery over high dimensional inputs, with applications to compressed sensing and machine learning;
effective sketches for Euclidean space, providing fast instantiations of the Johnson-Lindenstrauss Lemma;
techniques for sampling from the support of a vector, and estimating the size of the support set;
applications of these ideas to large graph computations and linear algebra;
lower bounds on what is possible to summarize, arising from Communication Complexity and Information Complexity.
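
As a small illustration of the Euclidean-space item in the list above, the sketch below applies a dense Gaussian random projection, the basic construction behind the Johnson-Lindenstrauss Lemma. It is not one of the fast instantiations the tutorial refers to, which use structured or sparse matrices instead. The function names and the dimensions `d` and `k` are assumptions chosen for the example.

```python
import math
import random


def gaussian_sketch_matrix(k, d, seed=0):
    """Return a k x d matrix of N(0, 1/k) entries as a list of rows."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Compute the matrix-vector product, i.e. the sketch of x."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


d, k = 1000, 200  # original and sketch dimensions (illustrative values)
S = gaussian_sketch_matrix(k, d, seed=42)

rng = random.Random(7)
x = [rng.uniform(-1, 1) for _ in range(d)]
y = [rng.uniform(-1, 1) for _ in range(d)]
diff = [a - b for a, b in zip(x, y)]

# Distances are approximately preserved in the lower-dimensional space.
ratio = norm(project(S, diff)) / norm(diff)
print(ratio)  # expected to be close to 1

# Linearity: the sketch of a difference is the difference of the sketches.
sx, sy = project(S, x), project(S, y)
gap = max(abs((a - b) - c) for a, b, c in zip(sx, sy, project(S, diff)))
print(gap)  # floating-point rounding error only
```

The distortion shrinks as `k` grows, roughly in proportion to `1 / sqrt(k)`, which is the trade-off between summary size and accuracy that the abstract mentions.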