Mergeable Summaries and the Data Sketches Library
Mergeable summaries (formalized by Agarwal et al. at PODS 2012) allow many different streams of data to be summarized independently; the summaries computed from each stream can then be quickly combined to obtain an accurate summary of various combinations of the underlying datasets (union, intersection, etc.). Among other major benefits, mergeable summaries allow data to be processed in a fully distributed and parallel manner: the data is partitioned arbitrarily across many machines, each partition is summarized on its own, and the resulting summaries are merged.
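To make the partition, summarize, and merge pattern concrete, here is a minimal sketch of one classic mergeable summary for unique counts, a k-minimum-values (KMV) sketch. It is an illustration only and not the Data Sketches library's API: the class name `KMVSketch`, the parameter `k`, the use of SHA-256 as the hash function, and the overlapping example partitions are all assumptions made for this example.

```python
import hashlib

HASH_SPACE = 2 ** 64  # hash values are treated as integers in [0, 2**64)


class KMVSketch:
    """Hypothetical k-minimum-values summary for estimating distinct counts."""

    def __init__(self, k=256):
        self.k = k
        self.hashes = set()  # the (at most) k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Deterministic 64-bit hash taken from the first 8 bytes of SHA-256.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _trim(self):
        # Keep only the k smallest hash values.
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def update(self, item):
        self.hashes.add(self._hash(item))
        self._trim()

    def merge(self, other):
        # The k smallest hashes of a union are always among the k smallest
        # hashes of each part, so merging loses nothing relative to
        # summarizing the combined data directly.
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        # Fewer than k distinct hashes seen: the count is exact
        # (ignoring hash collisions).
        if len(self.hashes) < self.k:
            return float(len(self.hashes))
        # Otherwise use the standard KMV estimator (k - 1) / (k-th smallest
        # hash, normalized to the unit interval).
        return (self.k - 1) * HASH_SPACE / max(self.hashes)


def summarize(items, k=256):
    sketch = KMVSketch(k)
    for item in items:
        sketch.update(item)
    return sketch


# Three overlapping partitions whose union contains 20000 distinct items.
partitions = [range(0, 8000), range(6000, 14000), range(12000, 20000)]

# Summarize each partition independently (this step could run on separate
# machines), then merge the summaries.
merged = summarize(partitions[0])
for part in partitions[1:]:
    merged = merged.merge(summarize(part))

# Summarize the whole dataset in one pass, for comparison.
whole = summarize(range(20000))

print("merged estimate:     ", round(merged.estimate()))
print("single-pass estimate:", round(whole.estimate()))
print("identical summaries: ", merged.hashes == whole.hashes)
```

In this sketch the merged summary holds exactly the same hash values as a summary built over all the data at once, so the estimate does not depend on how the data was partitioned. The estimate itself is approximate, with a relative standard error of roughly 1/sqrt(k).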
This talk will describe a line of research that has grown out of the development of Data Sketches, an open source library of production-quality implementations of mergeable summaries for basic problems including unique counts, quantiles, frequent items, sampling, and matrix analysis. The library is currently used by several companies and government agencies (Yahoo/Oath, Amazon, Splice Machine, GCHQ, etc.) and enables real-time processing of massive datasets.