# Genomics Tutorials and Talks

David Woodruff (IBM Almaden)

Calvin Lab Room 116

## Beating CountSketch for Heavy Hitters in Insertion-Only Streams

Given a stream p_1, ..., p_m of items from a universe U, which, without loss of generality we identify with the set of integers {1, 2, ..., n}, we consider the classic problem of finding the most frequent items in the stream. Namely, we consider the problem of returning all l_2-heavy hitters, i.e., those items j for which f_j >= eps sqrt{F_2}, where f_j is the number of occurrences of item j in the stream, and F_2 = sum_{i in [n]} f_i^2 is the second moment of the stream. Such a guarantee is considerably stronger than the l_1-guarantee, which finds those items j for which f_j >= eps m. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such j using Theta(log^2 n) bits of space (for constant eps). The only known lower bound is Omega(log n) bits, which comes from the need to specify the identities of the items found.

We show it is possible to achieve O(log n log log n) bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of new space bounds for other problems in data streams, such as estimating F_2 at all times in the stream using O(log n log log n) bits of space, which improves a natural union bound and an algorithm of Huang, Tai, and Yi. We also show how to estimate the maximum frequency up to additive error eps*sqrt{F_2} using O(log n log log n) bits of space, resolving Open Question 3 from the IITK list of open questions for data streams for insertion only streams.