Monday, April 28th, 2014

Program Retrospective: Theoretical Foundations of Big Data Analysis

By Ben Recht

Although it is impossible not to hear about the importance of “Big Data” in today’s society, it is often hard to pin down what exactly is so exciting about large data sets. “Big Data” is not simply about collecting or storing the largest stockpile of measurements. The “bigness” is not about who can process the most data, but about who can get the most out of the data they collect. Each extra bit of information wrung from our collected data can translate into more effective diagnoses of disease, more efficient energy systems, or a more precise understanding of our universe. Every computing device should be tuned to make the most of the high volume of data it encounters, and different devices face different bottlenecks.

What is unique about this research challenge is that it is not obviously served by one academic department. Statistics studies uncertainty, but doesn’t necessarily consider the constraints imposed by computing substrates. Computer science understands scaling laws and abstractions, but does not tend to grapple with the issues of unreliability and uncertainty in the data we collect. Application-specific approaches often create niche tools that don’t provide the insights required to generalize to other areas interested in data processing. Thus, one of the biggest challenges of our semester program at the Simons Institute was simply determining how to break down the artificial silos in engineering and operate at the intersection of computation, statistics, and applications.

Every engineering department hosts someone who works on computational mathematics. They all work on very similar problems, but invent completely different terminology for their analyses. For example, an algorithm for processing streaming data into a consistent estimate is called “least mean squares” by electrical engineers, “online learning” by computer scientists, and the “Kaczmarz method” by imaging scientists. Each of these disciplines brings its own perspective and analyses, and linking these ideas together can only advance our theoretical understanding and practical techniques.
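
To make the connection concrete, here is a minimal sketch (illustrative only; the function name and step-size convention below are not standard to any one field) of the single update rule hiding behind all three names: each incoming measurement nudges the current estimate in proportion to its prediction error. With a unit step on a consistent linear system this is the Kaczmarz update; with a small constant step it is the least-mean-squares / online-gradient update for the squared loss.

import numpy as np

def streaming_least_squares(A, b, step=1.0, passes=10):
    """Update an estimate x one row (a_i, b_i) at a time.

    With step=1 this is the Kaczmarz update; with a small constant
    step it is the least-mean-squares / online-gradient update for
    the squared loss. Illustrative sketch, not production code.
    """
    x = np.zeros(A.shape[1])
    for _ in range(passes):
        for a_i, b_i in zip(A, b):
            residual = b_i - a_i @ x                     # prediction error on this sample
            x = x + step * residual * a_i / (a_i @ a_i)  # nudge x toward consistency with this row
    return x

# Toy usage: recover x_true from a consistent system A x = b.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true
x_hat = streaming_least_squares(A, b, passes=20)
print(np.linalg.norm(x_hat - x_true))  # small error after a few passes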

Since our assembled group was so diverse, the “boot camp” format at Simons was essential. We spent a week at the beginning of the semester comparing notes and concepts from data science, and identifying core topics that allowed us to understand the statistical properties of large data sets. We drew connections among our different areas and got a feel for how to communicate effectively despite our varied jargon.

With this foundation in place, the remainder of the program worked its way up through different scales. The first workshop, on representation, focused on how we describe, summarize, and compress data for analysis. The second workshop focused on computation and on scaling algorithms to our current computing substrates. The third workshop examined graphical data and how to use analytics to understand structure and information flow in large-scale networks such as social networks. The semester concluded with a study of privacy, asking how to secure our data without degrading algorithmic performance.

Seeing this thread of computational and statistical thinking run through every scale of computation and data analysis, we cannot ignore that we are in the midst of a revolution in the computational sciences. This semester program laid the foundation for thinking holistically about data processing and provided energy and momentum to continue making advances in this exciting and vital research area.

 
