Monday, March 4th, 2019

9:30 am – 10:30 am
Speaker: John Abowd (U.S. Census Bureau; Cornell University)

The Fundamental Law of Information Reconstruction, a.k.a. the Database Reconstruction Theorem, exposes a vulnerability in the way statistical agencies have traditionally published data. But it also exposes the same vulnerability for the way Amazon, Apple, Facebook, Google, Microsoft, Netflix, and other Internet giants publish data. We are all in this data-rich world together. And we all need to find solutions to the problem of how to publish information from these data while still providing meaningful privacy and confidentiality protections to the providers.

Fortunately for the American public, the Census Bureau's curation of their data is already regulated by a very strict law that mandates publication for statistical purposes only and in a manner that does not expose the data of any respondent--person, household or business--in a way that identifies that respondent as the source of specific data items. The Census Bureau has consistently interpreted that stricture on publishing identifiable data as governed by the laws of probability. An external user of Census Bureau publications should not be able to assert with reasonable certainty that particular data values were directly supplied by an identified respondent. Traditional methods of disclosure avoidance now fail because they are not able to formalize and quantify that risk. Moreover, when traditional methods are assessed using current tools, the relative certainty with which specific values can be associated with identifiable individuals turns out to be orders of magnitude greater than anticipated at the time the data were released.

In light of these developments, the Census Bureau has committed to an open and transparent modernization of its data publishing systems using formal methods like differential privacy. The intention is to demonstrate that statistical data, fit for their intended uses, can be produced when the entire publication system is subject to a formal privacy-loss budget.

To date, the team developing these systems--many of whom are in this room--has demonstrated that bounded ε-differential privacy can be implemented for the data publications from the 2020 Census used to re-draw every legislative district in the nation (PL94-171 tables). That team has also developed methods for quantifying and displaying the system-wide trade-offs between the accuracy of those data and the privacy-loss budget assigned to the tabulations. Considering that work began in mid-2016 and that no organization anywhere in the world has yet deployed a full, central differential privacy system, this is already a monumental achievement.
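
To make the idea of spending a privacy-loss budget on tabulations concrete, here is a minimal sketch (hypothetical numbers, not the Census Bureau's actual TopDown system) of releasing two small tables of counts under ε-differential privacy with the Laplace mechanism, splitting one overall budget between them:

```python
import numpy as np

def laplace_counts(counts, epsilon):
    """Release counts under epsilon-DP by adding Laplace(1/epsilon) noise to each
    cell. Each record contributes to one cell, so each histogram has sensitivity 1
    under add/remove neighbors (bounded DP would require scale 2/epsilon)."""
    return counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)

# Toy example: two tabulations share a total privacy-loss budget of 1.0.
total_budget = 1.0
eps_by_age, eps_by_race = 0.5 * total_budget, 0.5 * total_budget

counts_by_age = np.array([1200, 950, 830, 410])   # hypothetical block counts
counts_by_race = np.array([2100, 640, 650])

noisy_age = laplace_counts(counts_by_age, eps_by_age)
noisy_race = laplace_counts(counts_by_race, eps_by_race)
# By sequential composition, the total privacy loss is eps_by_age + eps_by_race.
```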

But it is only the tip of the iceberg in terms of the statistical products historically produced from a decennial census. Demographic profiles, based on the detailed tables traditionally published in summary files following the publication of redistricting data, have far more diverse uses than the redistricting data. Summarizing those use cases in a set of queries that can be answered with a reasonable privacy-loss budget is the next challenge. Internet giants, businesses and statistical agencies around the world should also step up to these challenges. We can learn from, and help, each other enormously.

11:00 am – 11:45 am
Speaker: Dan Kifer (Pennsylvania State University)

No abstract available.

11:45 am – 12:30 pm
Speaker: Christopher Clifton (Purdue University) 

Several aspects of the current American Community Survey processing pipeline pose challenges for differential privacy.  We will briefly review some of these challenges, and in particular look at missing data imputation.  We present a method for differentially private nearest-neighbor imputation that addresses one of these challenges based on the concept and method of smooth sensitivity.

This work supported by the U.S. Census Bureau under CRADA CB16ADR0160002. The views and opinions expressed in this talk are those of the authors and not the U.S. Census Bureau.
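
Smooth sensitivity itself is easiest to see on the textbook example of the median (Nissim, Raskhodnikova and Smith); the sketch below computes that quantity for clamped data and is only meant to illustrate the concept, not the nearest-neighbor imputation mechanism from the talk:

```python
import numpy as np

def smooth_sensitivity_median(x, beta, lower=0.0, upper=100.0):
    """Smooth sensitivity of the median for data clamped to [lower, upper]
    (the textbook example from Nissim, Raskhodnikova & Smith, 2007)."""
    x = np.sort(np.clip(np.asarray(x, dtype=float), lower, upper))
    n = len(x)
    m = (n - 1) // 2  # 0-based index of the (lower) median

    def val(i):  # out-of-range indices fall back to the data bounds
        if i < 0:
            return lower
        if i >= n:
            return upper
        return x[i]

    best = 0.0
    for k in range(n + 1):
        # local sensitivity of the median at Hamming distance k
        ls_k = max(val(m + t) - val(m + t - k - 1) for t in range(k + 2))
        best = max(best, np.exp(-beta * k) * ls_k)
    return best

data = np.random.uniform(40, 60, size=501)
s = smooth_sensitivity_median(data, beta=0.05)
print("smooth sensitivity:", s, " vs. global sensitivity:", 100.0)
# Noise is then calibrated to this (much smaller) quantity; see the smooth
# sensitivity literature for the exact calibration constants.
```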

2:00 pm – 2:45 pm
Speaker: Alexandra Wood (Harvard University)

No abstract available.

2:45 pm – 3:30 pm
Speaker: Felix Wu (Cardozo School of Law)

No abstract available.

4:00 pm – 4:45 pm
Speaker: James Honaker (Harvard University)

We describe a role for differential privacy in open data repositories handling sensitive data. Archival repositories in the human sciences balance discoverability and replicability with their legal liabilities and ethical constraints to protect sensitive information. The ability to explore differentially private releases of archived data allows a curve-bending change in this trade-off. We further describe PSI, an implementation of a curator system for differentially private queries and statistical models, and its integration with the Dataverse repository. We describe some of the pragmatics of implementing a general purpose curator that works across a wide variety of types of data and types of uses, and of presenting differential privacy to an applied audience new to these concepts.

4:45 pm – 5:30 pm
Speaker: Mayank Varia (Boston University)

In this talk, I will describe some of the many efforts conducted in pursuit of the Modular Approach to Cloud Security project, an NSF Frontier effort to develop techniques and tools for building information systems with meaningful multi-layered security guarantees. Arguably, reasoning about all the security aspects of a system all at once is not feasible. The approach we take is thus modular: we aim at systems that are built from smaller and separable functional components, where the security of each component is asserted individually, and where the security of the system as a whole can be derived from the security of its components. Because this composition can be performed in multiple ways, we also aim to inform and empower consumers so they can personalize the security guarantees offered by the cloud to suit their objectives.

Tuesday, March 5th, 2019

9:00 am – 9:45 am
Speaker: Úlfar Erlingsson (Google Brain)

For the last several years, Google has been leading the development and real-world deployment of state-of-the-art, practical techniques for learning statistics and ML models with strong privacy guarantees for the data involved.  I'll give an overview of our work, and the practical techniques we've developed for training Deep Neural Networks with strong  privacy guarantees.  In particular, I'll cover recent results that show how local differential privacy guarantees can be strengthened by the addition of anonymity, and explain the motivation for that work. I'll also cover recent work on uncovering and measuring privacy problems due to unintended memorization in machine learning models.
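
For readers new to the area, the core step of differentially private training has roughly the following shape (a simplified sketch in the spirit of DP-SGD, not Google's production implementation):

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One noisy gradient step: clip each example's gradient to clip_norm,
    sum, add Gaussian noise with std = noise_multiplier * clip_norm, average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    noisy = total + rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return noisy / len(per_example_grads)

rng = np.random.default_rng(0)
grads = [rng.normal(size=10) for _ in range(32)]   # toy per-example gradients
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
# The overall (epsilon, delta) guarantee is then computed by a privacy accountant
# from the noise multiplier, sampling rate and number of steps.
```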

9:45 am – 10:30 am
Speaker: Ryan Rogers (Apple)

Federated learning has become an exciting direction for both research and practical training of models with user data. Although data remains decentralized in federated learning, it is common to assume that the model updates are sent in the clear from the devices to the server. Differential privacy has been proposed as a way to ensure the model remains private, but this does not address the issue that model updates can be seen on the server and can leak user data. Local differential privacy, in which each individual's data is privatized before it leaves the device, is one of the strongest forms of privacy protection. However, local differential privacy, as it is traditionally used, may prove to be too stringent a privacy condition in many high-dimensional problems, such as distributed model fitting. We propose a new paradigm for local differential privacy by providing protections against certain adversaries. Specifically, we ensure that adversaries with limited prior information cannot reconstruct, with high probability, the original data within some prescribed tolerance. This interpretation allows us to consider larger privacy parameters. We then design (optimal) DP mechanisms in this large privacy parameter regime. In this work, we combine local privacy protections with central differential privacy to present a practical approach to private model training. Further, we show that these privacy restrictions maintain utility in image classification and language models that is comparable to federated learning without these privacy restrictions.
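
As a rough sketch of the general pattern, each device can clip and noise its update before sending it to the server; the specific relaxed local-DP mechanisms and parameter choices from the talk are not reproduced here:

```python
import numpy as np

def privatize_update(update, clip_norm, sigma, rng):
    """Clip an update vector to clip_norm in L2 and add Gaussian noise on-device,
    so the server only ever sees the noisy update."""
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, sigma, size=clipped.shape)

rng = np.random.default_rng(1)
local_update = rng.normal(size=1000)               # toy high-dimensional update
noisy_update = privatize_update(local_update, clip_norm=1.0, sigma=0.5, rng=rng)
# Many such noisy updates are then aggregated on the server, where a central
# DP analysis can account for the additional averaging across users.
```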

11:00 am – 11:45 am
Speaker: Borja Balle (Amazon)

Differential privacy is a mathematical framework for privacy-preserving data analysis. Changing the hyper-parameters of a differentially private algorithm allows one to trade off privacy and utility in a principled way. Quantifying such trade-off in advance is essential to decision-makers tasked with deciding how much privacy can be provided in a particular application while keeping acceptable utility. Analytical utility guarantees offer a simple tool to reason about this trade-off, but they are generally only available for relatively simple problems. For more complex tasks, including the training of neural networks under differential privacy, the utility achieved by a given algorithm can only be measured empirically. This paper presents a Bayesian optimization methodology for efficiently characterizing the privacy-utility trade-off of any differentially private algorithm using only empirical measurements of its utility. The versatility of our methods is illustrated on a number of machine learning tasks involving multiple models, optimizers, and datasets.
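
A minimal sketch of this kind of loop, using a Gaussian-process surrogate and an expected-improvement rule to pick the next noise multiplier to evaluate (the utility function below is a toy stand-in for actual model training):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def measured_utility(noise_multiplier):
    """Stand-in for an expensive empirical measurement (e.g., test accuracy
    of a model trained with DP-SGD at this noise level)."""
    return 0.9 - 0.3 * noise_multiplier + 0.02 * np.random.randn()

candidates = np.linspace(0.5, 4.0, 100).reshape(-1, 1)
X = np.array([[0.5], [2.0], [4.0]])                     # initial evaluations
y = np.array([measured_utility(x[0]) for x in X])

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, std = gp.predict(candidates, return_std=True)
    best = y.max()
    # expected-improvement acquisition function
    z = (mu - best) / (std + 1e-9)
    ei = (mu - best) * norm.cdf(z) + std * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, measured_utility(x_next[0]))
# Each evaluated noise multiplier maps to an (epsilon, utility) point,
# tracing out an empirical privacy-utility frontier.
```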

11:45 am – 12:30 pm
Speaker: Divesh Srivastava (AT&T)

No abstract available.

2:00 pm – 2:45 pm
Speaker: Dave Sands (Chalmers University of Technology)

Information flow properties, such as differential privacy, are subtle, and systems which are intended to enforce them can be tricky to get right. In this talk I will describe some lessons learned in attempting to derive faithful models of information flow systems, taking examples from our work on the ProPer system, a PINQ-like API which uses personalised privacy budgets and provenance tracking to enforce differential privacy.
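
As a toy illustration of personalized privacy budgets, a curator can track a remaining budget per record owner and exclude records whose owners can no longer afford a query; this is illustrative only and not the ProPer implementation:

```python
import numpy as np

class PersonalizedBudgetCurator:
    """Toy curator tracking a separate privacy budget per record owner.
    Illustrative only; ProPer's actual semantics (and its provenance
    tracking) are more involved."""

    def __init__(self, budgets):
        self.remaining = dict(budgets)       # owner id -> remaining epsilon

    def noisy_count(self, records, epsilon):
        # Only records whose owners can still afford `epsilon` participate.
        usable = [r for r in records if self.remaining.get(r["owner"], 0.0) >= epsilon]
        for r in usable:
            self.remaining[r["owner"]] -= epsilon
        return len(usable) + np.random.laplace(scale=1.0 / epsilon)

curator = PersonalizedBudgetCurator({"alice": 1.0, "bob": 0.2})
records = [{"owner": "alice"}, {"owner": "bob"}]
print(curator.noisy_count(records, epsilon=0.5))    # bob's record is excluded
```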

2:45 pm – 3:30 pm
Speaker: Danfeng Zhang (Pennsylvania State University)

Differential privacy provides a mathematical definition for the privacy loss to individuals when aggregated data is released. Unfortunately, the growing popularity of differential privacy is accompanied by an increase in the development of incorrect algorithms. Hence, formal verification of differential privacy is critical to its success.

In this talk, I will present LightDP, a simple imperative language for proving differential privacy for sophisticated algorithms. LightDP is built on a proof technique, called randomness alignment, which can be automated via a relational type system assuming little annotation in the source code. The type system is powerful enough to verify sophisticated algorithms, such as Sparse Vector, where the composition theorem falls short. Moreover, the type system converts differential privacy into a safety property in a target program where the privacy cost is explicit. This novel feature makes it possible to verify the target program by off-the-shelf verification tools. Finally, I will present a recent extension of LightDP that enables the verification of Report Noisy Max based on relational types. With the extension, we show that various sophisticated algorithms can be type-checked and verified in seconds.
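
For readers unfamiliar with it, the Sparse Vector (AboveThreshold) algorithm mentioned above looks roughly as follows; this is the standard textbook version, not LightDP code:

```python
import numpy as np

def above_threshold(query_answers, threshold, epsilon):
    """Sparse Vector / AboveThreshold: report the index of the first query whose
    noisy answer exceeds a noisy threshold, then halt. Assumes each query has
    sensitivity 1; the whole interaction satisfies epsilon-DP."""
    noisy_threshold = threshold + np.random.laplace(scale=2.0 / epsilon)
    for i, q in enumerate(query_answers):
        if q + np.random.laplace(scale=4.0 / epsilon) >= noisy_threshold:
            return i           # first "above" answer; stop here
    return None                # all queries were below the threshold

answers = [3, 12, 7, 45, 9]    # toy query answers on some dataset
print(above_threshold(answers, threshold=40, epsilon=1.0))
```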

3:30 pm – 4:15 pm
Speaker: Aws Albarghouthi (University of Wisconsin, Madison)

I will describe our recent efforts in automatically verifying correctness and accuracy of differentially private algorithms. I will show how traditional techniques from logic-based software verification can be extended to the probabilistic setting, by synthesizing appropriate axiomatizations of the probabilistic semantics. The result is a powerful technique that can automatically prove the correctness and accuracy of sophisticated algorithms from the differential privacy literature.

4:45 pm – 5:45 pm
Speaker: John Abowd (U.S. Census Bureau; Cornell University), Úlfar Erlingsson (Google Brain), Aleksandra Korolova (University of Southern California), Ian Schmutte (University of Georgia), Alexandra Wood (Harvard University)

No abstract available.

Wednesday, March 6th, 2019

9:00 am – 9:45 am
Speaker: John Friedman (Brown University)

We develop a simple method to reduce privacy loss when disclosing statistics such as OLS regression estimates based on samples with small numbers of observations. We focus on the case where the dataset can be broken into many groups (“cells”) and one is interested in releasing statistics for one or more of these cells. Building on ideas from the differential privacy literature, we add noise to the statistic of interest in proportion to the statistic’s maximum observed sensitivity, defined as the maximum change in the statistic from adding or removing a single observation across all the cells in the data. Although not provably private, our method generally outperforms widely used methods of disclosure limitation such as count-based cell suppression both in terms of privacy loss and statistical bias. We illustrate how the method can be implemented by discussing how it was used to release estimates of social mobility by Census tract in the Opportunity Atlas. We also provide a step-by-step guide and illustrative Stata code to implement our approach.
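
A minimal sketch of the approach as described above, with hypothetical variable names; the actual Opportunity Atlas implementation handles additional details such as scaling the noise by cell size:

```python
import numpy as np

def max_observed_sensitivity(cells, statistic):
    """Maximum change in the statistic from removing any single observation,
    taken over all cells, as described in the abstract."""
    mos = 0.0
    for cell in cells:
        full = statistic(cell)
        for i in range(len(cell)):
            loo = statistic(np.delete(cell, i))   # leave-one-out recomputation
            mos = max(mos, abs(full - loo))
    return mos

def noisy_release(cell, statistic, mos, scale=1.0):
    """Add Laplace-style noise proportional to the maximum observed sensitivity."""
    return statistic(cell) + np.random.laplace(scale=scale * mos)

rng = np.random.default_rng(2)
cells = [rng.normal(50, 10, size=n) for n in (30, 80, 200)]   # toy tract data
statistic = np.mean
mos = max_observed_sensitivity(cells, statistic)
released = [noisy_release(c, statistic, mos) for c in cells]
```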

9:45 am – 10:30 am
Speaker: Li Xiong (Emory University)

From using health data for medical research to using location traces for location-based services, the seemingly different applications have one characteristic in common – the data are spatiotemporally correlated. Such correlations, if not modeled and addressed carefully, challenge the utility of traditional differential privacy mechanisms and even the privacy guarantee of standard definitions. For aggregate health data learning and release, I will present case studies of applying differential privacy for deep learning and computational phenotyping using wearable data and Electronic Health Records (EHRs) considering their spatiotemporal characteristics, and a study quantifying privacy leakage of traditional data release mechanisms under spatiotemporal correlations. For individual location-based services, I will present new privacy notions and mechanisms extending traditional differential privacy for location and spatiotemporal event protection under spatiotemporal correlations. I will conclude the talk with a discussion of open challenges.

11:00 am – 11:45 am
Speaker: Sharad Mehrotra (UC Irvine)

In this talk, I will discuss our experience building a campus-level privacy-preserving smartspace testbed entitled TIPPERS. TIPPERS is an IoT data management middleware that supports a plug-n-play architecture to integrate diverse privacy technologies.  It supports differential privacy, policy-based data processing, secure data management, and privacy preferences of users. I will describe some of the privacy challenges we have identified in building and deploying TIPPERS at the campus level, highlight limitations of existing technologies we have observed, and discuss research on scaling these technologies from demo to real deployment.

11:45 am – 12:30 pm
Speaker: Gillian Raab (University of Edinburgh & Administrative Data Research Centre Scotland)

No abstract available.

2:00 pm – 2:45 pm
Speaker: Aloni Cohen (Massachusetts Institute of Technology)

There is a significant conceptual gap between legal and mathematical thinking around data privacy. The effect is uncertainty as to which technical offerings adequately match expectations expressed in legal standards. The uncertainty is exacerbated by a litany of successful privacy attacks, demonstrating that traditional statistical disclosure limitation techniques often fall short of the sort of privacy envisioned by legal standards.

We define predicate singling out, a new type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR).

Informally, an adversary predicate singles out a dataset X using the output of a data release mechanism M(X) if it manages to find a predicate p matching exactly one row in X with probability much better than a statistical baseline. A data release mechanism that precludes such attacks is secure against predicate singling out (PSO secure).

We argue that PSO security is a mathematical concept with legal consequences. Any data release mechanism that purports to "render anonymous" personal data under the GDPR must be secure against singling out, and hence must be PSO secure. We then analyze PSO security, showing that it fails to self-compose. Namely, a combination of ω(log n) exact counts, each individually PSO secure, enables an attacker to predicate single out. In fact, the composition of just two PSO secure mechanisms can fail to provide PSO security.

Finally, we ask whether differential privacy and k-anonymity are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, k-anonymity does not: there exists a simple and general predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.

2:45 pm – 3:30 pm
Speaker: Christine Task (Knexus Research Corp)

Knexus Research is a small R&D company located in the DC area at National Harbor, MD. We have over 13 years of experience moving research in AI and Data Science through the math, science and engineering steps necessary to make the jump from theory to practice. We have three active projects in data privacy: As technical lead for the NIST Differentially Private Synthetic Data Challenge, Knexus is providing oversight and technical direction for the first national challenge in Differential Privacy. For the US Census Bureau, the Knexus CenSyn team is providing evaluation, research, engineering and production software development support for Census privacy efforts. And, with the DARPA Brandeis Program, Knexus is tackling the problem of developing noise-resistant decision metrics that can operate reliably over privatized data in critical contexts such as crisis detection. At Knexus, our priority is to develop the tools and technologies necessary to facilitate safe, successful public adoption of new research concepts. In this talk we'll describe our ongoing work and share lessons we've learned managing the challenges of data privacy applications.

3:30 pm – 4:15 pm
Speaker: Gerome Miklau (University of Massachusetts Amherst)

Differential privacy protects individuals against privacy harms, but that protection comes at a cost to the accuracy of released data.  We consider settings in which released data is used to decide who (i.e. which groups of people) will receive a desired resource or benefit.  We show that if decisions are made using data released privately, the noise added to achieve privacy may disproportionately impact some groups over others.  Thus, while differential privacy offers equal privacy protection to participating individuals, it may result in disparities in utility across groups, with potentially serious consequences for affected individuals. The talk will explain how disparities in accuracy can be caused by commonly-used privacy mechanisms and highlight some of the social choices that arise in the design and configuration of privacy mechanisms.


Based on joint work with Ashwin Machanavajjhala, Michael Hay, Ryan McKenna, David Pujol, Satya Kuppam.
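
The underlying phenomenon is easy to simulate: the same amount of noise is a rounding error for a large group's count but a large relative error for a small group's, so decisions thresholded on noisy counts hit small groups harder. A toy illustration (not the specific mechanisms analyzed in the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
epsilon = 0.1
true_counts = {"large group": 10000, "small group": 120}

for name, count in true_counts.items():
    noisy = count + rng.laplace(scale=1.0 / epsilon, size=100000)
    rel_err = np.abs(noisy - count) / count
    print(f"{name}: median relative error {np.median(rel_err):.3%}")
# The same Laplace(1/epsilon) noise barely moves the large group's count but is
# a substantial fraction of the small group's, so any allocation rule that
# thresholds on these counts will misclassify small groups more often.
```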

Thursday, March 7th, 2019

9:00 am – 9:45 am
Speaker: Jörg Drechsler (Institute for Employment Research)

While the concept of differential privacy generated a lot of interest not only among computer scientists, but also in the statistical community, actual applications of differential privacy at statistical agencies have been sparse until recently. This changed when the U.S. Census Bureau announced that some of the flagship products of the Bureau (most notably the 2020 Census) will be protected using mechanisms fulfilling the requirements of differential privacy.

A key challenge which needs to be addressed when looking at the problem from a statistical perspective is how to obtain statistically valid inferences from the protected data, i.e. how to take the extra uncertainty from the protection mechanism into account. In this talk I will present some ideas to address this problem. The methodology will be illustrated using administrative data gathered by the German Federal Employment Agency. Detailed geocoding information has been added to this database recently and plans call for making this valuable source of information available to the scientific community. I will discuss which steps are required to generate privacy-protected microdata for this database and illustrate which problems arise in practical settings when following the recommendations in the literature for generating differentially private microdata. I will propose some strategies to overcome these limitations and show how valid inferences for means and totals can be obtained from the protected dataset.
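
As a small illustration of the inference point (a generic sketch, not the geocoding application from the talk), a confidence interval for a mean estimated under Laplace noise should include both the sampling variance and the noise variance 2b²:

```python
import numpy as np

def dp_mean_with_ci(x, epsilon, lower, upper, z=1.96):
    """Noisy mean of values clamped to [lower, upper], with a 95% CI that
    accounts for both sampling variance and the added Laplace noise.
    Sensitivity of the mean is (upper - lower) / n for a fixed, known n."""
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    n = len(x)
    b = (upper - lower) / (n * epsilon)        # Laplace scale
    noisy_mean = x.mean() + np.random.laplace(scale=b)
    var_sampling = x.var(ddof=1) / n           # would itself need protection
    var_noise = 2.0 * b ** 2                   # variance of Laplace(b)
    half_width = z * np.sqrt(var_sampling + var_noise)
    return noisy_mean, (noisy_mean - half_width, noisy_mean + half_width)

income = np.random.lognormal(mean=10, sigma=0.5, size=5000)
print(dp_mean_with_ci(income, epsilon=0.5, lower=0, upper=200_000))
```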

9:45 am – 10:30 am
Speaker: Joshua Snoke (RAND Corporation)

We propose a method to release differentially private synthetic datasets using any parametric synthesizing model. Synthetic data, one of the popular methods in the disclosure control literature and among statistical agencies, releases alternative data values in place of the original ones. We guarantee ε-DP by sampling the parameters for the synthesizing distribution from the exponential mechanism, and we produce synthetic data that maximizes the distributional similarity of the synthetic data relative to the original data using a measure known as the pMSE. The flexibility of the framework allows for a variety of modeling choices given schematic or prior information. We also relax common DP assumptions concerning the distribution and boundedness of the original data, allowing for the release of differentially private data that is unbounded or continuous without additional assumptions. We prove theoretical results for the privacy guarantee, give simulation results for the accuracy of linear regression coefficients, and discuss two computational limitations.
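
The core sampling step has roughly the following shape; this is a generic exponential-mechanism sketch with a placeholder utility score standing in for the paper's pMSE measure and its sensitivity analysis:

```python
import numpy as np

def exponential_mechanism_sample(candidates, utility, epsilon, sensitivity):
    """Sample one candidate parameter value with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = np.array([utility(theta) for theta in candidates])
    logits = epsilon * scores / (2.0 * sensitivity)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]

# Toy example: choose the mean of a Gaussian synthesizer; the utility is a
# placeholder for the (negated) pMSE between original and synthetic data.
original = np.random.normal(5.0, 1.0, size=1000)
candidates = np.linspace(0.0, 10.0, 201)
utility = lambda mu: -abs(original.mean() - mu)        # placeholder score
theta = exponential_mechanism_sample(candidates, utility, epsilon=1.0, sensitivity=1.0)
synthetic = np.random.normal(theta, 1.0, size=1000)
```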

11:00 am – 11:45 am
Speaker: Ashwin Machanavajjhala (Duke University)

With the rising awareness of reconstruction attacks and the limitations of ad hoc disclosure limitation techniques, there is growing interest in deploying differentially private solutions to release summaries over personal data in several domains. This talk will outline a checklist for a successful deployment of differential privacy based on my recent experiences deploying differential privacy, with takeaways for practitioners and open challenges for privacy researchers.

11:45 am – 12:30 pm
Speaker: Natalie Shlomo (University of Manchester)

This talk will start with an overview of the traditional statistical disclosure limitation (SDL) framework implemented at statistical agencies for standard outputs, including types of disclosure risks, how disclosure risk and information loss are quantified, and some common SDL methods. Traditional SDL approaches were developed to protect against the risk of re-identification which is grounded in legislation and statistics acts. In recent years, however, we have seen the digitalization of all aspects of our society leading to new and linked data sources offering unprecedented opportunities for research and evidence-based policies. These developments have put pressure on statistical agencies to provide broader and more open access to their data. On the other hand, with detailed personal information easily accessible from the internet, traditional SDL methods for protecting individuals may no longer be sufficient and this has led to agencies relying more on restricting and licensing data. With increasing demands for more open and accessible data, the disclosure risk of concern for statistical agencies has shifted from the risk of re-identification to inferential disclosure, where confidential information may be revealed exactly or to a close approximation. Statistical agencies are now revisiting their intruder scenarios and types of disclosure risks and assessing new privacy models with more rigorous data protection (perturbative) mechanisms for more open strategies of dissemination. Statisticians are now investigating the possibilities of incorporating Differential Privacy (Dwork, et al 2006) into their SDL framework, especially for web-based dissemination applications where outputs are generated and protected on-the-fly without the need for human intervention to check for disclosure risks. We discuss these dissemination strategies and the potential for Differential Privacy to provide privacy guarantees against inferential disclosure.

2:00 pm – 2:45 pm
Speaker: Micah Altman (Massachusetts Institute of Technology)

In this talk, we describe work in progress that aims to align emerging methods of data protection with research uses. We use the American Community Survey as an exemplar case for examining the range of ways that government data is used for research. We identify the range of research uses by combining evidence of use from multiple sources including research articles; national and local media coverage; social media; and research proposals. We then employ human and computer-assisted coding methods to characterize the range of data analysis methodologies that researchers employ. Then, building on previous work that surveys and characterizes computational and technical controls for privacy, we match these methods to available and emerging privacy and data security controls. Our preliminary analysis suggests that tiered access to government data will be necessary to support current and new research in the social and health sciences.

Joint work with Cavan Capps, Dylan Sam, Zachary Lizee.

2:45 pm – 3:30 pm
Speaker: Helen Nissenbaum (Cornell Tech)

As a philosophical theory, contextual integrity has been persuasive. With some degree of success, it has informed law and policy both directly and indirectly. An open challenge is how well it can be operationalized in practice. This talk will survey a few promising collaborative research projects in social science, theory, and design (systems) that have taken up this challenge.

3:30 pm – 4:15 pm
Speaker: Joshua Baron (DARPA)

No abstract available.

4:45 pm – 5:45 pm
Speaker: Jörg Drechsler (Institute for Employment Research), Frauke Kreuter (University of Maryland), Natalie Shlomo (University of Manchester)

No abstract available.

Friday, March 8th, 2019

9:00 am – 9:45 am
Speaker: Michael Hay (Colgate University)

It is difficult for data analysts to successfully incorporate differential privacy into their applications. Simple techniques are easily implemented but often yield high error rates.  While sophisticated techniques exist, the analyst must not only find them in the vast privacy literature, but implement them carefully or privacy may be lost.

This talk describes efforts to make privacy technology more accessible, including work on benchmarks (DPComp) and privacy platforms (Ektelo and PrivateSQL). Benchmarks like DPComp facilitate comparison of state-of-the-art techniques and illuminate privacy-utility trade-offs, increasing the transparency of privacy algorithms. Ektelo is a framework that allows analysts to author customized workflows from a collection of privacy-vetted "operators" that embody useful design patterns from the literature. PrivateSQL allows analysts to choose a privacy policy appropriate for their multi-relational schema and then write standard SQL queries, which are automatically rewritten to achieve the desired privacy semantics.

9:45 am – 10:30 am
Speaker: Ruoxi Jia (UC Berkeley)

With the deployment of large sensor-actuator networks, Cyber-Physical Systems (CPSs), such as smart buildings, smart grids, and transportation systems, are producing massive amounts of data often in real-time. These data are being used collectively to inform decision-making of the entities that engage with the CPSs. However, the collection and analysis of the data present a privacy risk that needs to be addressed. Moreover, the impact of these systems on people's lives requires us to be particularly mindful of the privacy-utility tradeoff when designing privacy mechanisms.

In this talk, I will share two perspectives on mitigating the privacy issues in CPSs. In the first part of the talk, I will discuss how to fairly compensate people for using their private data. I will formalize the notion of the "data value" and present various efficient algorithms to compute it. In the second part of the talk, I will discuss how to accommodate the high demand for data utility through the design of more sophisticated privacy mechanisms. To that end, I will discuss an approach to modeling privacy loss and utility of sensor data collected from CPSs. I will also illustrate the approach via an example of privacy enhancement in smart buildings.
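
One common formalization of data value is the Shapley value of each data point with respect to a learning task; the Monte-Carlo sketch below is an assumption about the general flavor of such algorithms, not the talk's specific methods:

```python
import numpy as np

def monte_carlo_shapley(points, labels, utility, rounds=50, rng=None):
    """Estimate each point's Shapley value: its average marginal contribution
    to the utility of the training set, over random orderings of the data."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(points)
    values = np.zeros(n)
    for _ in range(rounds):
        order = rng.permutation(n)
        prev = utility(points[order[:0]], labels[order[:0]])   # empty set
        for k, i in enumerate(order):
            cur = utility(points[order[: k + 1]], labels[order[: k + 1]])
            values[i] += cur - prev
            prev = cur
    return values / rounds

rng = np.random.default_rng(4)
X_train = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [3, 3]], 10, axis=0)
y_train = np.repeat([0, 1], 10)
X_test = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [3, 3]], 50, axis=0)
y_test = np.repeat([0, 1], 50)

def utility(X, y):
    """Toy utility: accuracy of a 1-nearest-neighbor rule on a test set."""
    if len(X) == 0:
        return 0.5
    preds = [y[np.argmin(np.linalg.norm(X - x, axis=1))] for x in X_test]
    return float(np.mean(np.asarray(preds) == y_test))

print(monte_carlo_shapley(X_train, y_train, utility, rounds=50, rng=rng))
```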

11:00 am – 11:30 am
Speaker: Arthur Street (CSIRO Data61)

The Australian Bureau of Statistics (ABS) has been ground-breaking over the past several years by providing an online tool for researchers to generate custom contingency tables of Census and other datasets. These tables are confidentialized "on the fly" to maintain privacy. This has saved considerable time and resources compared to manually creating such tables, and has widened access to ABS data. Recently, my team has collaborated with the ABS to extend this work by providing such tables to authorized users via an API, which will lead to greater sharing and use of high-value datasets. This talk will present our approach, how the ABS is using the API, and the challenges involved in extending access to a broader set of users.

11:30 am – 12:00 pm
Speaker: Gian Pietro Farina (University of Buffalo)

No abstract available.

12:00 pm – 12:30 pm
Speaker: Gautam Kamath (Massachusetts Institute of Technology)

We present novel, computationally efficient, and differentially private algorithms for two fundamental high-dimensional learning problems: learning a multivariate Gaussian in R^d and learning a product distribution in {0,1}^d in total variation distance. The sample complexity of our algorithms nearly matches the sample complexity of the optimal non-private learners for these tasks in a wide range of parameters. Thus, our results show that privacy comes essentially for free for these problems, providing a counterpoint to the many negative results showing that privacy is often costly in high dimensions. Our algorithms introduce a novel technical approach to reducing the sensitivity of the estimation procedure that we call recursive private preconditioning, which may find additional applications.

Based on joint work with Jerry Li, Vikrant Singhal, and Jonathan Ullman.
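
As background on the flavor of such estimators, here is a sketch of the basic building block they improve upon: a Gaussian-mechanism mean estimate that assumes the data's scale is known in advance (their recursive private preconditioning removes that assumption):

```python
import numpy as np

def private_mean(X, epsilon, delta, radius):
    """Gaussian-mechanism mean estimate for points assumed to lie in an L2 ball
    of the given radius (points are clipped to it otherwise). Replacing one
    point moves the empirical mean by at most 2 * radius / n in L2.
    The standard calibration below is valid for epsilon < 1."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * radius / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return X.mean(axis=0) + np.random.normal(0.0, sigma, size=d)

X = np.random.normal(loc=2.0, scale=1.0, size=(10000, 5))
print(private_mean(X, epsilon=0.5, delta=1e-6, radius=20.0))
```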

2:30 pm – 3:15 pm
Speaker: Salil Vadhan (Harvard University)

I will pitch and try to rally support for launching a community effort to build a system of tools for enabling privacy-protective analysis of sensitive personal data.  Key among them will be an open-source library of algorithms for generating differentially private statistical releases, vetted and collected from leading researchers in differential privacy, and implemented for easy adoption by custodians of large-scale sensitive data.  The hope is that this will become a standard body of trusted and open-source implementations of differentially private algorithms for statistical analysis and machine learning on sensitive data.  It will magnify the impact of academic research on differential privacy, by providing a channel that brings algorithmic developments to a wide array of practitioners.

3:15 pm – 4:00 pm
Speaker: Aleksandra Korolova (University of Southern California)

No abstract available.