Monday, May 1st, 2017

9:30 am – 10:30 am
One of the core problems of modern statistics and machine learning is to approximate difficult-to-compute probability distributions. This problem is especially important in probabilistic modeling, which frames all inference about unknown quantities as a calculation about a conditional distribution. In this tutorial I review and discuss variational inference (VI), a method that approximates probability distributions through optimization. VI has been used in myriad applications in machine learning and tends to be faster than more traditional methods, such as Markov chain Monte Carlo sampling.
 
This tutorial aims to provide an introduction to VI, a modern view of the field, and an overview of the role that probabilistic inference plays in many central areas of machine learning. First, I review the basics of variational inference. Second, I describe some of the pivotal tools for VI developed in the last few years, such as Monte Carlo gradient estimation, black box variational inference, stochastic variational inference, and variational autoencoders. Finally, I discuss some of the unsolved problems in VI and point to promising research directions.
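For readers who want something concrete, below is a minimal numpy sketch of the score-function (REINFORCE) gradient estimator that underlies black box variational inference, fitting a Gaussian q to a toy unnormalized target. The target, step sizes, and sample counts are illustrative choices of ours, not the tutorial's.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_p(z):
        # Unnormalized toy target: Gaussian with mean 3 and scale 0.5.
        return -0.5 * ((z - 3.0) / 0.5) ** 2

    def log_q(z, mu, log_sigma):
        sigma = np.exp(log_sigma)
        return (-0.5 * ((z - mu) / sigma) ** 2
                - log_sigma - 0.5 * np.log(2 * np.pi))

    mu, log_sigma, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        sigma = np.exp(log_sigma)
        z = mu + sigma * rng.standard_normal(64)     # samples from q
        f = log_p(z) - log_q(z, mu, log_sigma)       # ELBO integrand
        # Score-function gradient: E_q[ (d/dparam log q)(z) * f(z) ].
        g_mu = np.mean((z - mu) / sigma**2 * f)
        g_ls = np.mean((((z - mu) / sigma) ** 2 - 1.0) * f)
        mu, log_sigma = mu + lr * g_mu, log_sigma + lr * g_ls

    print(mu, np.exp(log_sigma))   # should approach 3.0 and 0.5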
11:00 am – 11:45 am
Deep residual networks have been widely adopted for computer vision applications because they exhibit fast training, even for very deep networks.  The key difference between these networks and other feedforward networks is that a residual network with near-zero parameters computes a near-identity function, rather than a near-zero function. We consider mappings of this kind: compositions of near-identity functions. Rather than fixing a finite parameterization, such as fixed-size ReLU or sigmoid layers, we allow arbitrary Lipschitz deviations from the identity map; these might be Lipschitz-constrained sigmoid layers of unbounded width, for instance.  We investigate representational and optimization properties of these near-identity compositions. In particular, we show that a smooth bi-Lipschitz function can be represented exactly as a composition of functions that are Lipschitz-close to the identity, and greater depth allows a smaller Lipschitz constant.  We also consider the optimization problem that arises from regression with a composition of near-identity nonlinear maps. For functional gradients with respect to these nonlinear maps, we show that any critical point of a convex objective in the near-identity region must be a global minimizer.
 
Joint work with Steve Evans and Phil Long.
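As a toy illustration of the near-identity view (our sketch, not the speakers'): a residual block x + eps * g(x) with a Lipschitz g deviates from the identity by roughly eps, so a depth-m composition drifts by at most O(m * eps).

    import numpy as np

    rng = np.random.default_rng(0)

    def block(x, W, eps):
        # f(x) = x + eps * g(x): a Lipschitz deviation of size ~eps
        # from the identity (tanh is 1-Lipschitz).
        return x + eps * np.tanh(x @ W)

    d, depth, eps = 8, 20, 0.05
    Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

    x = rng.standard_normal(d)
    y = x
    for W in Ws:
        y = block(y, W, eps)

    # With eps = 0 every block is exactly the identity; for small eps the
    # depth-m composition stays within O(m * eps) of it.
    print(np.linalg.norm(y - x))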
11:45 am – 12:30 pm

How can we combine the complementary strengths of probabilistic graphical models and neural networks? We compose latent graphical models with neural network observation likelihoods. For inference, we use recognition networks to produce local evidence potentials, then combine them using efficient message-passing algorithms. All components are trained simultaneously with a single stochastic variational inference objective. We use this framework to automatically segment and categorize mouse behavior from raw depth video.
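A toy, scalar-latent illustration of the pattern (a simplification of ours, not the paper's model): an untrained "recognition network" maps each frame to a Gaussian evidence potential, and a Kalman filter, the message-passing algorithm for a linear dynamical system prior, combines the potentials.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "recognition network" (random, untrained): maps each observation
    # to a Gaussian evidence potential (mean, variance) on a scalar latent.
    W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
    def recognize(obs):
        h = np.tanh(W1 @ obs)
        m, log_r = W2 @ h
        return m, np.exp(log_r)

    # Latent linear dynamical system prior: x_t = a * x_{t-1} + noise.
    a, q = 0.9, 0.1

    frames = rng.standard_normal((50, 3))     # stand-in for depth-video frames
    mu, P, means = 0.0, 1.0, []
    for o in frames:
        m, r = recognize(o)                       # local evidence potential
        mu_pred, P_pred = a * mu, a * a * P + q   # message from the prior
        K = P_pred / (P_pred + r)                 # conjugate (Kalman) update
        mu, P = mu_pred + K * (m - mu_pred), (1 - K) * P_pred
        means.append(mu)                          # filtered posterior mean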

2:00 pm – 2:45 pm

For scientific applications involving large and complex data, it is useful to develop probability models and inference methods, but issues arise with computing speed and stability. In such settings accurate uncertainty quantification (UQ) is critical, and common fast approximations often do a poor job of UQ. In this talk, I discuss recent developments in scaling up sampling algorithms to big problems, considering broad classes of approximate and parallelizable MCMC algorithms with corresponding theoretical guarantees. A particular focus will be on data augmentation approaches. A variety of applications will be provided as motivation.

2:45 pm – 3:30 pm

Many empirical questions about machine learning systems are best answered through the eigenvalues of associated matrices.  For example, the Hessian of the loss function for a large deep network gives us insight into the difficulty of optimization and sensitivity to parameters.  The spectra of adjacency and Laplacian matrices of large graphs help us understand the global structure between vertices.  Unfortunately, in the most interesting situations, these matrices are too large to be explicitly instantiated, to say nothing of diagonalizing them directly; rather, they are implicit matrices, which can only be interrogated via matrix-vector products.  In this work I will discuss how several different randomized estimation tricks can be assembled to construct unbiased estimators of the spectra of large implicit matrices via generalized traces.
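The simplest instance of the idea is Hutchinson's randomized trace estimator, which touches the matrix only through matrix-vector products. A minimal numpy sketch (the explicit matrix below is just a stand-in for one too large to form):

    import numpy as np

    rng = np.random.default_rng(0)

    n = 500
    M = rng.standard_normal((n, n))
    M = M @ M.T / n                      # stands in for an implicit matrix

    def matvec(v):
        # The only access we allow ourselves: matrix-vector products.
        return M @ v

    # Hutchinson's estimator: E[z^T M z] = tr(M) for z with i.i.d. +/-1 entries.
    num_probes = 200
    est = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        est += z @ matvec(z)
    est /= num_probes

    print(est, np.trace(M))              # the two should be close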

4:00 pm – 4:45 pm
Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that rapidly explores a target distribution. In tandem, a focus has been on devising scalable variants that use stochastic gradients in the dynamic simulations. However, such stochastic gradient MCMC (SG-MCMC) samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered, since proving convergence in the presence of stochastic gradient noise is nontrivial.
 
In this talk, we first present a general recipe for constructing SG-MCMC samplers that translates the task of finding a valid sampler into one of choosing two matrices.  Importantly, any continuous Markov process sampling from the target distribution can be written in our framework.  We then turn our attention to MCMC techniques based on jump processes, and propose an algorithm akin to Metropolis-Hastings, with the same ease of implementation, but allowing for irreversible dynamics.  Finally, we describe how SG-MCMC algorithms can be applied to problems involving dependent data, where the challenge arises from the need to break the dependencies when considering minibatches of observations.  We propose an algorithm that harnesses the inherent memory decay of the process and provably leads to the correct stationary distribution.
 
We demonstrate our methods on a number of large-scale applications, including a streaming Wikipedia LDA analysis and the segmentation of a lengthy genome sequence and an ion channel recording.
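The simplest member of this family is stochastic gradient Langevin dynamics (roughly, the recipe's choice of an identity diffusion matrix and a zero curl matrix). A minimal numpy sketch on a toy Gaussian posterior, with illustrative step sizes:

    import numpy as np

    rng = np.random.default_rng(0)

    data = rng.normal(2.0, 1.0, size=10_000)   # toy dataset
    N, batch, step = len(data), 100, 1e-5

    def stoch_grad(theta, minibatch):
        # Unbiased stochastic gradient of the log posterior (flat prior,
        # unit-variance Gaussian likelihood), rescaled to the full data size.
        return N * np.mean(minibatch - theta)

    theta, samples = 0.0, []
    for t in range(5000):
        mb = rng.choice(data, size=batch)
        noise = rng.normal(0.0, np.sqrt(2 * step))
        theta += step * stoch_grad(theta, mb) + noise   # SGLD update
        samples.append(theta)

    print(np.mean(samples[1000:]))             # near the posterior mean, ~2.0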

Tuesday, May 2nd, 2017

9:30 am – 10:30 am
Many new theoretical challenges have arisen in the area of gradient-based optimization for large-scale statistical data analysis, driven by the needs of applications and the opportunities provided by new hardware and software platforms. I discuss several recent results in this area, including: (1) a new framework for understanding Nesterov acceleration, obtained by taking a continuous-time, Lagrangian/Hamiltonian perspective, (2) a general theory of asynchronous optimization in multi-processor systems, (3) a computationally-efficient approach to variance reduction, and (4) a primal-dual methodology for gradient-based optimization that targets communication bottlenecks in distributed systems.
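For concreteness, a minimal numpy sketch of Nesterov's accelerated gradient on a quadratic, the discrete scheme whose continuous-time Lagrangian view result (1) concerns; the problem instance and constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Objective f(x) = 0.5 * x^T A x with an ill-conditioned diagonal A.
    A = np.diag(np.linspace(0.01, 1.0, 50))
    grad = lambda x: A @ x
    step = 1.0                                 # 1 / L, L = largest eigenvalue

    x = y = rng.standard_normal(50)
    for k in range(1, 500):
        x_next = y - step * grad(y)            # gradient step from the look-ahead point
        y = x_next + (k - 1) / (k + 2) * (x_next - x)   # momentum extrapolation
        x = x_next

    print(0.5 * x @ A @ x)                     # decays at the accelerated O(1/k^2) rate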
11:00 am – 11:45 am
We introduce the Geodesic Walk for sampling Riemannian manifolds and apply it to the problem of generating uniform random points from the interior of a convex polytope in n-dimensional Euclidean space. The walk is a simulation of a stochastic differential equation defined using a convex barrier function; each step is the solution of an ordinary differential equation. It improves the time complexity of sampling polytopes and is the first walk to achieve sub-quadratic complexity. Part of the talk will be spent introducing relevant aspects of manifolds, stochastic processes, efficient solution of differential equations, and how this walk overcomes the bottlenecks of random walks in Euclidean space. No background in string theory (or Riemannian geometry) will be assumed. 
Joint work with Santosh Vempala.
11:45 am – 12:30 pm
I will describe recent work on algorithms for ensuring generalization when random samples are reused to perform multiple analyses adaptively. I will also discuss connections to the problem of understanding generalization of algorithms for stochastic convex optimization and some challenging open problems.
2:00 pm – 2:45 pm
We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear system identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.
 
Joint work with Tengyu Ma and Ben Recht.
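In symbols, a standard prediction-error formulation of the problem (the notation is ours, not necessarily the talk's):

    % Hidden state h_t, inputs x_t, outputs y_t, with
    %   h_{t+1} = A h_t + B x_t,   y_t = C h_t + D x_t + \xi_t,
    % and \xi_t i.i.d. Gaussian noise. Maximum likelihood over
    % \Theta = (A, B, C, D) then minimizes the non-convex objective
    \[
      f(\Theta) \;=\; \frac{1}{T} \sum_{t=1}^{T}
        \bigl\lVert \hat{y}_t(\Theta) - y_t \bigr\rVert^2,
    \]
    % where \hat{y}_t(\Theta) is the output of the system \Theta driven by
    % the same inputs x_1, \dots, x_t.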
2:45 pm – 3:30 pm
We study the question of what functions can be learned by neural network training algorithms, and prove surprisingly general lower bounds in the realizable case, where the function to be learned can be represented exactly as a neural network. Previous lower bounds apply only for special activation functions or unnatural input distributions, or for highly restricted classes of algorithms. We prove lower bounds for learning a family of functions realizable as small neural networks with a single hidden layer, with any activation function from a broad family (including sigmoid and ReLU) over any "nice" input distribution (e.g., Gaussian or uniform). Our results apply to any statistical query algorithm, including all known neural network training algorithms, regardless of the model architecture, loss function, or optimization routine.
 
Joint work with Le Song, Santosh Vempala, and Bo Xie.
4:00 pm – 4:45 pm
Deep learning is the state of the art in domains such as computer vision and natural language understanding. MXNet is an open-source framework developed through collaboration and contributions of researchers at several universities. It is now the recommended deep learning package at AWS due to its programmability, portability, and efficiency.  It is suitable for deployment from multiple GPUs and multiple machines to embedded systems such as smart phones and embedded GPUs.   This talk will cover (i) MXNet's flexible programming paradigms, (ii) how MXNet enables memory-computation tradeoffs and supports portability, and (iii) how MXNet achieves high efficiency on hundreds of GPUs. I will also demonstrate how you can quickly start using MXNet by leveraging preconfigured Deep Learning AMIs (Amazon Machine Images) and CloudFormation Templates on AWS.
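A small taste of the two programming paradigms, using the MXNet API of the time (imperative NDArray and declarative Symbol); this is a generic sketch of ours, not code from the talk:

    import mxnet as mx

    # Imperative NDArray API: eager and NumPy-like; pass ctx=mx.gpu(0)
    # to place arrays and computation on a GPU.
    a = mx.nd.ones((2, 3))
    b = (a * 2 + 1).asnumpy()        # pull the result back to NumPy

    # Declarative Symbol API: declare the graph first and bind data later,
    # which lets the engine plan memory and schedule across devices.
    data = mx.sym.Variable('data')
    fc1 = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, name='relu1', act_type='relu')
    fc2 = mx.sym.FullyConnected(data=act1, name='fc2', num_hidden=10)
    net = mx.sym.SoftmaxOutput(data=fc2, name='softmax')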

Wednesday, May 3rd, 2017

9:30 am – 10:30 am
We will review multiple ways of acquiring healthcare data and the machine learning methods currently being used for analysis. These include: 1) disease trends and other predictions from electronic health record (EHR) data; 2) mobile apps and devices for improving the granularity of recorded health data; and 3) the combination of mobile app and social network information. Current work in these areas will be presented and the future of machine learning contributions to the field will be discussed.
 
11:00 am – 11:45 am

We present a model for clustering that combines two criteria: given a collection of objects with a pairwise similarity measure, the problem is to find a cluster that is as dissimilar as possible from its complement, while having as much similarity as possible within the cluster. The two objectives are combined either as a ratio or with linear weights. The ratio problem, and its linear weighted version, are solved by a combinatorial algorithm within the complexity of a single minimum s,t-cut algorithm. This problem (HNC) is closely related to the NP-hard problem of normalized cut that is often used in image segmentation and for which heuristic solutions are generated with the eigenvector technique (spectral method).

HNC has been used as a supervised or unsupervised machine learning technique. In an extensive comparative study, HNC was compared to leading machine learning techniques on benchmark datasets.  The study indicated that HNC and other similarity-based algorithms provide robust performance and improved accuracy as compared to other techniques.  Furthermore, the technique of "sparse computation" enables HNC and other similarity-based algorithms to scale to large data sets.
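To sketch the min-cut connection (a simplified variant of ours, not necessarily Hochbaum's exact construction): using the identity 2C(S,S) = sum_{i in S} d_i - C(S, S-bar), the linearly weighted combination of the two criteria reduces to minimizing C(S, S-bar) - lam * sum_{i in S} d_i for some lam, which a single minimum s,t-cut solves once seed nodes pin down the two sides.

    import networkx as nx

    def hnc_linear(similarities, seeds_in, seeds_out, lam):
        """Minimize C(S, S-bar) - lam * sum_{i in S} d_i over sets S containing
        seeds_in and excluding seeds_out, via one minimum s,t-cut.
        `similarities` maps undirected pairs (i, j) to weights w_ij."""
        G, deg = nx.DiGraph(), {}
        for (i, j), w in similarities.items():
            G.add_edge(i, j, capacity=w)      # separating i from j costs w_ij
            G.add_edge(j, i, capacity=w)
            deg[i] = deg.get(i, 0.0) + w
            deg[j] = deg.get(j, 0.0) + w
        for i, d in deg.items():
            # Cutting s->i (leaving i out of S) forfeits lam * d_i.
            G.add_edge('s', i, capacity=lam * d)
        for i in seeds_in:
            G.add_edge('s', i, capacity=float('inf'))   # force i into S
        for i in seeds_out:
            G.add_edge(i, 't', capacity=float('inf'))   # force i out of S
        _, (source_side, _) = nx.minimum_cut(G, 's', 't')
        return source_side - {'s'}

    sims = {(1, 2): 3.0, (2, 3): 3.0, (3, 4): 0.2, (4, 5): 3.0}
    print(hnc_linear(sims, seeds_in={1}, seeds_out={5}, lam=0.1))   # {1, 2, 3}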

11:45 am – 12:30 pm

The development of algorithms for hierarchical clustering has been hampered by a shortage of precise objective functions. To help address this situation, we introduce a simple cost function on hierarchies over a set of points, given pairwise similarities between those points. We show that this criterion behaves sensibly in canonical instances and that it admits a top-down construction procedure with a provably good approximation ratio.
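The cost function in question, as we read it from the accompanying paper (a hedged rendering; see the paper for the precise definition):

    % For a hierarchy T over points with pairwise similarities w_ij:
    \[
      \mathrm{cost}(T) \;=\; \sum_{i<j} w_{ij}\,
        \bigl|\,\mathrm{leaves}\bigl(T[i \vee j]\bigr)\bigr|,
    \]
    % where T[i \vee j] is the subtree rooted at the least common ancestor
    % of i and j. Similar pairs should thus be merged low in the tree: each
    % pair pays its similarity times the size of the smallest cluster
    % containing both points.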

We show, moreover, that this procedure lends itself naturally to an interactive setting in which the user is repeatedly shown snapshots of the hierarchy and makes corrections to these.

2:00 pm – 2:45 pm

Structured prediction is the task of jointly predicting correlated outputs; the core computational challenge is doing so in time that is not exponential in the number of outputs, without making assumptions on the correlation structure (such as linear-chain assumptions). I will describe recent work we have done developing novel algorithms, whose core goal is efficiency, based on an imitation-learning view of the structured prediction problem. I'll provide a unified analysis of a large family of approaches. I'll conclude with some recent work connecting these ideas to sequential models in deep learning, such as recurrent neural networks, and how to modify these approaches for a bandit feedback setting.
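As one concrete member of this family, here is a DAgger-style sketch on a toy sequence-tagging problem: roll in with the current policy, query the expert (here, the true labels) at every visited state, aggregate, and retrain. The task, features, and least-squares learner are all illustrative, not the talk's.

    import numpy as np

    rng = np.random.default_rng(0)
    L = 3                                     # number of tags

    def make_sequence(T=20):
        y = [int(rng.integers(L))]
        for _ in range(T - 1):
            y.append(y[-1] if rng.random() < 0.8 else int(rng.integers(L)))
        y = np.array(y)
        x = np.eye(L)[y] + 0.3 * rng.standard_normal((T, L))  # noisy observations
        return x, y

    def features(x_t, prev_tag):
        # The structured part: features depend on the previous *predicted* tag.
        return np.concatenate([x_t, np.eye(L)[prev_tag]])

    W = np.zeros((2 * L, L))                  # linear one-vs-all scorer
    feats, labels = [], []
    for it in range(5):                       # DAgger iterations
        for _ in range(50):
            x, y = make_sequence()
            prev = 0
            for t in range(len(y)):
                f = features(x[t], prev)
                feats.append(f)
                labels.append(y[t])           # expert (true) label at this state
                prev = int(np.argmax(f @ W))  # roll in with the current policy
        F, Y = np.array(feats), np.eye(L)[np.array(labels)]
        W = np.linalg.lstsq(F, Y, rcond=None)[0]   # retrain on aggregated data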

2:45 pm – 3:30 pm

In learning reductions, you reduce a complex machine learning problem to a simpler one, apply a solver for the simpler problem, and then use its solution to solve the original problem.  Learning reductions have been used to create exponentially more efficient solutions to multiclass classification, contextual bandit learning and exploration, and active learning.   I will discuss several families of learning reductions, their applications, and limits.
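The canonical warm-up is the one-against-all reduction from K-class classification to K binary problems; a minimal sketch with a stand-in binary learner (ridge regression on +/-1 targets), all names illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def binary_learner(X, z):
        # Stand-in binary "oracle": ridge regression on +/-1 targets.
        return np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ z)

    def one_vs_all_train(X, y, K):
        # The reduction: one binary problem per class ("is the label k?").
        return [binary_learner(X, np.where(y == k, 1.0, -1.0)) for k in range(K)]

    def one_vs_all_predict(ws, X):
        return np.argmax(np.column_stack([X @ w for w in ws]), axis=1)

    # Toy 3-class data: Gaussian blobs, plus a bias feature.
    means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in means])
    X = np.hstack([X, np.ones((len(X), 1))])
    y = np.repeat(np.arange(3), 200)

    ws = one_vs_all_train(X, y, 3)
    print(np.mean(one_vs_all_predict(ws, X) == y))   # high train accuracy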

4:00 pm – 5:00 pm

Taking place at Banatao Auditorium, Sutardja Dai Hall

Can machines think? Philosophy and science have long explored this question. Throughout the 20th century, attempts were made to link this question to the latest discoveries: Gödel's theorem, quantum mechanics, undecidability, computational complexity, cryptography, etc. Starting in the 1980s, a long line of work led to the conclusion that many interesting approaches toward achieving AI, even modest ones, were computationally intractable, meaning NP-hard or similar. One could interpret this body of work as a "complexity argument against AI."

But in recent years, empirical discoveries have undermined this argument, as computational tasks hitherto considered intractable turn out to be easily solvable on very large-scale instances. Deep learning is perhaps the most famous example.

This talk revisits the above-mentioned complexity argument against AI and explains why it may not be an obstacle in reality. We survey methods used in recent years to design provably efficient (polynomial-time) algorithms for a host of intractable machine learning problems under realistic assumptions on the input. Some of these can be seen as algorithms to extract semantics or meaning out of data.

Light refreshments will be served before the lecture at 3:30 p.m.

Thursday, May 4th, 2017

9:30 am – 10:30 am

Many big data analytics problems are intrinsically complex and hard, making the design of effective and scalable algorithms very challenging. Domain experts need to perform extensive research, and experiment through much trial and error, in order to craft approximation or heuristic schemes that meet the dual goals of effectiveness and scalability. Very often, restricted assumptions about the data, which are likely to be violated in the real world, are made in order for the algorithms to work and obtain performance guarantees. Furthermore, previous algorithm design paradigms seldom systematically exploit a common trait of real-world problems: instances of the same type of problem are solved repeatedly on a regular basis, differing only in their data. Is there a better way to design effective and scalable algorithms for big data analytics?

I will introduce a framework for addressing this challenge based on the idea of embedding algorithm steps into nonlinear spaces, and learning these embedded algorithms from problem instances via either direct supervision or reinforcement learning. In contrast to traditional algorithm design, where every step in an algorithm is prescribed by experts, the embedding design delegates some difficult algorithm choices to nonlinear learning models so as to avoid large memory requirements, restricted assumptions on the data, or limited exploration of the design space. I will illustrate the benefit of this new design framework using large-scale real-world data, including a materials discovery problem, a recommendation problem over dynamic information networks, and a problem of learning combinatorial algorithms over graphs. The learned algorithms can reduce memory usage and runtime by orders of magnitude, and sometimes result in drastic improvements in predictive performance.
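To give the flavor of embedding algorithm steps, here is a generic structure2vec-style message-passing sketch with random, untrained weights; in the actual framework the parameters would be learned end-to-end, and everything here is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # A small graph as an adjacency matrix, plus per-node input features.
    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    X = rng.standard_normal((4, 3))

    d = 16                                   # embedding dimension
    W1 = rng.standard_normal((3, d)) * 0.1   # would be learned end-to-end
    W2 = rng.standard_normal((d, d)) * 0.1

    mu = np.zeros((4, d))
    for _ in range(4):                       # T rounds of message passing
        mu = np.maximum(0.0, X @ W1 + A @ mu @ W2)   # aggregate neighbor embeddings

    # A learned algorithm (e.g. a greedy heuristic) would now score actions
    # from these embeddings, trained by supervision or reinforcement learning.
    score = mu.sum(axis=1)
    print(score.argmax())                    # e.g. pick the next node greedily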

11:00 am – 11:45 am

We consider the fundamental statistical problem of robustly estimating the mean and covariance of a distribution from i.i.d. samples in n-dimensional space in the presence of arbitrary (malicious) noise, with no assumptions on the noise. Classical solutions (Tukey point, geometric median) are either intractable to compute efficiently or produce estimates whose error scales as a power of the dimension. In this talk, we present efficient and practical algorithms to compute estimates for the mean, covariance, and operator norm of the covariance matrix, for a wide class of distributions. We show that the error of our estimates exceeds the information-theoretic lower bound by a factor of at most the square root of the logarithm of the dimension. This gives polynomial-time solutions to some basic questions studied in robust statistics. One of the applications is agnostic SVD.
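One recurring idea in this line of work, sketched here in a deliberately simplified form (not the talk's actual algorithm): outliers that shift the mean must also inflate the empirical covariance in some direction, so one can repeatedly discard points with extreme projections onto the top eigenvector until the spectrum looks benign.

    import numpy as np

    rng = np.random.default_rng(0)

    n, d, eps = 1000, 20, 0.1
    good = rng.standard_normal((int(n * (1 - eps)), d))      # true mean 0
    bad = rng.standard_normal((int(n * eps), d)) + 5.0       # adversarial shift
    X = np.vstack([good, bad])
    print(np.linalg.norm(X.mean(0)))          # naive mean is off by ~eps*5*sqrt(d)

    for _ in range(10):
        vals, vecs = np.linalg.eigh(np.cov(X.T))
        if vals[-1] < 1.5:                    # spectrum looks Gaussian: stop
            break
        proj = np.abs((X - X.mean(0)) @ vecs[:, -1])
        X = X[proj < np.quantile(proj, 0.95)]   # drop the most extreme 5%

    print(np.linalg.norm(X.mean(0)))          # close to the true mean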

11:45 am – 12:30 pm
We consider the following basic problem: Given corrupted samples from a high-dimensional Gaussian, can we efficiently learn its parameters? This is the prototypical question in robust statistics, a field that took shape in the 1960s with the pioneering works of Tukey and Huber. Unfortunately, all known robust estimators are hard to compute in high dimensions. This prompts the following question: Can we reconcile robustness and computational efficiency in high-dimensional learning?
 
In this talk, I will describe the first efficient algorithms for robustly learning a high-dimensional Gaussian that are able to tolerate a constant fraction of corruptions. More broadly, I will present a set of algorithmic techniques that yield efficient robust estimators for high-dimensional models under fairly general conditions.
2:45 pm – 3:30 pm

I will talk about the polytope learning problem: given random samples from a polytope in n-dimensional space with poly(n) vertices, can we efficiently find an approximation to the polytope? (Information-theoretically, poly(n) samples suffice.) I will discuss some approaches and related conjectures for this problem, and also its relation to other problems in learning theory.

Friday, May 5th, 2017

9:30 am – 10:30 am
Classical experimental design problems in statistics tend to have combinatorial solutions. We consider computationally tractable methods for the experimental design problem, where k out of n design points of dimension p are selected so that certain optimality criteria are approximately satisfied. We prove a constant approximation ratio under the very weak condition that k > 2p, and a (1 + ε) relative approximation ratio under slightly stronger conditions in which k is still a linear function of p. Numerical results on both synthetic and real-world design problems verify the practical effectiveness of the proposed algorithm.
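To make the combinatorial problem concrete, here is a simple greedy heuristic for the D-optimality criterion (maximize log det of the information matrix); this illustrates the problem only and carries none of the paper's guarantees.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p, k = 200, 5, 12
    V = rng.standard_normal((n, p))           # candidate design points
    A = 1e-6 * np.eye(p)                      # tiny ridge keeps early determinants finite
    chosen = []

    for _ in range(k):
        # Pick the point whose addition most increases log det(A + v v^T).
        gains = [np.linalg.slogdet(A + np.outer(v, v))[1] for v in V]
        for i in chosen:
            gains[i] = -np.inf                # selection without replacement
        best = int(np.argmax(gains))
        chosen.append(best)
        A += np.outer(V[best], V[best])

    print(sorted(chosen), np.linalg.slogdet(A)[1])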
11:00 am – 11:45 am

We consider the problems of estimation, learning, and optimization over a large dataset of which a subset consists of points drawn from the distribution of interest, and we make no assumptions on the remaining points: they could be well behaved, extremely biased, or adversarially generated.  We investigate this question via two models for studying robust estimation, learning, and optimization.  One of these models, which we term the "semi-verified" model, assumes access to a second, much smaller (typically constant-sized) set of "verified" or trusted data points that have been drawn from the distribution of interest. The key question in this model is how a tiny but trusted dataset can allow for the accurate extraction of the information contained in the large but untrusted dataset.  The second model, "list-decodable learning", considers the task of returning a small list of proposed answers.  Underlying this model is the question of whether the structure of "good" datapoints can be overwhelmed by the remaining data; surprisingly, the answer is often "no".   We present several strong algorithmic results for these models, for a large class of mathematically clean and practically relevant robust estimation and learning tasks.

The talk is based on several joint works with Jacob Steinhardt and Moses Charikar, and with Michela Meister.

11:45 am – 12:30 pm

In the age of big data, communication is sometimes more of a bottleneck than computation. The main model of distributed computation used for machine learning on big data is the map-reduce model, which gives the programmer a transparent way to scale up computation to a large number of processors.  One drawback of the map-reduce model is its reliance on boundary synchronization and on shuffling. Both of these mechanisms create a communication bottleneck that becomes more severe the more computers are added to the cluster. In this talk I propose a new model of computation which I call “tell me something new” (TMSN). TMSN is an asynchronous model of computation which optimizes communication. Rather than communicating at times dictated by a common clock, each processor broadcasts information when it has something important to say. I show how TMSN can be used to distribute boosting, K-means++, and stochastic gradient descent. I will also show that TMSN is a natural fit for adaptive sensor networks.
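A toy simulation of the flavor of such a protocol (our reading, not the talk's actual algorithm): workers run local SGD on their shards and broadcast a model delta only when it is large enough to be "something new".

    import numpy as np

    rng = np.random.default_rng(0)

    # Shared least-squares problem, with the data sharded across workers.
    d, workers, tau = 10, 4, 0.5
    w_true = rng.standard_normal(d)
    shards = []
    for _ in range(workers):
        X = rng.standard_normal((200, d))
        shards.append((X, X @ w_true + 0.1 * rng.standard_normal(200)))

    global_w = np.zeros(d)
    local = [global_w.copy() for _ in range(workers)]
    broadcasts = 0
    for step in range(2000):
        i = int(rng.integers(workers))        # a worker wakes up (no common clock)
        X, y = shards[i]
        j = int(rng.integers(len(y)))
        local[i] -= 0.01 * (local[i] @ X[j] - y[j]) * X[j]   # local SGD step
        if np.linalg.norm(local[i] - global_w) > tau:        # "something new"?
            global_w = local[i].copy()        # broadcast; others resynchronize
            local = [global_w.copy() for _ in range(workers)]
            broadcasts += 1

    print(broadcasts, np.linalg.norm(global_w - w_true))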

Monday, September 11th, 2017

9:30 am – 10:30 am
Speaker: Aravind Srinivasan, University of Maryland

Dependent rounding, originating from the "pipage rounding" work of Ageev and Sviridenko, is now a fundamental algorithmic approach in combinatorial optimization. We survey this area, with an emphasis on the types of correlation inequalities that are achievable/desirable, and applications thereof. Topics covered include connections with matroids, matroid intersection, and scheduling, as well as how to handle situations where some amount of positive correlation is inevitable: the last part is recent joint work with David Harris, Thomas Pensyl, and Khoa Trinh, and arises in facility-location problems.
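The basic pipage-style primitive behind dependent rounding, in a minimal numpy sketch: repeatedly pick two fractional coordinates and shift mass between them, randomized so that every marginal is preserved in expectation while the sum stays fixed. Variants of this primitive yield the correlation properties the talk surveys.

    import numpy as np

    rng = np.random.default_rng(0)

    def dependent_round(x, tol=1e-9):
        """Pipage-style dependent rounding: E[output_i] = x_i for every i,
        and sum(x) is preserved throughout every step."""
        x = np.array(x, dtype=float)
        while True:
            frac = np.where((x > tol) & (x < 1 - tol))[0]
            if len(frac) < 2:
                break
            i, j = frac[:2]
            a = min(1 - x[i], x[j])      # how far mass can shift j -> i
            b = min(x[i], 1 - x[j])      # how far mass can shift i -> j
            if rng.random() < b / (a + b):   # probabilities keep marginals exact
                x[i], x[j] = x[i] + a, x[j] - a
            else:
                x[i], x[j] = x[i] - b, x[j] + b
        return np.round(x)

    print(dependent_round([0.5, 0.5, 0.7, 0.3]))  # always rounds to sum 2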