Monday, July 15th, 2019

9:00 am – 9:45 am
Speaker: Peter Bartlett (UC Berkeley)

Classical theory that guides the design of nonparametric prediction methods like deep neural networks involves a tradeoff between the fit to the training data and the complexity of the prediction rule. Deep learning seems to operate outside the regime where these results are informative, since deep networks can perform well even with a perfect fit to noisy training data. We investigate this phenomenon of 'benign overfitting' in the simplest setting, that of linear prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of effective rank of the data covariance. It shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. We discuss implications for deep networks and for robustness to adversarial examples.

Joint work with Phil Long, Gábor Lugosi, and Alex Tsigler.
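
As a rough illustration of the object studied in this talk, the minimum norm interpolating prediction rule is the least-norm linear predictor among all those that fit the training data exactly. The sketch below is a toy setup of my own, not the paper's; whether the resulting test error is benign depends on the covariance spectrum, which is what the effective-rank characterization makes precise.

```python
# Hedged sketch: the minimum-norm interpolating linear predictor in an
# overparameterized regression problem (toy setup, not the paper's).
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 1000                                   # many more parameters than samples
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[0] = 1.0                               # one important direction
y = X @ theta_star + 0.1 * rng.normal(size=n)     # noisy labels

# Minimum-norm solution among all interpolators: theta_hat = X^+ y
theta_hat = np.linalg.pinv(X) @ y
assert np.allclose(X @ theta_hat, y)              # perfect fit to the noisy training data

# Prediction error on fresh data from the same distribution
X_test = rng.normal(size=(10_000, d))
mse = np.mean((X_test @ theta_hat - X_test @ theta_star) ** 2)
print(f"test MSE of the interpolating predictor: {mse:.3f}")
```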

9:45 am – 10:30 am
Speaker: Tengyu Ma (Stanford University)

Sample efficiency is a major challenge of applying deep reinforcement learning (RL) techniques to robotics tasks --- existing algorithms often require a massive number of interactions with the environment (samples). Promising solutions include model-based reinforcement learning and imitation learning (IL). However, the theoretical understanding of these methods is largely missing in the setting of continuous, high-dimensional state spaces and neural network function approximators.

In this talk, I will present recent work on designing principled model-based RL and IL algorithms with theoretical analyses. These algorithms also empirically outperform prior work on benchmark tasks in sample efficiency.

No prior knowledge of deep reinforcement learning and imitation learning is required. Based on joint works with Nick Landolfi, Yuping Luo, Garrett Thomas, Huazhe Xu, Trevor Darrell, and Yuandong Tian.

11:00 am – 11:45 am
Speaker: Nadav Cohen (Institute for Advanced Study)

Understanding deep learning calls for addressing two questions: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training problems that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details of the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the specific trajectories taken by the optimizer. I will then demonstrate a manifestation of this approach, analyzing the trajectories of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to a global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such an addition of layers induces an implicit bias towards low rank, and thereby explains the generalization of deep linear neural networks on the classic problem of low-rank matrix recovery.

Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu and Yuping Luo.

11:45 am – 12:30 pm
Speaker: Mikhail Belkin (The Ohio State University)

"A model with zero training error is overfit to the training data and will typically generalize poorly," goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. As I will discuss in the talk, this apparent contradiction is key to understanding the practice of modern machine learning.

While classical methods rely on a trade-off balancing the complexity of predictors with training error, modern models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain (implicit or explicit) inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the classical U-shaped bias-variance curve beyond the point of interpolation. This understanding of model performance delineates the limits of the usual "what you see is what you get" generalization bounds in machine learning and points to new analyses required to understand computational, statistical, and mathematical properties of modern models.

I will proceed to discuss some important implications of interpolation for optimization, both in terms of "easy" optimization (due to the scarcity of non-global minima) and in terms of the fast convergence of small mini-batch SGD with a fixed step size.
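
A hedged numerical sketch of the double descent shape described above, using random ReLU features and a minimum-norm least-squares fit (all details here are my own choices, not the speaker's): as the number of features p passes the interpolation threshold p ≈ n, the test error typically spikes and then descends again.

```python
# Hedged sketch of a "double descent" experiment with random ReLU features
# (illustrative toy setup; the talk's precise experiments may differ).
import numpy as np

rng = np.random.default_rng(1)
n, d, n_test = 100, 20, 2000

def target(X):
    return np.sin(X @ np.ones(d) / np.sqrt(d))

X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y_tr = target(X_tr) + 0.1 * rng.normal(size=n)
y_te = target(X_te)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:      # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F_tr, F_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr                  # min-norm least-squares fit
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {err:.3f}")
```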

2:30 pm – 3:15 pm
Speaker: Yuanzhi Li (Stanford University)

What concept classes can neural networks provably learn, in the distribution-free PAC learning setting? Recently, a sequence of works has related the learning process of over-parametrized neural networks to kernel methods, and in particular to the neural tangent kernel. These results establish that neural networks can at least PAC-learn the concept classes learnable by kernel methods. In this talk, I will discuss some recent progress that goes beyond this approach, where one can show that over-parametrized neural networks provably PAC-learn some concept classes with (provably) better generalization than any kernel method.
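
For reference, here is a hedged reminder of the standard definition of the neural tangent kernel mentioned above (notation my own).

```latex
% Neural tangent kernel of a network f(x; \theta) linearized around its
% initialization \theta_0 (standard definition; notation mine):
\[
  K_{\mathrm{NTK}}(x, x')
  \;=\;
  \big\langle \nabla_\theta f(x; \theta_0),\; \nabla_\theta f(x'; \theta_0) \big\rangle ,
\]
% so that training in the "kernel regime" behaves like kernel regression with K_NTK.
```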

3:15 pm – 4:00 pm
Speaker: Hanie Sedghi (Google Brain)

We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss, and the distance from the weights to the initial weights. They are independent of the number of pixels in the input and of the height and width of hidden feature maps. We present experiments with CIFAR-10 and a scaled-down variant, along with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps.
 

Tuesday, July 16th, 2019

9:00 am – 9:45 am
Speaker: Jason Lee (University of Southern California)

Deep Learning has had phenomenal empirical successes in many domains including computer vision, natural language processing, and speech recognition. To consolidate and boost the empirical success, we need to develop a more systematic and deeper understanding of the elusive principles of deep learning.

In this talk, I will provide an analysis of several elements of deep learning, including non-convex optimization, overparametrization, and generalization error. First, we show that gradient descent and many other algorithms are guaranteed to converge to a local minimizer of the loss. For several interesting problems, including matrix completion, this guarantees that we converge to a global minimum. Then we will show that gradient descent converges to a global minimizer for deep overparametrized networks. Finally, we analyze the generalization error by showing that a subtle combination of SGD, logistic loss, and architecture promotes large-margin classifiers, which are guaranteed to have low generalization error.

 

9:45 am – 10:30 am
Speaker: Matus Telgarsky (University of Illinois, Urbana-Champaign)

This talk will survey old connections and also shed new light on the ability of margin maximization methods not only to minimize certain losses, such as the exponential loss, but moreover to output margin-maximizing solutions.

Joint work with Ziwei Ji.
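
One representative statement of the kind surveyed here, given informally with my own notation and with technical conditions omitted: on linearly separable data, gradient descent on the exponential loss drives the parameter norm to infinity while the direction converges to the maximum-margin separator.

```latex
% Informal statement (hedged; conditions and rates omitted, notation mine).
% For linearly separable data \{(x_i, y_i)\}_{i=1}^n and gradient descent on
% \mathcal{L}(w) = \sum_{i=1}^n \exp(-y_i\, w^\top x_i), the iterates satisfy
\[
  \lim_{t \to \infty} \frac{w(t)}{\|w(t)\|_2} = \frac{\hat w}{\|\hat w\|_2},
  \qquad
  \hat w \;=\; \arg\min_{w} \|w\|_2
  \quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 \ \text{ for all } i .
\]
```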

11:00 am – 11:45 am
Speaker: Rong Ge (Duke University)

Mode connectivity (Garipov et al., 2018; Draxler et al., 2018) is a surprising phenomenon in the loss landscape of deep nets. Optima—at least those discovered by gradient-based optimization—turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. In this talk we will give mathematical explanations for this phenomenon. In particular, we show that although in many settings not all optima are connected, typical optima that satisfy nice properties (dropout stability and noise stability) are connected. These properties have previously been identified as part of understanding the generalization properties of deep nets.

Based on joint work with Rohith Kuditipudi, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu and Sanjeev Arora.
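
The experiment underlying this phenomenon is easy to describe: evaluate the training loss along a low-dimensional path between two independently trained solutions. A minimal hedged sketch follows (the toy loss and midpoint choice are my own; in practice the midpoint of the two-segment path is itself optimized).

```python
# Hedged sketch: loss along a two-segment piece-wise linear path between two
# sets of trained weights theta_a and theta_b (toy stand-in loss; in real
# mode-connectivity experiments the midpoint theta_m is optimized).
import numpy as np

def loss(theta):
    return float(np.mean(theta ** 2))             # stand-in for a training loss

def path_losses(theta_a, theta_m, theta_b, n_points=11):
    """Losses along theta_a -> theta_m -> theta_b, linear within each segment."""
    ts = np.linspace(0.0, 1.0, n_points)
    first = [loss((1 - t) * theta_a + t * theta_m) for t in ts]
    second = [loss((1 - t) * theta_m + t * theta_b) for t in ts]
    return first + second

theta_a, theta_b = np.random.randn(10), np.random.randn(10)
theta_m = 0.5 * (theta_a + theta_b)               # placeholder midpoint
print(path_losses(theta_a, theta_m, theta_b))
```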

11:45 am – 12:30 pm
Speaker: Ben Recht (UC Berkeley)

Conventional wisdom in machine learning taboos training on the test set, interpolating the training data, and optimizing to high precision. This talk will present evidence demonstrating that this conventional wisdom is wrong. I will additionally highlight commonly overlooked phenomena that imperil the reliability of current learning systems: surprising sensitivity to how data is generated and significant diminishing returns in model accuracy given increased compute resources. I will close with a discussion of how new best practices to mitigate these effects are critical for truly robust and reliable machine learning.

2:30 pm – 3:15 pm
Speaker: Yasaman Bahri (Google Brain)

Modern deep neural networks are seemingly complex systems whose design and performance are at present not well understood. A first step towards demystifying their complexity is to identify limits in the system variables that are relevant to practice, and to study them theoretically if possible. Overparameterization in width is one such limit, with practice suggesting that wider models often do better than their smaller-sized counterparts. I review our past work that studies the limit of infinitely wide networks. This includes connecting the prior over functions in deep neural networks with Gaussian processes for fully-connected and convolutional architectures; we further consider Bayesian inference in such networks and comparisons to gradient-descent trained finite-width networks. I’ll also discuss gradient dynamics in such infinitely wide networks and their equivalence to models that are linear in their parameters. Time permitting, I’ll briefly mention work in progress studying the behavior of networks close to and far from such “kernel” limits.
 

3:15 pm – 4:00 pm
Speaker: Richard Baraniuk (Rice University)

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings.
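
The "affine transformation conditioned on the input" statement is easy to check for a small ReLU network. A hedged toy sketch (sizes, weights, and the two-layer architecture are my own choices) extracts the affine map that the network computes on a given input's activation region.

```python
# Hedged sketch: for a ReLU network, the output conditioned on a given input x
# is an affine function of x, i.e. A @ x + b with A, b determined by the active
# ReLU pattern at x. Toy two-layer example; notation and sizes are mine.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 8)), rng.normal(size=32)
W2, b2 = rng.normal(size=(3, 32)), rng.normal(size=3)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_affine_map(x):
    """Affine map (A, b) the network computes on x's activation region."""
    mask = (W1 @ x + b1 > 0).astype(float)        # active ReLU pattern at x
    A = W2 @ (mask[:, None] * W1)
    b = W2 @ (mask * b1) + b2
    return A, b

x = rng.normal(size=8)
A, b = local_affine_map(x)
assert np.allclose(forward(x), A @ x + b)         # same output, affine form
```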

4:30 pm – 5:15 pm
Speaker: Qiang Liu (University of Texas at Austin)

Efficient and automatic optimization of neural network structures is a key challenge in modern deep learning. Compared with parameter optimization, which has been well addressed by (stochastic) gradient descent, optimizing model structures involves significantly more challenging combinatorial optimization, with large search spaces and expensive evaluation functions. Although there has been rapid progress recently, designing the best architectures still requires a lot of expert knowledge or expensive black-box optimization approaches (including reinforcement learning).

This work extends the power of gradient descent to the domain of model structure optimization. In particular, we consider the problem of progressively growing a neural network by “splitting” existing neurons into several “off-spring” neurons, and develop a simple and practical approach for deciding the best subset of neurons to split and how to split them, adaptively based on the existing structure. Our approach is derived by viewing structure optimization as a functional optimization in a space of distributions, such that our splitting strategy is a second-order functional steepest descent for escaping saddle points in an infinite-Wasserstein metric space, while standard parametric gradient descent is the first-order functional steepest descent. Our method provides a very fast approach for finding efficient and compact neural architectures, especially in resource-constrained settings where smaller models are preferred due to inference-time and energy constraints.
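
To make the splitting operation concrete, here is a hedged toy sketch of one way to split a single hidden neuron into two off-spring neurons so that the network function is approximately preserved at the moment of the split; the talk's actual contribution is the second-order criterion for choosing which neurons to split and along which direction delta, which is not implemented here.

```python
# Hedged toy sketch of splitting one hidden neuron into two off-spring neurons
# (my own illustration, not the paper's algorithm). For a smooth activation the
# network function is preserved up to O(eps^2); the splitting direction `delta`
# would be chosen by the second-order criterion described in the talk.
import numpy as np

def split_neuron(W_in, w_out, idx, delta, eps=1e-2):
    """Duplicate hidden neuron `idx` with +/- eps*delta perturbed input weights
    and halve its outgoing weight between the two copies."""
    w_plus = W_in[idx] + eps * delta
    w_minus = W_in[idx] - eps * delta
    W_in_new = np.vstack([np.delete(W_in, idx, axis=0), w_plus, w_minus])
    w_out_new = np.concatenate([np.delete(w_out, idx),
                                [w_out[idx] / 2.0, w_out[idx] / 2.0]])
    return W_in_new, w_out_new

# toy one-hidden-layer network with 4 hidden neurons and 3 inputs
rng = np.random.default_rng(0)
W_in, w_out = rng.normal(size=(4, 3)), rng.normal(size=4)
W_in_new, w_out_new = split_neuron(W_in, w_out, idx=1, delta=rng.normal(size=3))
print(W_in_new.shape, w_out_new.shape)            # (5, 3) and (5,)
```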

Wednesday, July 17th, 2019

9:00 am – 9:45 am
Speaker: Zico Kolter (Carnegie Mellon University)

A great deal of recent work has proposed methods for "provable" or "certified" adversarial defenses: methods that guarantee that a classifier will not change its prediction given small perturbations to the input. However, one major line of work in this area, based upon linear programming relaxations and duality (or, equivalently, propagating uncertainty bounds through the network), has seemingly reached a plateau in performance, unable to train/verify networks at larger scales. In this talk, I will assess some of the reasons why this limit has been reached, and then highlight a simple alternative approach that does scale to much larger classifiers: randomized smoothing. I will present a simple overview of randomized smoothing techniques for adversarial robustness, and how they can (somewhat counterintuitively) lead to worst-case bounds rather than just average-case bounds. I will conclude with a discussion of the limitations of randomized smoothing, and whether there exists any possibility to combine these two paradigms in adversarial robustness.
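
At prediction time, randomized smoothing is simple enough to sketch in a few lines. The hedged toy example below (the base classifier, noise level, and sample count are all my own choices) follows the general Monte Carlo recipe: classify many Gaussian-noised copies of the input and return the majority vote.

```python
# Hedged sketch of prediction with a randomly smoothed classifier
# g(x) = argmax_c P[ f(x + noise) = c ], noise ~ N(0, sigma^2 I).
# Toy base classifier and constants are mine; certifying a robust radius
# additionally requires confidence bounds on the vote proportions.
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    preds = np.array([base_classifier(x + eps) for eps in noise])
    classes, counts = np.unique(preds, return_counts=True)
    return classes[np.argmax(counts)], counts.max() / n_samples

f = lambda z: int(z.sum() > 0)                    # toy base classifier on 2-d inputs
print(smoothed_predict(f, np.array([0.3, -0.1])))
```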
 

9:45 am – 10:30 am
Speaker: Aleksander Madry (MIT)

The widespread susceptibility of current ML models to adversarial perturbations is an intensely studied but still mystifying phenomenon. A popular view is that these perturbations are aberrations that arise due to statistical fluctuations in the training data and/or the high-dimensional nature of our inputs.

But is this really the case?

In this talk, I will present a new perspective on the phenomenon of adversarial perturbations. This perspective ties this phenomenon to the existence of "non-robust" features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans. Such patterns turn out to be prevalent in our real-world datasets and also shed light on previously observed phenomena in adversarial robustness, including transferability of adversarial examples and properties of robust models. Finally, this perspective suggests that we may need to recalibrate our expectations in terms of how models should make their decisions, and how we should interpret them.
 

11:00 am – 11:45 am
Speaker: Jerry Li (Microsoft Research)

Defenses against adversarial attacks on neural networks can broadly be divided into two categories: empirical and certifiable. Empirical defenses perform well against existing attacks, but we do not know how to prove that they cannot be broken. In the past couple of years, there has been a slew of empirical defenses, many of which have subsequently been broken. A notable exception is adversarial training (Madry et al 2018), variants of which are still the state of the art for empirical defenses. Certifiable defenses aim to avoid this arms race of defenses and attacks by providing provable guarantees that the network is robust to adversarial attacks. One promising certifiable defense is randomized smoothing (Lecuyer et al 2018, Li et al 2018, Cohen et al 2018); however, its certified numbers still lag behind those achieved by empirical defenses.

In this work we demonstrate that by combining adversarial training with randomized smoothing, we are able to substantially boost the provable robustness of the resulting classifier. We derive an attack against smoothed neural network classifiers, and we train the neural network using this attack, via the adversarial training paradigm. While adversarial training by itself typically gives no provable guarantees, we demonstrate that the certified robustness of the resulting classifier is substantially higher than the previous state of the art, improving it by up to 16% for ell_2 perturbations on CIFAR-10 and by 10% on ImageNet. By combining this with other ideas such as pre-training (Hendrycks et al 2019), we are able to improve these numbers further, establishing the state of the art for ell_2 provable robustness.
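
For context, the kind of guarantee being certified here is a radius around each input within which the smoothed classifier's prediction provably cannot change. Stated informally below, hedged, with my own notation, following the Gaussian-smoothing analysis of Cohen et al.

```latex
% Informal certified-radius statement for Gaussian randomized smoothing
% (hedged; notation mine). If, under noise N(0, \sigma^2 I) added to x, the top
% class is predicted with probability at least p_A and every other class with
% probability at most p_B, then the smoothed classifier's prediction at x is
% unchanged for all perturbations of \ell_2 norm less than
\[
  R \;=\; \frac{\sigma}{2}\left( \Phi^{-1}(p_A) - \Phi^{-1}(p_B) \right),
\]
% where \Phi^{-1} is the inverse of the standard Gaussian CDF.
```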

11:45 am – 12:30 pm
Speaker: Bin Yu (UC Berkeley)

In this talk, I'd like to discuss two projects on interpreting DNNs. The first project proposes the DeepTune framework as a way to elicit interpretations of DNN-based models of single neurons in the difficult primate visual cortex area V4. Using DNN-based features, DeepTune images combine 18 accurately predictive regression models through a stability criterion. They provide characterizations of 71 V4 neurons and data-driven stimuli for closed-loop experiments.

The second project introduces agglomerative contextual decomposition (ACD) for hierarchical interpretations of DNN predictions. Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.

2:30 pm – 3:15 pm
Speaker: Been Kim (Google Brain)

In this talk, I hope to reflect on some of the progress made in the field of interpretable machine learning. We will reflect on where we are going as a field, and on what we need to be aware of and careful about as we make progress. With that perspective, I will then discuss some of my recent work on 1) sanity checking popular methods and 2) developing interpretability methods that are friendlier to lay people.

3:15 pm – 4:00 pm
Speaker: Nicholas Carlini (Google Brain)

Several hundred papers have been written over the last few years proposing defenses to adversarial examples. Unfortunately, most proposed defenses to adversarial examples are quickly broken.

This talk surveys the ways in which defenses to adversarial examples have been broken in the past, and what lessons we can learn from these breaks. Beginning with a discussion of common evaluation pitfalls when performing the initial analysis, I then provide recommendations for how we can perform more thorough defense evaluations. I conclude with a discussion on recent directions in adversarial robustness research and promising future directions for defenses.

4:30 pm – 5:30 pm

No abstract available.

Thursday, July 18th, 2019

9:00 am – 9:45 am
Speaker: Zack Lipton (Carnegie Mellon University)

We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. However, ML systems, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. Faced with distribution shift, we wish (i) to detect and (ii) to quantify the shift, and (iii) to correct our classifiers on the fly—when possible. This talk will describe a line of recent work on tackling distribution shift. First, I will focus on recent work on label shift, a more classic problem, where strong assumptions enable principled methods. Then I will discuss how recent tools from generative adversarial networks have been appropriated (and misappropriated) to tackle dataset shift—characterizing and (partially) repairing a foundational flaw in the method.
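
One concrete instance of the "strong assumptions enable principled methods" point for label shift is worth sketching. Under label shift, the class-conditional input distribution is fixed while the label marginals change, so a black-box classifier's confusion statistics on source data, together with its prediction distribution on unlabeled target data, determine the importance weights. The hedged toy sketch below is in the spirit of black-box shift estimation; the numbers and variable names are my own.

```python
# Hedged sketch in the spirit of black-box shift estimation for label shift:
# under label shift, C w = q, where C[i, j] = P_source(f(x) = i, y = j),
# q[i] = P_target(f(x) = i), and w[j] = p_target(y = j) / p_source(y = j).
# Toy numbers; a real implementation estimates C and q from finite samples.
import numpy as np

def estimate_shift_weights(confusion, target_pred_dist):
    """Solve C w = q for the per-class importance weights w."""
    w = np.linalg.solve(confusion, target_pred_dist)
    return np.clip(w, 0.0, None)                  # weights must be non-negative

C = np.array([[0.30, 0.02, 0.01],                 # joint dist of (prediction, label) on source
              [0.03, 0.25, 0.02],
              [0.02, 0.03, 0.32]])
q = np.array([0.50, 0.20, 0.30])                  # prediction distribution on target data
print(estimate_shift_weights(C, q))               # reweight source examples by w[y]
```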

9:45 am – 10:30 am
Speaker: Claire Tomlin (UC Berkeley)

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. The key challenge in robot navigation is safely and efficiently operating in environments which are unknown ahead of time, and can only be partially observed through sensors on the robot. In this talk, we present our work in coupling learning-based perception with model-based control. The learning-based perception module is trained using optimal control, and it produces a series of waypoints that guides the robot to the goal via a collision free path. These waypoints are then used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel environments as compared to purely mapping-based or end-to-end learning-based alternatives. We conclude with some lessons learned in using learning-based perception in a feedback control loop.

This is joint work with Somil Bansal, Varun Tolani, Saurabh Gupta, Andrea Bajcsy, and Jitendra Malik.

11:00 am – 11:45 am
Speaker: Suriya Gunasekar (Toyota Technological Institute at Chicago)

A recent line of work studies overparametrized neural networks in the “kernel” regime, i.e. when during training the network behaves as a kernelized linear predictor, and thus training with gradient descent boils down to minimizing an RKHS norm. This stands in contrast to other studies that demonstrated how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are very different from RKHS norms. In this talk, I will overview recent results to illustrate the reasons for these seeming contradictions. The goal of the talk is to partially resolve the conditions under which training overparameterized models using gradient descent exhibits “kernel”-like behavior versus “deep” behavior leading to models different from minimum-RKHS-norm models.

11:45 am – 12:30 pm
Speaker: Maithra Raghu (Cornell University & Google Brain)

In almost all applications of deep learning, transfer learning -- where a deep neural network is pretrained on a first task, and then finetuned on a different, target task -- plays a crucial role. This is especially the case in medical imaging applications, where the de-facto method is to use a standard large model from natural image datasets (e.g. ImageNet) and corresponding pretrained weights. However, the effects of transfer learning are poorly understood, with several recent papers in the natural image setting challenging commonly believed intuitions. For medical applications, fundamental differences between data and task specifications mean that these questions and many others remain unexplored. In this talk, I survey some of the recent results on understanding transfer learning in natural images, as well as findings on transfer learning for medical tasks. In this latter setting, we observe a number of counter-intuitive results, including connections between feature reuse and overparametrization, surprisingly strong performance of lightweight non-standard architectures, and even feature-independent effects of transfer.

2:30 pm – 3:15 pm
Speaker: Jascha Sohl-Dickstein (Google Brain)

The success of deep learning has hinged on learned functions dramatically outperforming hand-designed functions for many tasks. However, we still train models using hand-designed parameter update rules acting on hand-designed loss functions. I will argue that these hand-designed components are typically mismatched to the desired model behavior, and that we can expect meta-learned update rules and loss functions to perform better. I will introduce meta-learned parameter update rules targeting unsupervised representation learning, semi-supervised learning, data augmentation, and supervised classification. I will show that these update rules can be made biologically plausible, allowing us to meta-learn algorithms which might be used by biological brains. I will additionally discuss common pathologies and challenges that are encountered when meta-learning an update rule, and demonstrate solutions to some of these pathologies.
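
To fix the distinction being drawn here: a hand-designed update rule is something like w <- w - lr * g, while a meta-learned rule replaces that formula with a learned function of the gradient (and other statistics) whose own parameters are trained across tasks. The toy sketch below is entirely my own; a real meta-learned optimizer would use a neural network for the update and train its parameters on a meta-objective. It only illustrates the shape of the interface.

```python
# Hedged toy sketch of the "learned update rule" interface: the per-step update
# is a parameterized function of the gradient and an internal state, rather than
# the hand-designed SGD step. phi would be meta-learned across tasks; here it is
# fixed and tiny, purely for illustration.
import numpy as np

def learned_update(w, g, state, phi):
    state = 0.9 * state + 0.1 * g                 # momentum-like running average
    step = phi[0] * g + phi[1] * state            # "learned" combination of features
    return w - step, state

w, state, phi = np.zeros(3), np.zeros(3), (0.05, 0.05)
target = np.array([1.0, -2.0, 0.5])
for _ in range(200):
    g = 2.0 * (w - target)                        # gradient of a toy quadratic loss
    w, state = learned_update(w, g, state, phi)
print(w)                                          # approaches the quadratic's minimizer
```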

3:15 pm – 4:00 pm
Speaker: Gintare Karolina Dziugaite (Element AI)

I will present my recent work on constructing generalization bounds for deep neural networks in order to understand existing learning algorithms and propose new ones. The tightness of generalization bounds varies widely, and depends on the complexity of the learning task and the amount of data available, but also on how much information the bounds take into consideration. My work is particularly concerned with data- and algorithm-dependent bounds that are numerically nonvacuous. I will first present computational techniques that use PAC-Bayes bounds built from parameters obtained by stochastic gradient descent (SGD), and discuss the limitations of these bounds for particular choices of prior and posterior distributions on the parameters. I will then talk about my recent progress on tightening these bounds by constructing data-distribution-dependent priors using the training data.

Joint work with Daniel M. Roy (University of Toronto, Vector Institute), Waseem Gharbieh (Element AI), Alexander Lacoste (Element AI), and Chin Wei (MILA and Element AI).
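
For readers unfamiliar with the starting point of this line of work, one standard PAC-Bayes bound has the following shape (hedged: this is a McAllester-style form, constants and the exact logarithmic term vary across statements, and the numerically nonvacuous bounds discussed in the talk use more refined versions).

```latex
% McAllester-style PAC-Bayes bound (one standard form; constants vary).
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% simultaneously for all "posterior" distributions Q over hypotheses,
\[
  \mathbb{E}_{h \sim Q}\big[ L(h) \big]
  \;\le\;
  \mathbb{E}_{h \sim Q}\big[ \hat L_S(h) \big]
  + \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{n}}{\delta} }{ 2n } },
\]
% where P is a prior fixed before seeing S, L is the true risk, and \hat L_S
% is the empirical risk on S.
```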

4:15 pm – 4:45 pm
Speaker: Aarti Singh (Carnegie Mellon University)

It is widely believed that the practical success of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) owes to the fact that CNNs and RNNs use a more compact parametric representation than their Fully-Connected Neural Network (FNN) counterparts, and consequently require fewer training examples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample complexity of estimating CNNs and RNNs. We show that the sample complexity of learning linear CNNs and RNNs scales linearly with their intrinsic dimension, and that this sample complexity is much smaller than for their FNN counterparts. For both CNNs and RNNs, we also present lower bounds showing our sample complexities are tight up to logarithmic factors.