Monday, October 27th, 2014

9:30 am – 10:15 am

SVD is known to solve the inference problem for Topic Modeling when each document is pure (on a single topic). Known provable algorithms for non-pure models require strong assumptions on the data. We formulate a model with two natural assumptions: (i) each document has a dominant topic and (ii) each topic has dominant words. We provably solve the model fitting problem. Our algorithm has three main steps: Thresholding, SVD and k-means. We test both the assumptions and the performance of the algorithm on real corpora with success. Pre-SVD thresholding can be used in other contexts; for the “Planted Gaussian” problem, it offers an exponential advantage in terms of the signal-to-noise ratio.
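As a rough illustration of the three-step pipeline above, here is a minimal sketch (not the authors' exact procedure; the term-document matrix A, the number of topics k, and the cutoff tau are assumed inputs, and the paper's data-dependent thresholding rule is not reproduced):

    import numpy as np
    from sklearn.cluster import KMeans

    def threshold_svd_kmeans(A, k, tau):
        # Step 1: thresholding -- zero out small entries to suppress non-dominant words
        B = np.where(A >= tau, A, 0.0)
        # Step 2: rank-k SVD of the thresholded term-document matrix
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        docs = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents embedded in k dimensions
        # Step 3: k-means on the embedded documents; clusters correspond to dominant topics
        return KMeans(n_clusters=k, n_init=10).fit_predict(docs)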

Joint with T. Bansal and C. Bhattacharyya, Indian Institute of Science.

10:45 am – 11:30 am

When do convex relaxations return integer solutions? In this talk, I will discuss several instances of discrete optimization problems where, for a suitable distribution on inputs, LP and SDP relaxations produce integer solutions with high probability. This phenomenon has been studied in the context of LP decoding, sparse recovery, stochastic block models, and so on. I will also mention some recent results for clustering problems: when points are drawn from a distribution over k sufficiently separated clusters, the well-known k-median relaxation and a (not-so-well-known) SDP relaxation for k-means exactly recover the clusters.
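For reference, the k-median relaxation mentioned above is presumably the standard LP with assignment variables x_{ij} (point i served by center j) and opening variables y_j, relaxed from \{0,1\} to [0,1]:

    minimize   \sum_{i,j} d(i,j)\, x_{ij}
    subject to \sum_j x_{ij} = 1 for all i,   x_{ij} \le y_j,   \sum_j y_j \le k,   0 \le x_{ij}, y_j \le 1.

"Returning an integer solution" then means that, with high probability over the input distribution, the LP optimum is already integral and identifies the planted clusters.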

12:00 pm – 12:45 pm

The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et al. (2012). Here, we provide an attempt at quantifying this improvement through theoretical analysis. Under the stochastic block model (SBM) and its extensions, previous results on spectral clustering relied on the minimum degree of the graph being sufficiently large to guarantee good performance. By examining the scenario where the regularization parameter τ is large, we show that the minimum degree assumption can potentially be removed. As a special case, for an SBM with two blocks, the results require the maximum degree to be large (growing faster than log n) as opposed to the minimum degree.

More importantly, we show the usefulness of regularization in situations where not all nodes belong to well-defined clusters. Our results rely on a `bias-variance'-like trade-off that arises from understanding the concentration of the sample Laplacian and the eigen gap as a function of the regularization parameter. As a byproduct of our bounds, we propose a data-driven technique \textit{DKest} (standing for estimated Davis-Kahan bounds) for choosing the regularization parameter. This technique is shown to work well through simulations and on a real data set.
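For concreteness, a minimal sketch of regularized spectral clustering in this style (using the common convention of adding τ/n to every entry of the adjacency matrix before forming the normalized operator; variable names are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def regularized_spectral_clustering(A, k, tau):
        n = A.shape[0]
        A_tau = A + tau / n                        # regularized adjacency matrix
        d_tau = A_tau.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tau))
        L_tau = D_inv_sqrt @ A_tau @ D_inv_sqrt    # regularized normalized adjacency
        vals, vecs = np.linalg.eigh(L_tau)
        X = vecs[:, -k:]                           # spectral embedding from top-k eigenvectors
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)

Choosing τ is exactly the role of the DKest procedure described above.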

This is joint work with Antony Joseph.

2:30 pm – 3:15 pm

Spectral analysis of graphs has led to powerful algorithms, for example in machine learning, in particular for regression, classification, and clustering. Eigenfunctions of the Laplacian on a graph are a natural basis for analyzing functions on a graph. In this talk we discuss a new flexible set of basis functions, called Diffusion Wavelets, that allow for a multiscale analysis of functions on a graph, very much in the same way classical wavelets perform a multiscale analysis in Euclidean spaces. They allow efficient representation, compression, and denoising of functions on the graph, and are very well suited for statistical learning tasks, as well as unsupervised algorithms. They are also associated with a multiscale decomposition of the graph itself, which has applications of its own, for example to the analysis of time series of graphs and to defining families of new robust distances between graphs. We will discuss this construction with several examples, going from signal processing on manifolds and graphs to some applications to clustering and learning.
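To give a sense of the scales involved: starting from a diffusion operator on the graph, say T = D^{-1}W, the construction works with the dyadic powers T, T^2, T^4, \dots, T^{2^j}, compressing the range of each power to obtain scaling functions at scale j, with wavelets spanning the complements (this is only a one-line caricature of the construction; the actual algorithm performs the compression via sparse orthogonalization).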

3:45 pm – 4:30 pm

Real-world large graphs and networks often exhibit complex hierarchical structure, which goes beyond what can be uncovered by traditional linear algebra tools such as eigendecomposition. In this talk I describe a new notion of matrix factorization inspired by multiresolution analysis that can capture structure at multiple scales. Multiresolution Matrix Factorizations (MMFs) provide both a model for graphs with multiscale structure and a wavelet basis for approximating functions on such graphs. The work presented in this talk is joint with Nedelina Teneva and Vikas Garg.

Tuesday, October 28th, 2014

9:30 am – 10:15 am

The usual analysis of random walks on graphs uses the usual eigenvectors (and eigenvalues). For absorbing Markov chains a different theory is available. In joint work with Laurent Miclo, we have begun studying rates of convergence to quasi-stationarity. We have lots of examples and the start of a general theory, associating an ergodic chain to an absorbing chain in such a way that the rates are comparable. There is lots to be done.
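For orientation, the quasi-stationary distribution itself is a classical object: it is the normalized left Perron eigenvector of the substochastic transition matrix restricted to the non-absorbing states. A minimal numerical sketch (illustrative values, not from the talk):

    import numpy as np

    def quasi_stationary_distribution(Q):
        # Q: substochastic transition matrix restricted to the transient states
        vals, vecs = np.linalg.eig(Q.T)    # left eigenvectors of Q
        i = np.argmax(vals.real)           # index of the top (Perron) eigenvalue
        v = np.abs(vecs[:, i].real)
        return v / v.sum()                 # normalize to a probability distribution

    # Birth-death chain on {1, 2, 3}, absorbed at 0 from state 1
    Q = np.array([[0.0, 0.5, 0.0],
                  [0.3, 0.0, 0.7],
                  [0.0, 0.4, 0.6]])
    print(quasi_stationary_distribution(Q))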

10:45 am – 11:30 am

ICA is a classical model of unsupervised learning originating in, and widely used in, signal processing. In this model, observations in n-space are obtained by an unknown linear transformation of a signal with independent components. The goal is to learn the transformation and thereby the hidden basis of the independent components. In this talk, we'll begin with a polynomial-time algorithm called Fourier PCA for underdetermined ICA, the setting where the observed signal is in a lower-dimensional space than the original signal, assuming a mild, verifiable condition on the transformation. Its analysis uses higher-order derivatives of Fourier transforms and anticoncentration of Gaussian polynomials. The running time and sample complexity are polynomial, but of high constant degree, for any constant error; a MATLAB implementation of the algorithm appears to need 10^4 samples for 10-dimensional observations. We then present a more efficient recursive variant for the fully determined case of standard ICA that needs only a nearly linear number of samples and has polynomial time complexity. An implementation of this variant appears to outperform known heuristics. Its analysis is based on understanding the roots of random polynomials. This is joint work with Navin Goyal and Ying Xiao.
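For readers new to the model, the generative process is x = As with s having independent non-Gaussian coordinates, and the goal is to recover A up to permutation and scaling of its columns. A tiny simulation, using scikit-learn's FastICA as a stand-in recovery heuristic (not the Fourier PCA or recursive algorithms from the talk):

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    n, d = 10000, 5
    S = rng.laplace(size=(n, d))        # independent non-Gaussian sources
    A = rng.standard_normal((d, d))     # unknown mixing matrix
    X = S @ A.T                         # observations x = A s

    ica = FastICA(n_components=d, random_state=0)
    S_hat = ica.fit_transform(X)        # estimated sources (up to permutation and scale)
    A_hat = ica.mixing_                 # estimated mixing matrix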

12:00 pm – 12:45 pm

Incorporating latent or hidden variables is a crucial aspect of statistical modeling. I will present a statistical and computational framework for guaranteed learning of a wide range of latent variable models such as topic models, Gaussian mixtures, hidden Markov models, and community models. It involves decomposition of multivariate moment tensors through linear algebraic and multilinear algebraic operations. I will discuss the deployment of these methods for discovering communities in Facebook, Yelp, and DBLP data. The tensor methods are easily parallelizable and are orders of magnitude faster than the state-of-the-art stochastic variational approach.
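The computational core of these methods is (orthogonal) tensor decomposition. A minimal sketch of the tensor power iteration for a symmetric third-order moment tensor T (whitening and deflation, which the full method requires, are omitted):

    import numpy as np

    def tensor_power_iteration(T, n_iter=100, seed=0):
        # Extract one eigenvector of a symmetric 3rd-order tensor T (k x k x k)
        # by repeatedly applying the map v -> T(I, v, v) and normalizing.
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T, v, v)     # multilinear contraction T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # associated eigenvalue
        return lam, v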

2:30 pm – 3:15 pm

Random walks on undirected graphs have been extensively studied in the literature. In contrast, while numerous real-world complex networks and related applications concern directed graphs, the theory of random walks on directed graphs and the associated spectral approaches have not been as well developed. In this talk, we will discuss various problems and results on random walks on directed graphs, as well as some methods and applications arising in dealing with directed complex networks.
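One common spectral object in this setting is Chung's directed Laplacian, built from the walk's stationary distribution: if P is the transition probability matrix and \Phi the diagonal matrix of its Perron (stationary) vector, then \mathcal{L} = I - \tfrac{1}{2}\left(\Phi^{1/2} P \Phi^{-1/2} + \Phi^{-1/2} P^{\top} \Phi^{1/2}\right), which reduces to the usual normalized Laplacian when the graph is undirected.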
3:45 pm – 4:30 pm

The recent dramatic success of Deep Neural Networks (DNNs) in many applications highlights the statistical benefits of marrying near-nonparametric models with large datasets, using efficient optimization algorithms running in distributed computing environments. In the 1990s, kernel methods became the toolset of choice for a wide variety of machine learning problems due to their theoretical appeal and algorithmic roots in convex optimization, replacing neural nets in many settings. So what changed between then and the modern deep learning revolution? Perhaps the advent of "big data", perhaps the notion of "depth", perhaps better DNN training algorithms, or all of the above. Or perhaps also that the development of kernel methods has somewhat lagged behind in terms of scalable training techniques, effective mechanisms for kernel learning, and parallel implementations.

I will describe new efforts to resolve scalability challenges of kernel methods, for both scalar and multivariate prediction settings, allowing them to be trained on big data using a combination of randomized data embeddings, Quasi-Monte Carlo (QMC) acceleration, distributed convex optimization and input-output kernel learning. I will report that on classic speech recognition and computer vision datasets, randomized kernel methods and deep neural networks turn out to have essentially identical performance. Curiously, though, randomized kernel methods begin to look a bit like neural networks, but with a clear mathematical basis for their architecture. Conversely, invariant kernel learning and matrix-valued kernels may offer a new way to construct deeper architectures. This talk will describe recent research results and personal perspectives on points of synergy between these fields.
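As one concrete instance of the "randomized data embeddings" mentioned above, the classical random Fourier feature map for the Gaussian kernel is sketched below (a standard construction shown only to fix ideas; QMC acceleration would replace the i.i.d. frequency draws with a low-discrepancy sequence):

    import numpy as np

    def random_fourier_features(X, D=2048, gamma=1.0, seed=0):
        # Map X (n x d) to z(X) (n x D) so that z(x) . z(y) ~ exp(-gamma * ||x - y||^2)
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))   # random frequencies
        b = rng.uniform(0, 2 * np.pi, size=D)                   # random phases
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    # A linear model trained on z(X) then approximates the corresponding kernel method.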

Wednesday, October 29th, 2014

9:30 am – 10:15 am

I will discuss the Principal Component Analysis problem for large tensors of arbitrary order k under a single-spike (or rank-one plus noise) model. On the one hand, using information theory and recent results in probability theory, I will provide necessary and sufficient conditions under which the principal component can be estimated using unbounded computational resources. It turns out that this is possible as soon as the signal-to-noise ratio \beta becomes larger than C\sqrt{k\log k} (and in particular \beta can remain bounded as the problem dimensions increase). On the other hand, I will analyze several polynomial-time estimation algorithms, based on various approaches:

  1. Tensor matricization and spectral analysis, 
  2. Semidefinite relaxations,
  3. Power iteration.

I will show that, unless the signal-to-noise ratio diverges with the system dimensions, none of these approaches succeeds. This is possibly related to a fundamental limitation of polynomial estimators for this problem. While complexity theory suggests that intractability holds from a worst-case point of view, no analogous result has been proved under statistical models of the data.
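Concretely, the single-spike model above observes (up to normalization) a symmetric order-k tensor T = \beta\, v^{\otimes k} + Z, where v is the unit-norm principal component to be estimated and Z is a noise tensor with i.i.d. standard Gaussian entries; \beta is the signal-to-noise ratio whose scaling with the dimension determines which of the estimators above succeed.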

10:45 am – 11:30 am

Graphs are a ubiquitous mathematical abstraction employed in numerous problems in science and engineering. Of particular importance is the need to find the best structure-preserving matching of graphs. Since graph matching is a computationally intractable problem, numerous heuristics exist to approximate its solution. An important class of graph matching heuristics is relaxation techniques based on replacing the original problem by a continuous convex program. Conditions for the applicability or inapplicability of such convex relaxations are poorly understood. In this talk, I will show easy-to-check spectral properties characterizing a wide family of graphs for which equivalence of the convex relaxation to exact graph matching is guaranteed to hold.
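To fix notation: exact graph matching between adjacency matrices A and B seeks a permutation matrix P minimizing \|AP - PB\|_F^2, and the convex relaxation in question replaces the permutation matrices by their convex hull, the doubly stochastic matrices, i.e. \min_{P \in \mathcal{D}_n} \|AP - PB\|_F^2. The spectral conditions characterize graphs for which the relaxed optimum is attained exactly at the underlying permutation.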

12:00 pm – 12:45 pm

The graph Laplacian has been of interest in statistics, machine learning, and theoretical computer science in areas ranging from manifold learning to the analysis of Markov chains. A common use of the graph Laplacian has been in spectral clustering and dimension reduction. A theoretical motivation for why spectral clustering works is the Cheeger inequality, which relates the eigenvalues of the graph Laplacian to how disconnected the graph is. We ask how the Cheeger inequality extends to higher-order Laplacians (operators on simplicial complexes), and what clustering means for these higher-order operators. Related to the graph Laplacian is the idea of random walks on graphs. We will define a random walk on simplicial complexes with a stationary distribution that is related to the k-dimensional Laplacian. The stationary distribution reveals (co)homology of the geometry of the random walk. We apply this random walk to the problem of semi-supervised learning: given some labeled observations and many unlabeled observations, how does one propagate the labels?
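For reference, the inequality in question reads \lambda_2/2 \le h(G) \le \sqrt{2\lambda_2} for the normalized graph Laplacian, where \lambda_2 is its second-smallest eigenvalue and h(G) is the conductance (Cheeger constant); a small \lambda_2 thus certifies the existence of a sparse cut.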

2:30 pm – 3:15 pm

The graph Laplacian is a linear operator that takes real-valued functions on vertices to real-valued functions on vertices, or alternatively, a bilinear functional on pairs of real-valued functions on vertices. Three generalizations have recently been studied: the connection Laplacian [Wu-Singer, 2012] allows for tensor-valued or Lie group-valued functions on vertices; the Hodge Laplacian [Jiang-Lim-Yao-Ye, 2011] allows for real-valued functions on higher-order cliques (vertices are 0-cliques); the tensor Laplacian [Hu-Qi, 2013] allows for a k-linear functional on k-tuples of real-valued vertex functions. We will compare the spectral theories of these three Laplacians in the context of graph theory.
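For concreteness, the bilinear-functional viewpoint is \langle f, Lg\rangle = \sum_{(i,j)\in E} w_{ij}\,(f_i - f_j)(g_i - g_j) for vertex functions f and g; the connection Laplacian changes what the values f_i are, the Hodge Laplacian changes the domain from vertices to higher-order cliques, and the tensor Laplacian changes the number of arguments of the form.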

Thursday, October 30th, 2014

10:45 am – 11:30 am

Several challenging problems in clustering, partitioning, and imaging have traditionally been solved using spectral techniques. These problems include the normalized cut problem, the graph expander ratio problem, the Cheeger constant problem, and the conductance problem. These problems share several common features: all seek a bipartition of a set of elements, and all are formulated as a form of ratio cut equivalent to a quadratic ratio, referred to as the Rayleigh ratio. The Rayleigh ratio optimization is defined on discrete variables and a single sum constraint, which we call the balance or orthogonality constraint. The spectral method for these problems is tantamount to solving a relaxation of the Rayleigh problem that omits the discreteness constraints. On the other hand, relaxing the orthogonality constraint instead yields optimization problems that are interesting in their own right. It is shown that a new combinatorial algorithm solves these latter discrete problems in strongly polynomial time, O(mn\log(n^2/m)), for a graph on n nodes and m edges. The implemented algorithm, using HPF (Hochbaum's Pseudo-Flow) as a subroutine, is efficient enough to solve these bi-partitioning problems on millions of elements and more than 300 million edges within a couple of minutes. It is also shown, via an experimental study, that the combinatorial algorithm often improves dramatically on the quality of the spectral method's results, as measured by proximity to the optimum of the respective NP-hard problems.
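To make the Rayleigh ratio concrete in the normalized cut case: the discrete problem minimizes \frac{y^{\top} L y}{y^{\top} D y} over vectors y taking only two distinct values, subject to the balance constraint y^{\top} D \mathbf{1} = 0. The spectral method drops the two-value restriction and returns an eigenvector; the combinatorial algorithm described here instead drops the balance constraint and solves the resulting discrete problem exactly.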

12:00 pm – 12:45 pm

Let M be a bounded domain of R^d with smooth boundary. We relate the Cheeger constant of M to the Cheeger constant of a neighborhood graph defined on a random sample from M. By restricting the minimization defining the latter to a particular class of subsets, we obtain consistency (after normalization) as the sample size increases, and show that any minimizing sequence of subsets has a subsequence converging to a Cheeger set of M.
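Recall that for such a domain the Cheeger constant is h(M) = \inf_{A \subset M} \mathrm{Per}(A)/\min(\mathrm{vol}(A), \mathrm{vol}(M \setminus A)), where \mathrm{Per}(A) is the perimeter of A inside M; the graph analogue replaces perimeter by the weight of the cut edges and volume by a count of the sampled points.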

Joint work with Bruno Pelletier and Pierre Pudlo.

2:15 pm – 3:00 pm

Hodge Theory is a milestone bridging differential geometry and algebraic topology. It studies certain functions (called forms) on data rather than the data points themselves, and brings an optimization perspective to decomposing such functions in a way adapted to the underlying topology. Recently, Hodge Theory has inspired rising applications in computer vision, multimedia, statistical ranking, and game and voting theory, in addition to traditional applications in mechanics. In this talk we give an introduction to Hodge Theory with examples from these applications.
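The central statement behind these applications is the Hodge decomposition: the space of forms (e.g., edge flows on a graph) splits orthogonally into a gradient part, a harmonic part, and a curl part, \mathrm{im}(\mathrm{grad}) \oplus \mathrm{ker}(\Delta_1) \oplus \mathrm{im}(\mathrm{curl}^{*}). In statistical ranking, for instance, the gradient component yields a global score while the remaining components quantify the inconsistency of the pairwise comparisons.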

3:15 pm – 4:00 pm

Spectral clustering is a standard method for data analysis used in a broad range of applications. I will describe a new class of algorithms for multiway spectral clustering based on optimization of a certain class of functions after the spectral embedding. These algorithms can be interpreted geometrically as reconstructing a discrete weighted simplex. They have some resemblance to Independent Component Analysis and involve optimization of "contrast functions" over a sphere. However, in our case, theoretical guarantees can be provided for a much broader class of functions satisfying a "hidden" convexity condition. The algorithms are straightforward to implement, efficient, and not initialization-dependent.

4:15 pm – 5:00 pm

We study a variety of spectral-based information diffusion problems that occur in the literature as relaxations of min-cut objectives. This allows us to describe two methods to make these diffusion-based algorithms more robust to the details of graph constructions, errors in node labeling, degree variability, and a variety of other real-world heterogeneities. The two methods are solidly grounded in the theory of spectral graph methods and include a rank-based label assignment strategy and a scalable approximation algorithm for the basic diffusion algorithm that implicitly has sparsity-based regularization properties. In addition to describing the framework, we provide an empirical study of these methods on a digit prediction problem as well as a variety of other pedagogical examples.
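As a baseline for the "basic diffusion algorithm", one widely used instance is seeded, personalized-PageRank-style diffusion, sketched below with illustrative parameter names (a generic textbook version, not the robustified rank-based or sparsity-regularized variants of the talk):

    import numpy as np

    def seeded_diffusion(W, seeds, alpha=0.85, n_iter=200):
        # Diffuse mass from labeled seed nodes over a weighted adjacency matrix W
        d = W.sum(axis=1)
        P = W / d[:, None]                  # row-stochastic transition matrix
        s = np.zeros(W.shape[0])
        s[seeds] = 1.0 / len(seeds)         # uniform mass on the seeds
        p = s.copy()
        for _ in range(n_iter):
            p = (1 - alpha) * s + alpha * (P.T @ p)   # PageRank-style update
        return p                            # large values suggest the same label as the seeds

    # For multiclass problems, run one diffusion per class and assign each node
    # to the class with the largest (possibly rank-normalized) score.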

Friday, October 31st, 2014

9:30 am – 10:15 am

Stability and regularization are key to successful learning from sampled/noisy data. In this talk we show how different paradigms, beyond penalization, can be used to design learning algorithms that ensure stability and hence generalization. We first illustrate our approach on the problem of supervised function estimation and then on unsupervised support estimation. The different algorithms can be studied within a unified spectral regularization framework. Their analysis combines classical concepts from inverse problems and signal processing with concentration-of-measure results. Our study highlights connections between numerical and statistical stability.
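A unifying way to write the spectral regularization framework alluded to (a standard formulation, stated for supervised estimation): given a data operator with singular system (\sigma_i, u_i, v_i), an estimator takes the form \hat f_\lambda = \sum_i g_\lambda(\sigma_i)\,\langle u_i, y\rangle\, v_i, where the filter g_\lambda(\sigma) \approx 1/\sigma for large \sigma and is damped for small \sigma. Tikhonov regularization (g_\lambda(\sigma) = 1/(\sigma + \lambda)), truncated SVD, and early-stopped iterative methods are all instances of such filters.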

10:45 am – 11:30 am

I present ongoing research on two applications where spectral methods may be helpful. The first application models human memory search as a random walk, but with the twist that only the first visit to a state is observable. The task is to estimate the transition matrix from such censored observations. The second application teaches humans a target labeling on a graph. The assumption is that, given a few initial labeled nodes in the graph, humans use a known algorithm to complete the labeling on the rest of the graph (think of min-cut). The task is to select the smallest set of initial nodes so that the completed labeling equals the target, i.e., the inverse problem of graph-based semi-supervised learning. The goal of the talk is to generate discussion of the research challenges in these and other similar applications.

12:00 pm – 12:45 pm
Speaker: Alexandr Andoni

We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In the practice of NNS, spectral approaches are quite popular and often outperform the random-projection methods that otherwise seem theoretically optimal (on worst-case datasets). This study aims to address this disparity in our understanding.

In particular, we consider a semi-random setting where a dataset P in R^d is chosen arbitrarily from an unknown subspace of low dimension k (much less than d), and then perturbed by fully d-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on d and log|P| for large ranges of k and d. Our algorithms use repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances in P.
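As a simplified caricature of the spectral approach (a single projection onto the top-k PCA subspace followed by exact search in the projection, ignoring the iterative PCA computations and noise handling that the actual algorithms need):

    import numpy as np

    def pca_nns_build(P, k):
        # Project the dataset onto its top-k principal subspace
        mu = P.mean(axis=0)
        U, s, Vt = np.linalg.svd(P - mu, full_matrices=False)
        V = Vt[:k].T                        # d x k basis of the principal subspace
        return mu, V, (P - mu) @ V          # low-dimensional copy of the dataset

    def pca_nns_query(q, mu, V, P_low):
        # Nearest neighbor of q, computed in the k-dimensional projection
        q_low = (q - mu) @ V
        return int(np.argmin(np.linalg.norm(P_low - q_low, axis=1)))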

Joint work with Amirali Abdullah, Ravi Kannan, Robi Krauthgamer.

2:30 pm – 3:15 pm

We present the first parallel algorithm for solving systems of linear equations in symmetric, diagonally dominant (SDD) matrices that runs in polylogarithmic time and nearly-linear work. The heart of our algorithm is a construction of a sparse approximate inverse chain for the input matrix: a sequence of sparse matrices whose product approximates its inverse. Whereas other fast algorithms for solving systems of equations in SDD matrices exploit low-stretch spanning trees and resemble W-cycles in the Multigrid method, our algorithm only requires spectral graph sparsifiers and is analogous to V-cycles.
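The identity behind such a chain, stated for intuition (the actual construction interleaves sparsification at every level): if the system matrix is normalized to I - A with \|A\| < 1, then (I - A)^{-1} = \tfrac{1}{2}\left(I + (I + A)\,(I - A^{2})^{-1}(I + A)\right). Since A^2, A^4, \dots rapidly approach a matrix that is trivial to invert, applying the identity recursively, with each power replaced by a spectral sparsifier, gives a short product of sparse matrices that approximates the inverse.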

Joint work with Dan Spielman.

3:45 pm – 4:30 pm

Today, the amount of data captured by sensors is outpacing our ability to process it efficiently. A common recourse is to partition the big datasets into smaller subsets and process them in parallel. However, such a local processing approach affects the underlying statistics and leaves unwanted artifacts, hence the need for a global approach.

Recent developments in algorithmic graph theory and linear solver technology have opened up new tools such as combinatorial multigrid solvers. Image processing tasks are well suited to take advantage of these emerging graph techniques, as images have an underlying sparse graph structure.

We will describe how fundamental image processing routines can be transformed into grouped least squares problems over graphs, and show how this global optimization approach gives better results than current processing techniques.
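One generic way to write such an objective (an illustrative formulation; the grouping used by each routine varies): given observed image values y on the vertices of a grid graph, solve \min_{x} \|x - y\|_{2}^{2} + \lambda \sum_{g \in \mathcal{G}} \big(\sum_{(i,j) \in g} w_{ij} (x_i - x_j)^{2}\big)^{1/2}, where each group g is a set of edges (for example, the edges around one pixel). Taking every group to be a single edge gives total-variation-like denoising, and iteratively reweighted least squares on this objective reduces to a sequence of graph Laplacian linear systems, which is where the combinatorial multigrid solvers come in.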

Next, we will discuss implementations that leverage new graph toolkits designed by the Big Data and Machine Learning communities. Finally, we will present results from tackling practical big image processing problems in remote sensing using the graph-based approach.