An algorithm runs in "input-sparsity time" if the running time in the RAM model is bounded above by a constant times the size of the input, for arbitrary or worst-case input. Somewhat remarkably, recent work in Randomized Numerical Linear Algebra has shown that fundamental problems such as least-squares and low-rank approximation can be solved to epsilon-accuracy in input-sparsity time. I will describe several related results in constructing low-distortion embeddings for Lp spaces in input-sparsity time, as well as the application of these results to solving problems such as least-squares, least-absolute deviations, and quantile regression. Of particular importance in making these theoretical ideas practical is understanding the tradeoffs between the running time to construct the embedding and the distortion quality of the embedding.
Monday, September 16th, 2013
Linear-time algorithms have long been considered the gold standard of computational efficiency. Indeed, it is hard to imagine doing better than that, since for a nontrivial problem, any algorithm must consider all of the input in order to make a decision. However, as extremely large data sets are pervasive, it is natural to wonder what one can do in {em sub-linear} time. Over the past two decades, several surprising advances have been made on designing such algorithms. We will give a non-exhaustive survey of this emerging area, highlighting recent progress and directions for further research.
The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the last decade provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices.
In this talk we will outline how such approaches can be used to approximate problems ranging from matrix multiplication and the Singular Value Decomposition (SVD) of matrices to approximately solving least-squares problems and systems of linear equations. Application of such methods to data analysis will also be discussed.
A sketch of a matrix $A$ is another matrix $B$ which is significantly smaller than $A$, but still approximates it well.
Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can be processed only once and storage is severely limited.
In this paper, we adapt a well known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives $n$ rows of a large matrix $A in R^{n imes m}$ one after the other, in a streaming fashion. It maintains a sketch $B in R^ { ell imes m}$ containing only $ell ll n$ rows but still guarantees that $A^TA pprox B^TB$. More accurately,
$$ orall x, |x|=1 0 le |Ax|^2 - |Bx|^2 le 2|A|_{f}^2/ell $$
This algorithm's error decays proportionally to $1/ell$ using $O(m ell)$ space. In comparison, random-projection, hashing or sampling based algorithms produce convergence bounds proportional to $1/sqrt{ell}$. Sketch updates per row in $A$ require amortized $O(mell)$ operations and the algorithm is perfectly parallelizable. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm also stands out in that it is deterministic, simple to implement, and elementary to prove.
Tuesday, September 17th, 2013
Fingerprinting is a widely-used technique for efficiently verifying that two large files are identical. More generally, sketching is a form of lossy compression based on random linear projections that, for example, could be used to estimate the Hamming distance between the files. Both techniques are particularly useful for monitoring data streams or minimizing communication in distributed and parallel computing. Many applications have been studied extensively including sparse signal recovery, locality sensitive hashing, and linear regression.
Our goal is to extend these popular techniques and design random projections that preserve combinatorial and group-theoretic structure in large data sets. In this talk, we present recent results on analyzing spectral graph properties and fingerprinting misaligned text. Both results rely on developing homomorphic sketches that support operations on the compressed data that correspond to operations on the uncompressed data. Applications include the first sublinear space algorithms for processing dynamic graph streams.
Includes joint work with Kook Jin Ahn, Alexandr Andoni, Assaf Goldberger, Sudipto Guha, and Ely Porat.
We build long chains of inference based on the results of data mining. But how do we certify the results of these inference so that we can say things like "there's a 51% chance you're a US person?" And more importantly, how can we generate compact certificates to give to users to validate labels being assigned to them?
Formally, suppose I'm given a clustering of a set of points into groups. How do I assign a score (or set of scores) to a point that captures the strength of its assignment to a group? In this talk I'll describe a method that draws on Voronoi-based interpolation techniques to define such a score, and uses epsilon-samples and convex body sampling to compute the score efficiently. I'll then illustrate the use of such a score with example applications.
This is joint work with Parasaran Raman.
Linear sketches turn an n-dimensional input into a concise lower-dimensional representation via a linear transformation. In almost any realistic setting a linear sketch faces the possibility that its inputs are correlated with previous evaluations of the sketch. Known techniques no longer guarantee the correctness of the output in the presence of such correlations. We therefore ask: are linear sketches inherently non-robust to adaptively chosen inputs? We give a strong affirmative answer to this question. Specifically, we show that no linear sketch approximates the Euclidean norm of its input to within an arbitrary multiplicative approximation factor on a polynomial number of adaptively chosen inputs. The result remains true even if the dimension of the sketch is d = n - o(n) and the sketch is given unbounded computation time. Our result is based on an algorithm with running time polynomial in d that adaptively finds a distribution over inputs on which the sketch is incorrect with constant probability. Our result implies several corollaries for related problems including lp-norm estimation and l2/l2 compressed sensing in the presence of adaptively chosen inputs.
Joint work with Moritz Hardt.
Cloud computing suggests a model where communication between storage/computation devices is a bottleneck resource. An algorithm is nimble if it uses only sublinear (ideally polylogarithmic) communication and polynomial time and space. We expand on the model proposed by Cormode and Muthukrishnan and develop nimble algorithms for computing frequency moments, counting homomorphisms, low-rank approximation and k-means clustering.
This is joint work with Ravi Kannan and David Woodruff.
Row/column sampling techniques were initially introduced to compute low-rank matrix approximation faster than the computation of SVD. In this talk, I will give a quick overview of squared-length sampling, adaptive sampling, volume sampling, relevance-score sampling etc., survey their connections to other topics such as determinantal point processes, rounding Lasserre SDP solutions, graph sparsification, convex geometry, and discuss interesting open problems in these.
Many important properties of an undirected graph manifest themselves spectrally in the eigenvalues or quadratic forms of matrices related to the graph. For instance, the connectivity structure, electrical properties, and random walk behavior of a graph are determined by its Laplacian matrix. A spectral sparsifier of a graph G is a sparse graph H on the same set of vertices such that the Laplacians of H and G are close, so that H captures the spectral behavior of G while being much cheaper to store and perform computations on. We survey a line of work showing that a nearly linear size spectral sparsifier exists for every graph and can be computed in the one pass streaming model with small update time per edge, and discuss possible improvements which could make this algorithm more useful in practice.
We review recent results for sublinear-time testing of fundamental graph properties in sparse graphs.
We consider the problem in which an input graph G is represented as an adjacency matrix and we're asking a question whether G has a given property P or is "far" from any graph satisfying P. The central question in this area is which properties can be tested very quickly, in particular, in constant-time, independently of the size of the input graph.
We will present recent advances in this area, focusing on planar graphs and graphs with poor expansion properties.
This is talk is based on joint papers with Christian Sohler and others.
I will discuss two recent projects that strive to measure less but infer some or more about buildings/bridges and graphs. In the first project, our goal is to collect data about buildings and bridges for efficient, effective structural health monitoring. In the second project, our goal is to collect boundary measurements on a graph and to infer its "material" properties, using a mathematical generalization of optical tomography.
Wednesday, September 18th, 2013
Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing for example. We prove the equivalence between the seriation and the combinatorial 2-sum problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case and we produce a convex relaxation for the 2-sum problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, to solve semi-supervised seriation problems. We present numerical experiments on archeological data, Markov chains and gene sequences.
Modern massive datasets create a fundamental problem at the intersection of the computational and statistical sciences: how to provide guarantees on the quality of statistical inference given bounds on computational resources such as time or space. Our approach to this problem is to define a notion of “algorithmic weakening,” in which a hierarchy of algorithms is ordered by both computational efficiency and statistical efficiency, allowing the growing strength of the data at scale to be traded off against the need for sophisticated processing. We illustrate this approach in the setting of denoising problems, using convex relaxation as the core inferential tool. Hierarchies of convex relaxations have been widely used in theoretical computer science to yield tractable approximation algorithms to many computationally intractable tasks. In this talk we show how to endow such hierarchies with a statistical characterization and thereby obtain concrete tradeoffs relating algorithmic runtime to amount of data.
Joint work with Michael Jordan.
We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, $p$-dimensional samples are presented sequentially, and the goal is to produce the $k$-dimensional subspace that best approximates these points. Standard algorithms require $O(p^2)$ memory; meanwhile no algorithm can do better than $O(kp)$ memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the {em spiked covariance model}, where $p$-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spi ke can be recovered when the number of samples, $n$, scales proportionally with the dimension, $p$. Yet, all algorithms that provably achieve this, have memory complexity $O(p^2)$. Meanwhile, algorithms with memory-complexity $O(kp)$ do not have provable bounds on sample complexity comparable to $p$. We present an algorithm that achieves both: it uses $O(kp)$ memory (meaning storage of any kind) and is able to compute the $k$-dimensional spike with $O(p log p)$ sample-complexity.
Models or signals exhibiting low dimensional behavior (e.g., sparse signals) play an important role in signal processing and machine learning. In this talk, we focus on models that have multiple structures simultaneously; e.g., matrices that are both low rank and sparse, arising in phase retrieval, quadratic compressed sensing, and sparse PCA. We consider estimating such models from observations corrupted by additive Gaussian noise.
We provide tight upper and lower bounds on the mean squared error (MSE) of a convex denoising program that uses a combination of regularizers. In the case of low rank and sparse matrices, we quantify the gap between the MSE of the convex program and the best achievable error, and present a simple (nonconvex) thresholding algorithm that outperforms its convex counterpart and achieves almost optimal MSE.
A feature common to many sparse optimization problems is that the number of variables may be significantly larger than the number of constraints—e.g., the standard matrix-lifting approach for binary optimization results in a problem where the number of variables is quadratic in the number of constraints. We consider a duality framework applicable to a wide range of nonsmooth sparse optimization problems that allows us to leverage the relatively small number of constraints. Preliminary numerical results illustrate our approach and its flexibility.
We study large-scale regression analysis where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more. This is very different from the now well-studied high-dimensional regression context of "large p, small n". For example, in our "large p, large n" setting, an ordinary least squares estimator may be inappropriate for computational, rather than statistical, reasons. In general one must seek a compromise between statistical and computational efficiency. Furthermore, in contrast to the common assumption of signal sparsity for high dimensional data, here it is the design matrices which are typically sparse in applications. Our approach for dealing with this large, sparse data is based on b-bit min-wise hashing (Li and Koenig, 2011). This can be viewed as a dimensionality reduction technique for a sparse binary design matrix; our variant, which we call min-wise hash random-sign mapping (MRS mapping) also handles the real-valued case, and is more amenable to statistical analysis. For both linear and the logistic models, we give finite-sample bounds on the prediction error of procedures which perform regression in the new lower-dimensional space after applying MRS mapping. In particular, we show that the average prediction error goes to 0 asymptotically as long as q*l(b)^2/n goes to 0 for n to infinity, where q is the maximal number of non-zero entries in each row of the design matrix and l(b) is the l2-norm of an optimal coefficient for the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data can actually allow us to capture interactions in the original data. In addition to the results on prediction, we show how one can recover measures of the importance of the original variables. A simulation study and some applications help illustrate the circumstances under which we can expect MRS mapping to perform well.
Stability selection was recently introduced by Meinshausen and Buhlmann as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called Complementary Pairs Stability Selection, and derive bounds both on the expected number of variables included by complementary pairs stability selection that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no (e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control, and therefore increasing the applicability of the methodology.
Graphical models are a popular tool for understanding dependency structure of multivariate data, and sparse graphical models can provide an informative and interpretative summary of high-dimensional data. The commonly used graphical models are usually one of two types: Gaussian graphical models (for continuous data) and the Ising models or Markov networks (for binary and discrete data). However, in practice both types of variables are frequently present in the same dataset, creating the need for mixed graphical models. Some models for these were developed in the earlier literature, but none of them scale to high dimensions. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures, and develop a computationally efficient algorithm for fitting the model. We will also discuss another extension of the graphical model that allows the graph to depend on additional covariates, and apply it to data on genetic instability in tumor samples.
Thursday, September 19th, 2013
While there has been significant strides made in developing small space algorithms in many constrained settings generally referred to massive data algorithms, a natural and ever present question remains: Can these techniques focused on constrained representations allow us to solve problems which we did not know how to solve in an unconstrained setting earlier?
In this talk we investigate classic variants of the Maximum Matching problem. We show that using ideas from succinct representations and linear programming, we can provide near optimal approximation schemed for several variants in near linear time which we did not know before. These ideas also improve the number of passes required by popular algorithmic frameworks such as MapReduce or Semi-Streaming for computing near optimal solutions to the maximum matching problem (as well as variants).
This talk will discuss sparse Johnson-Lindenstrauss transforms, i.e. sparse linear maps into much lower dimension which preserve the Euclidean geometry of a set of vectors. Both upper and lower bounds will be presented, as well as applications to certain domains such as numerical linear algebra.
The talk will cover a variety of probabilistic hashing methods for compact data representations and their applications in search and learning with massive, high-dimensional, and possibly dynamic streaming data. For example, fitting logistic regression (or SVM) with billion or billion square variables will be challenging and highly useful in the context of search; and the recent development of b-bit minwise hashing and one permutation hashing will be highly effective for this type of applications. Another exciting example is sparse recovery (compressed sensing). This talk will also present new hashing algorithms for accomplishing efficient signal recovery.
In many applications involving high dimensional data, we are interested in identifying sparse segments of signal in a noisy background. The signals can be copy number variation, or objects in video surveillance among others. We study the relationship between the detectability of a signal, and its shape as well as duration; and the implications of such a relationship on detection strategies.
Succinct data models have been widely used for imaging tasks such as recognition, denoising and superresolution. Their success depends crucially on the ability to learn models that accurately reflect the physical and statistical structure of the processes that generate the data. We describe two situations in which it is possible to do this, with provable performance guarantees. In the context of visual object recognition, we describe concise models which provably identify objects under arbitrary distant illumination. Our results show how to build a model that provably accepts any image that is sufficiently close to an image of the object of interest, and reject any that is not. We prove bounds that relate the complexity of the representation to the complexity of the object to be recognized, as measured by its defect from convexity. Our proofs and practical algorithms leverage t he the structure in the data—in particular, the existence of sparse and low-rank decompositions for the direct illumination operator. For other visual data problems such as denoising or superresolution, the ability to learn sparse models from data appears to be crucial to achieving good practical performance. For this problem, there is well-developed practice, but theoretical guarantees for the efficient algorithms are difficult to obtain, due to the bilinear nature of the problem. Here, again, we describe one situation in which mathematical guarantees are possible—namely, if the target dictionary is square and nonsingular, and the coefficients satisfy an appropriate random model. In this case, a simple heuristic based on convex programming guarantees to efficiently recover the target coefficients and dictionary.
This talk describes joint work with Yuqian Zhang (Columbia University), Cun Mu (Columbia University), Han-Wen Kuo (Columbia University), Dan Spielman (Yale University) and Huan Wang (Yahoo! Research).
Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the L1-norm. In this work, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovasz extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
We often model signals from the physical world with continuous parameterizations. Unfortunately, continuous models pose problems for the tools of sparse approximation, and popular discrete approximations are fraught with theoretical and algorithmic issues. In this talk, I will propose a general, convex-optimization framework—called atomic-norm denoising—that obviates discretization and gridding by generalizing sparse approximation to continuous dictionaries.
As an extended example that highlights the salient features of the atomic-norm framework, I will highlight the problem of estimating the frequencies and phases of a mixture of complex exponentials from noisy, incomplete data. I will demonstrate that atomic-norm denoising outperforms state-of-the-art spectral and sparse-approximation methods in both theory and practice. I will then close with a discussion of additional applications of the atomic-norm framework including deconvolution, deblurring, and system identification.
Joint work with Badri Bhaskar, Parikshit Shah, and Gongguo Tang.