## Mathematics of Information

##### Session Organizers

Mark Iwen, Michigan State University

Rayan Saab, University of California, San Diego

### 8 December, 2022

##### Abstract

Among infinitely many factorizations A=VV* of a psd matrix A, we seek the factor V that has the smallest (1,2) norm. In this talk we review the origin of this problem as well as existing results regarding the optimal value. We discuss also the conjecture that the squared (1,2) norm of V is equivalent to the (1,1) norm of A.

##### Abstract

Over the last few decades, the (unconstrained) LASSO $$\min_{x\in \mathbb{R}^n} \frac12 \lVert A x-b\rVert_2^2 + \lambda \lVert x\rVert_1,$$ has become an indispensable tool in statistical learning, data science, and signal processing, thanks to its ability to efficiently recover sparse approximate solutions to (underdetermined) linear systems.

In this talk, we will present a novel variational analysis of this popular optimization program. First, we will establish smoothness results as well as Lipschitz properties for the optimal value and optimal solution maps of the LASSO as functions of the measurement vector $b$, the sampling matrix $A$, and, most notably, the tuning parameter $\lambda$. Then, we will illustrate how to apply the proposed variational analysis in the context of compressed sensing, validating our theoretical findings with numerical experiments. Finally, we will discuss extensions to related optimization problems such as the the square-root LASSO.

##### Abstract

Previous work has shown that a neural network with the rectified linear unit (ReLU) activation function leads to a convex polyhedral decomposition of the input space. In this talk, we will see how one can utilize this structure to detect and analyze adversarial attacks in the context of digital images.

When an image passes through a network containing ReLU nodes, the firing or non-firing at a node can be encoded as a bit ($1$ for ReLU activation, $0$ for ReLU non-activation). The sequence of all bit activations identifies the image with a bit vector, which identifies it with a polyhedron in the decomposition. We identify ReLU bits that are discriminators between non-adversarial and adversarial images and examine how well collections of these discriminators can ensemble vote to build an adversarial image detector and also present further applications of this induced geometry.

##### Abstract

The remarkable successes of neural networks in a huge variety of inverse problems have fueled their adoption in disciplines ranging from medical imaging to seismic analysis over the past decade. However, the high dimensionality of such inverse problems has simultaneously left current theory, which predicts that networks should scale exponentially in the dimension of the problem, unable to explain why the seemingly small networks used in these settings work as well as they do in practice. To reduce this gap between theory and practice, we provide a general method for bounding the complexity required for a neural network to approximate a Lipschitz function on a high-dimensional set with a low-complexity structure. The approach is based on the fact that many sets of interest in high dimensions have low-distortion linear embeddings into lower dimensional spaces. We can exploit this fact to show that the size of a neural network needed to approximate a Lipschitz function on a low-complexity set in a high dimensional space grows exponentially with the dimension of its low-distortion embedding, not the dimension of the space it lies in.

##### Abstract

The goal of Optimal Recovery is to uncover procedures that are worst-case optimal for the task of learning / recovering unknown functions from a priori information (the model) and a posteriori information (the observations). Optimal recovery procedures can sometimes be chosen as linear maps, at least when the observations are accurate. When they are corrupted, however, the problem becomes more complicated, especially if the focus is put on the practical construction of the (near) optimal procedure. This talk showcases positive results in three situations: deterministic observation errors bounded in $\ell_2$; deterministic observation errors bounded in $\ell_1$; and stochastic log-concave observation errors.

##### Abstract

When the data is large, or comes in a streaming way, randomized iterative methods provide an efficient way to solve a variety of problems, including solving linear systems, finding least square solutions, optimizing with gradient descent and Newton methods, solving feasibility problems and beyond. However, if the data is corrupted with arbitrarily large sparse errors, one can expect the iterates are diverted far from the solution with each corrupted equation they encounter. I am going to talk about our recently proposed robust versions of Randomized Kaczmarz and Stochastic Gradient Descent methods for solving linear systems that avoid harmful corrupted equations by exploring the space as they proceed with the iterates. We show that the convergence rates in expectation are comparable with the convergence of the vanilla methods on uncorrupted systems. I will also touch on how to use the information obtained on the exploration phase efficiently, and what structural characteristics of the data matrix are crucial for such methods. Based on the joint work with Deanna Needell, Jamie Haddock, Will Swartworth, Ben Jarman, and Lu Cheng.

##### Abstract

Quantization is one of the compression techniques to reduce computation cost, memory, and power consumption of deep neural networks (DNNs). In this talk, we will focus on a post-training quantization algorithm, GPFQ, that is based on a deterministic greedy path-following mechanism, and its stochastic variant SGPFQ. In both cases, we rigorously analyze the associated error bounds for quantization and show that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights – i.e., level of over-parametrization. To empirically evaluate the method, we quantize several common DNN architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models.

##### Abstract

The celebrated universal approximation theorems for neural networks typically state that every sufficiently nice function can be arbitrarily well approximated by a neural network with carefully chosen parameters. Motivated by applications where compression is necessary, we ask whether it is possible to represent any reasonable function with a neural network whose parameters are restricted to a small set of values, with the extreme case being one-bit {+1,-1} neural networks? We answer this question in the affirmative for both quadratic and ReLU networks. Our main contributions include a novel approximation result for {+1,-1} sums of multivariate Bernstein polynomials and implementation of this approximation strategy via one-bit neural networks with either the quadratic or ReLU activation functions. Joint work with Sinan Gunturk.

### 9 December, 2022

##### Abstract

Sparse polynomial methods have, over the last decade, become widely used tools for approximating smooth, high-dimensional functions from limited data. Notable applications include parametric modelling and computational uncertainty quantification, discovering governing equations of dynamical systems, and high-dimensional PDEs. A standard sparse polynomial approximation scheme involves solving a convex optimization problem involving a $\ell^1$- or weighted $\ell^1$-norm. This is typically done using first-order methods. However, such methods generally converge very slowly, making them undesirable for the many applications that require high accuracy. Motivated by this problem, in this talk I will describe a general scheme for accelerating first-order optimization algorithms. This scheme applies to any convex optimization problem possessing a certain $\textit{approximate sharpness}$ condition and essentially any first-order optimization algorithm. For a wide range of problems, i.e., not just $\ell^1$-minimization-type problems, this scheme achieves either optimal rates of convergence or improves on previously established rates. To emphasize its usefulness and generality, I will showcase applications in compressive imaging and machine learning.

##### Abstract

A common approach for compressing large-scale data is through matrix sketching. In this talk, we consider the problem of recovering low-rank matrices or tensors from noisy sketches. We provide theoretical guarantees characterizing the error between the output of the sketching algorithm and the ground truth low-rank matrix or tensor. Applications of this approach to synthetic data and medical imaging data will be presented.

##### Abstract

Recently several streaming algorithms for computing low-rank Tucker approximations of large tensors have been introduced by several researchers including Tropp, Udell, et al., De Lathauwer et al., and others. In this talk we will review the basic approach used by these works highlighting their similarities, and then present generalized theoretical guarantees proving their accuracy on full-rank and noisy data with high probability from a wide range of measurements. Some of our newly proposed structured measurements have the advantage of reducing the overall memory needed to sketch and recover large, potentially streamed, tensors and can also be chosen to economize sketching and recovery time. In particular, our recovery algorithms (after measurements have been collected) are sublinear-time in the overall tensor size. In addition, we improve upon prior works by providing stronger recovery error bounds which apply to a much more general class of measurements than previously analyzed. Our numerical experiments further highlight the value of these new measurement choices in various problem scenarios.

##### Abstract

We will discuss variants of matrix and tensor CUR decompositions and algorithms for Robust PCA and matrix completion that allow one to observe only submatrices or subtensors of the data. By subsampling modes, we can obtain algorithms with state-of-the-art runtime for these tasks. Sample applications to image and video processing will be discussed.

##### Abstract

We study compressed sensing when the signal structure is the range of a (pre-trained) generative neural network (GNN) and give the first signal recovery guarantees when the measurements are subsampled from an orthonormal basis. This includes the subsampled Fourier transform as a special case, thus allowing a measurement model in line with applications such as MRI. We define a coherence parameter which depends upon the alignment of the orthonormal basis and the range of the GNN and show that small coherence implies robust signal recovery from relatively few measurements. Numerical experiments verify that regularizing to keep the coherence parameter small while training the GNN can improve performance in the signal recovery stage.

##### Abstract

We will introduce a mathematical signal transform based on Optimal Transport Theory. It builds upon the existing Cumulative Distribution Transform by Park, Kolouri, Kundu, and Rohde (2018). We will present both forward (analysis) and inverse (synthesis) formulas for this nonlinear transform, and describe several of its properties including translation, scaling, convexity, and linear separability. Indeed, since this tool is a new signal representation based on Optimal Transport theory, it has suitable properties for decoding information related to certain signal displacements. We will describe a Wasserstein-type metric in transform space, and show applications in classifying (detecting) signals under random displacements, and parameter estimation problems for certain types of generative models.

##### Abstract

Geometric Deep Learning is an emerging field of research that aims to extend the success of convolutional neural networks (CNNs) to data with non-Euclidean geometric structure such as graphs and manifolds. Despite being in its relative infancy, this field has already found great success and is utilized by e.g., Google Maps and Amazon’s recommender systems.

In order to improve our understanding of the networks used in this new field, several works have proposed novel versions of the scattering transform, a wavelet-based model of neural networks for graphs, manifolds, and more general measure spaces. In a similar spirit to the original Euclidean scattering transform, these geometric scattering transforms provide a mathematically rigorous framework for understanding the stability and invariance of the networks used in geometric deep learning. Additionally, they also have many interesting applications such as discovering new drug-like molecules, solving combinatorial optimization problems, and using single-cell data to predict whether or not a cancer patient will respond to treatment.

##### Abstract

We consider the nonlinear inverse problem of learning a transition operator ${A}$ from partial observations across different time scales, or in other words, from {sparse observations of its powers ${A},\cdots,{A}^{T-1}$. This Spatio-Temporal Transition Operator Recovery problem is motivated by the recent interest in learning time-varying graph signals that are driven by graph operators depending on the underlying graph topology. We address the non-linearity issue by embedding the problem into a higher-dimensional space of suitable block-Hankel matrices, where it becomes a low-rank matrix completion problem, even if ${A}$ is of full rank. For both a uniform and an adaptive random space-time sampling model, we quantify the recoverability of the transition operator via suitable measures of incoherence of these block-Hankel embedding matrices. For graph transition operators these measures of incoherence depend on the interplay between the dynamics and the graph topology. We establish quadratic local convergence analysis of a suitable non-convex iterative reweighted least squares (IRLS) algorithm, and show that in optimal scenarios, no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator ${A}$ of size $n \times n$. This establishes that {spatial samples} can be substituted by a comparable number of {space-time samples}. We provide an efficient implementation of the proposed IRLS algorithm with space complexity of order $O(r n T)$ and whose per-iteration time complexity is {linear} in $n$. Numerical experiments for transition operators based on several graph models confirm that the theoretical findings accurately track empirical phase transitions, and illustrate the applicability of the proposed algorithm.