Our understanding of modern neural networks lags behind their practical successes. This growing gap poses a challenge to the pace of progress in machine learning because fewer pillars of knowledge are available to designers of models and algorithms (Hanie Sedghi). Inspired by the ICML 2019 workshop Identifying and Understanding Deep Learning Phenomena, I collect papers and related resources that present interesting empirical studies and insights into the nature of deep learning.
ModelDiff: A Framework for Comparing Learning Algorithms. [paper] [code]
- Harshay Shah, Sung Min Park, Andrew Ilyas, Aleksander Madry.
- Key Word: Representation-based Comparison; Example-level Comparisons; Comparing Feature Attributions.
Digest
We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data.
Overfreezing Meets Overparameterization: A Double Descent Perspective on Transfer Learning of Deep Neural Networks. [paper]
- Yehuda Dar, Lorenzo Luzi, Richard G. Baraniuk.
- Key Word: Transfer Learning; Deep Double Descent; Overfreezing.
Digest
We study the generalization behavior of transfer learning of deep neural networks (DNNs). We adopt the overparameterization perspective — featuring interpolation of the training data (i.e., approximately zero train error) and the double descent phenomenon — to explain the delicate effect of the transfer learning setting on generalization performance. We study how the generalization behavior of transfer learning is affected by the dataset size in the source and target tasks, the number of transferred layers that are kept frozen in the target DNN training, and the similarity between the source and target tasks.
How to Fine-Tune Vision Models with SGD. [paper]
- Ananya Kumar, Ruoqi Shen, Sébastien Bubeck, Suriya Gunasekar.
- Key Word: Fine-Tuning; Out-of-Distribution Generalization.
Digest
We show that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first “embedding” layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: merely freezing the embedding layer (less than 1% of the parameters) leads to SGD performing competitively with AdamW while using less memory.
What Images are More Memorable to Machines? [paper] [code]
- Junlin Han, Huangying Zhan, Jie Hong, Pengfei Fang, Hongdong Li, Lars Petersson, Ian Reid.
- Key Word: Self-Supervised Memorization Quantification.
Digest
This paper studies the problem of measuring and predicting how memorable an image is to pattern recognition machines, as a path to explore machine intelligence. Firstly, we propose a self-supervised machine memory quantification pipeline, dubbed “MachineMem measurer”, to collect machine memorability scores of images. Similar to humans, machines also tend to memorize certain kinds of images, whereas the types of images that machines and humans memorize are different.
Harmonizing the object recognition strategies of deep neural networks with humans. [paper] [code]
- Thomas Fel, Ivan Felipe, Drew Linsley, Thomas Serre.
- Key Word: Interpretation; Neural Harmonizer; Psychophysics.
Digest
Across 84 different DNNs trained on ImageNet and three independent datasets measuring the where and the how of human visual strategies for object recognition on those images, we find a systematic trade-off between DNN categorization accuracy and alignment with human visual strategies for object recognition. State-of-the-art DNNs are progressively becoming less aligned with humans as their accuracy improves. We rectify this growing issue with our neural harmonizer: a general-purpose training routine that both aligns DNN and human visual strategies and improves categorization accuracy.
Pruning’s Effect on Generalization Through the Lens of Training and Regularization. [paper]
- Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite.
- Key Word: Pruning; Regularization.
Digest
We show that size reduction cannot fully account for the generalization-improving effect of standard pruning algorithms. Instead, we find that pruning leads to better training at specific sparsities, improving the training loss over the dense model. We find that pruning also leads to additional regularization at other sparsities, reducing the accuracy degradation due to noisy examples over the dense model. Pruning extends model training time and reduces model size. These two factors improve training and add regularization respectively. We empirically demonstrate that both factors are essential to fully explaining pruning’s impact on generalization.
What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries. [paper] [code]
- Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel S. Schoenholz.
- Key Word: Class Manifold; Linear Region; Out-of-Distribution Generalization.
Digest
Deep neural network classifiers partition input space into high confidence regions for each class. The geometry of these class manifolds (CMs) is widely studied and intimately related to model performance; for example, the margin depends on CM boundaries. We exploit the notions of Gaussian width and Gordon’s escape theorem to tractably estimate the effective dimension of CMs and their boundaries through tomographic intersections with random affine subspaces of varying dimension. We show several connections between the dimension of CMs, generalization, and robustness.
In What Ways Are Deep Neural Networks Invariant and How Should We Measure This? [paper]
- Henry Kvinge, Tegan H. Emerson, Grayson Jorgenson, Scott Vasquez, Timothy Doster, Jesse D. Lew. NeurIPS 2022
- Key Word: Invariance and Equivariance.
Digest
We explore the nature of invariance and equivariance of deep learning models with the goal of better understanding the ways in which they actually capture these concepts on a formal level. We introduce a family of invariance and equivariance metrics that allows us to quantify these properties in a way that disentangles them from other metrics such as loss or accuracy.
Relative representations enable zero-shot latent space communication. [paper]
- Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, Emanuele Rodolà.
- Key Word: Representation Similarity; Model stitching.
Digest
Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations. Ideally, the distribution of the data points in the latent space should depend only on the task, the data, the loss, and other architecture-specific constraints. However, factors such as the random weights initialization, training hyperparameters, or other sources of randomness in the training phase may induce incoherent latent spaces that hinder any form of reuse. Nevertheless, we empirically observe that, under the same data and modeling choices, distinct latent spaces typically differ by an unknown quasi-isometric transformation: that is, in each space, the distances between the encodings do not change. In this work, we propose to adopt pairwise similarities as an alternative data representation, that can be used to enforce the desired invariance without any additional training.
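The idea of representing a point by its similarities to other points can be sketched as follows. This is a minimal illustration, assuming cosine similarity to a fixed set of anchor embeddings as the similarity; the function names, the 2-D toy embeddings, and the rotation-plus-rescaling standing in for the unknown transformation are all hypothetical choices made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(x, anchors):
    """Describe x by its cosine similarity to each anchor embedding."""
    return [cosine(x, a) for a in anchors]

def rotate_and_scale(v, angle, scale):
    """Toy stand-in for the unknown map between two latent spaces (2-D only)."""
    c, s = math.cos(angle), math.sin(angle)
    return [scale * (c * v[0] - s * v[1]), scale * (s * v[0] + c * v[1])]

# Hypothetical 2-D embeddings produced by "model A".
anchors_a = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
sample_a = [0.3, 0.8]

# "Model B" embeds the same points in a rotated, rescaled space.
anchors_b = [rotate_and_scale(a, 1.1, 3.0) for a in anchors_a]
sample_b = rotate_and_scale(sample_a, 1.1, 3.0)

rel_a = relative_representation(sample_a, anchors_a)
rel_b = relative_representation(sample_b, anchors_b)
print(rel_a)
print(rel_b)
```

Although the absolute coordinates of the two toy spaces differ, the two relative representations agree up to floating-point error, because rotations and positive rescalings preserve angles.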
Minimalistic Unsupervised Learning with the Sparse Manifold Transform. [paper]
- Yubei Chen, Zeyu Yun, Yi Ma, Bruno Olshausen, Yann LeCun.
- Key Word: Self-Supervision; Sparse Manifold Transform.
Digest
We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10 and 53.2% on CIFAR-100.
A Review of Sparse Expert Models in Deep Learning. [paper]
- William Fedus, Jeff Dean, Barret Zoph.
- Key Word: Mixture-of-Experts.
Digest
Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
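The unifying idea, that each example is acted on by only a subset of the parameters, can be sketched with a toy top-1 router. This is an illustrative sketch, not any specific architecture from the review: the linear router, the single-matrix experts, and all names are assumptions, and the gate weighting and load balancing that practical layers often add are omitted.

```python
import random

def matvec(matrix, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def make_expert(rng, dim):
    """A hypothetical expert: one dim x dim weight matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]

def route_top1(x, router):
    """Pick the expert whose router row gives x the highest score."""
    scores = matvec(router, x)
    return max(range(len(scores)), key=scores.__getitem__)

def sparse_layer(x, router, experts):
    """Apply only the selected expert; returns (output, expert index)."""
    idx = route_top1(x, router)
    return matvec(experts[idx], x), idx

rng = random.Random(0)
dim, num_experts = 4, 8
router = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(rng, dim) for _ in range(num_experts)]

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y, chosen = sparse_layer(x, router, experts)

params_total = num_experts * dim * dim  # grows with the number of experts
params_used = dim * dim                 # expert weights touched for this example
print(chosen, params_used, params_total)
```

In this sketch, adding experts increases `params_total` while the expert weights applied to any one example stay fixed at `dim * dim`; only the small router computation grows with the number of experts.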
A Data-Based Perspective on Transfer Learning. [paper] [code]
- Saachi Jain, Hadi Salman, Alaa Khaddaj, Eric Wong, Sung Min Park, Aleksander Madry.
- Key Word: Transfer Learning; Influence Function; Data Leakage.
Digest
It is commonly believed that in transfer learning including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset’s composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness as well as detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset.
When Does Re-initialization Work? [paper]
- Sheheryar Zaidi, Tudor Berariu, Hyunjik Kim, Jörg Bornschein, Claudia Clopath, Yee Whye Teh, Razvan Pascanu.
- Key Word: Re-initialization; Regularization.
Digest
We conduct an extensive empirical comparison of standard training with a selection of re-initialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, re-initialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of re-initialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, re-initialization significantly improves upon standard training, even in the presence of other carefully tuned regularization techniques.
How You Start Matters for Generalization. [paper]
- Sameera Ramasinghe, Lachlan MacDonald, Moshiur Farazi, Hemanth Saratchandran, Simon Lucey.
- Key Word: Implicit regularization; Fourier Spectrum.
Digest
We promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization. Through a Fourier lens, we derive a general result for the spectral bias of neural networks and show that the generalization of neural networks is heavily tied to their initialization. Further, we empirically solidify the developed theoretical insights using practical, deep networks.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [paper] [code]
- Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer.
- Key Word: Natural Language Processing; In-Context Learning.
Digest
We show that ground truth demonstrations are in fact not required — randomly replacing labels in the demonstrations barely hurts performance, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.
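The label-randomization experiment can be pictured with a small prompt builder. This is a hedged sketch only: the "Review:/Sentiment:" template, the example texts, and the function name are hypothetical and are not taken from the paper. It shows that replacing gold labels with random ones leaves the label space, the input texts, and the overall format unchanged.

```python
import random

def build_prompt(demos, query, label_space, randomize_labels=False, seed=0):
    """Format k demonstrations plus a query; optionally swap gold labels for random ones."""
    rng = random.Random(seed)
    blocks = []
    for text, gold in demos:
        label = rng.choice(label_space) if randomize_labels else gold
        blocks.append(f"Review: {text}\nSentiment: {label}")
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

label_space = ["positive", "negative"]
demos = [
    ("A delightful, warm film.", "positive"),
    ("Dull and far too long.", "negative"),
    ("I would happily watch it again.", "positive"),
]
query = "The plot never comes together."

gold_prompt = build_prompt(demos, query, label_space)
random_prompt = build_prompt(demos, query, label_space, randomize_labels=True)
print(random_prompt)
```

Feeding both prompts to the same language model and comparing end-task accuracy would reproduce the shape of the comparison described above; the model call itself is left out because it depends on the API in use.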
Angular Visual Hardness. [paper]
- Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar. ICML 2020
- Key Word: Calibration; Example Hardness Measures.
Digest
We propose a novel measure for CNN models known as Angular Visual Hardness (AVH). Our comprehensive empirical studies show that AVH can serve as an indicator of generalization abilities of neural networks, and improving SOTA accuracy entails improving accuracy on hard examples.
Fantastic Generalization Measures and Where to Find Them. [paper] [code]
- Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio. ICLR 2020
- Key Word: Complexity Measures; Spurious Correlations.
Digest
We present the first large scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory. [paper] [code]
- Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein. ICLR 2020
- Key Word: Local Minima.
Digest
The authors take a closer look at widely held beliefs about neural networks. Using a mix of analysis and experiment, they shed some light on the ways these assumptions break down.
Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. [paper] [code]
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals. ICLR 2020
- Key Word: Meta Learning.
Digest
Despite MAML’s popularity, a fundamental open question remains — is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor.
Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias. [paper] [code]
- Stéphane d’Ascoli, Levent Sagun, Joan Bruna, Giulio Biroli. NeurIPS 2019
- Key Word: Architectural Bias.
Digest
In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.
Adversarial Training Can Hurt Generalization. [paper]
- Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, Percy Liang.
- Key Word: Adversarial Examples.
Digest
While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data.
Bad Global Minima Exist and SGD Can Reach Them. [paper] [code]
- Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas. NeurIPS 2020
- Key Word: Stochastic Gradient Descent.
Digest
Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can be easily made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling on the data, before switching to properly training with the correct labels.
Deep ReLU Networks Have Surprisingly Few Activation Patterns. [paper]
- Boris Hanin, David Rolnick. NeurIPS 2019
Digest
In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns.
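To make "activation pattern" concrete, the sketch below counts the distinct on/off patterns of the hidden neurons of a random ReLU network along a one-dimensional input grid. It is only an illustration under assumptions chosen here (Gaussian initialization, input dimension 1, a finite grid that can miss very thin regions) and does not reproduce the paper's experiments.

```python
import random

def random_relu_net(rng, widths, input_dim):
    """Random weights and biases for a fully connected ReLU network."""
    layers, fan_in = [], input_dim
    for width in widths:
        weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(width)]
        biases = [rng.gauss(0.0, 1.0) for _ in range(width)]
        layers.append((weights, biases))
        fan_in = width
    return layers

def activation_pattern(layers, x):
    """Tuple of 0/1 flags recording which hidden neurons are active at input x."""
    pattern, h = [], x
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        pattern.extend(1 if p > 0 else 0 for p in pre)
        h = [max(p, 0.0) for p in pre]
    return tuple(pattern)

def count_patterns_on_grid(layers, lo, hi, steps):
    """Distinct patterns seen on a 1-D grid: a lower bound on the count over [lo, hi]."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return len({activation_pattern(layers, [x]) for x in xs})

rng = random.Random(0)
shallow = random_relu_net(rng, widths=[16], input_dim=1)       # 16 neurons, depth 1
deep = random_relu_net(rng, widths=[8, 8, 8, 8], input_dim=1)  # 32 neurons, depth 4

n_shallow = count_patterns_on_grid(shallow, -5.0, 5.0, 5001)
n_deep = count_patterns_on_grid(deep, -5.0, 5.0, 5001)
print(n_shallow, n_deep)
```

For the one-hidden-layer network, each neuron's pre-activation is affine in the scalar input and so changes sign at most once, which caps the count at 17 patterns. For the deeper network, the observed count can be compared against the number of neurons raised to the input dimension (32 here).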
Sensitivity of Deep Convolutional Networks to Gabor Noise. [paper] [code]
- Kenneth T. Co, Luis Muñoz-González, Emil C. Lupu.
- Key Word: Robustness.
Digest
Deep Convolutional Networks (DCNs) have been shown to be sensitive to Universal Adversarial Perturbations (UAPs): input-agnostic perturbations that fool a model on large portions of a dataset. These UAPs exhibit interesting visual patterns, but this phenomenon is, as yet, poorly understood. Our work shows that visually similar procedural noise patterns also act as UAPs. In particular, we demonstrate that different DCN architectures are sensitive to Gabor noise patterns. This behaviour, its causes, and implications deserve further in-depth study.
Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks. [paper]
- Guangyong Chen, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, Shengyu Zhang.
- Key Word: Batch Normalization; Dropout.
Digest
Our work is based on an excellent idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent.
A critical analysis of self-supervision, or what we can learn from a single image. [paper] [code]
- Yuki M. Asano, Christian Rupprecht, Andrea Vedaldi. ICLR 2020
- Key Word: Self-Supervision.
Digest
We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training.
Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. [paper] [code]
- Wieland Brendel, Matthias Bethge. ICLR 2019
- Key Word: Bag-of-Features.
Digest
Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet.
Transfusion: Understanding Transfer Learning for Medical Imaging. [paper] [code]
- Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio. NeurIPS 2019
- Key Word: Transfer Learning; Medical Imaging.
Digest
We explore properties of transfer learning for medical imaging. A performance evaluation on two large scale medical imaging tasks shows that surprisingly, transfer offers little benefit to performance, and simple, lightweight models can perform comparably to ImageNet architectures.
Identity Crisis: Memorization and Generalization under Extreme Overparameterization. [paper]
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer. ICLR 2020
- Key Word: Memorization.
Digest
We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task.
Are All Layers Created Equal? [paper]
- Chiyuan Zhang, Samy Bengio, Yoram Singer. JMLR
- Key Word: Robustness.
Digest
We show that the layers can be categorized as either “ambient” or “critical”. Resetting the ambient layers to their initial values has no negative consequence, and in many cases they barely change throughout training. On the contrary, resetting the critical layers completely destroys the predictor and the performance drops to chance.
Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions. [paper] [code]
- Shaochen Zhong, Guanqun Zhang, Ningjia Huang, Shuai Xu. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
Digest
We revisit the idea of kernel pruning, a heavily overlooked approach under the context of structured pruning. This is because kernel pruning will naturally introduce sparsity to filters within the same convolutional layer — thus, making the remaining network no longer dense. We address this problem by proposing a versatile grouped pruning framework where we first cluster filters from each convolutional layer into equal-sized groups, prune the grouped kernels we deem unimportant from each filter group, then permute the remaining filters to form a densely grouped convolutional architecture (which also enables the parallel computing capability) for fine-tuning.
Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks. [paper]
- Arthur da Cunha, Emanuele Natale, Laurent Viennot. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent theoretical works proved an even stronger version: every sufficiently overparameterized (dense) neural network contains a subnetwork that, even without training, achieves accuracy comparable to that of the trained large network. These works left as an open problem to extend the result to convolutional neural networks (CNNs). In this work we provide such generalization by showing that, with high probability, it is possible to approximate any CNN by pruning a random CNN whose size is larger by a logarithmic factor.
Audio Lottery: Speech Recognition Made Ultra-Lightweight, Noise-Robust, and Transferable. [paper] [code]
- Shaojin Ding, Tianlong Chen, Zhangyang Wang. ICLR 2022
- Key Word: Lottery Ticket Hypothesis; Speech Recognition.
Digest
We investigate the tantalizing possibility of using lottery ticket hypothesis to discover lightweight speech recognition models, that are (1) robust to various noise existing in speech; (2) transferable to fit the open-world personalization; and (3) compatible with structured sparsity.
Strong Lottery Ticket Hypothesis with ε-perturbation. [paper]
- Zheyang Xiong, Fangshuo Liao, Anastasios Kyrillidis.
- Key Word: Lottery Ticket Hypothesis.
Digest
The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization.
Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers. [paper]
- Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar.
- Key Word: Sparse Activation; Large Models; Transformers.
Digest
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by “sparse” we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP.
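The quantity being reported, the average fraction of nonzero post-ReLU entries per input, can be computed as in the sketch below. The random pre-activations are hypothetical stand-ins for real MLP activations, so the printed number says nothing about T5 or ViT; only the measurement itself is illustrated.

```python
import random

def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [max(v, 0.0) for v in values]

def nonzero_fraction(activations):
    """Fraction of entries that are nonzero in one activation map."""
    return sum(1 for a in activations if a != 0.0) / len(activations)

def mean_nonzero_fraction(batch):
    """Average the per-input nonzero fraction over a batch of activation maps."""
    return sum(nonzero_fraction(a) for a in batch) / len(batch)

rng = random.Random(0)
# Hypothetical pre-activations: a mean of -2 makes most entries negative.
pre_activations = [[rng.gauss(-2.0, 1.0) for _ in range(512)] for _ in range(32)]
post_relu = [relu(row) for row in pre_activations]

sparsity_level = mean_nonzero_fraction(post_relu)
print(f"{100 * sparsity_level:.1f}% of entries are nonzero on average")
```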
Unmasking the Lottery Ticket Hypothesis: What’s Encoded in a Winning Ticket’s Mask? [paper]
- Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite.
- Key Word: Lottery Ticket Hypothesis; Mode Connectivity.
Digest
First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune.
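The notion of an error barrier along a linear path can be sketched numerically. The definition used here (highest value on the straight line between two weight vectors minus the mean of the two endpoint values) is a common convention and is an assumption of this sketch, as are the toy loss functions standing in for a network's test error.

```python
def interpolate(w_a, w_b, alpha):
    """Point on the straight line between two weight vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]

def error_barrier(loss_fn, w_a, w_b, steps=101):
    """Highest loss on the linear path minus the mean loss of the two endpoints."""
    path = [loss_fn(interpolate(w_a, w_b, i / (steps - 1))) for i in range(steps)]
    return max(path) - 0.5 * (path[0] + path[-1])

def single_basin(w):
    """Toy convex loss: any two points are linearly connected without a bump."""
    return sum(x * x for x in w)

def double_well(w):
    """Toy loss with minima at +1 and -1 per coordinate and a ridge in between."""
    return sum((x * x - 1.0) ** 2 for x in w)

same_basin = error_barrier(single_basin, [0.5, -0.5], [-0.5, 0.5])
across_wells = error_barrier(double_well, [-1.0, -1.0], [1.0, 1.0])
print(same_basin, across_wells)
```

A barrier of (approximately) zero, as in the first case, corresponds to the "linearly connected" situation described above, while the second case shows a positive barrier between two separate minima.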
How Erdös and Rényi Win the Lottery. [paper]
- Advait Gadhikar, Sohum Mukherjee, Rebekka Burkholz.
- Key Word: Lottery Ticket Hypothesis; Erdös-Rényi Random Graphs.
Digest
Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erdös-Rényi (ER) random graphs can often compete with dense architectures and state-of-the-art lottery ticket pruning algorithms struggle to outperform them, even though the random baselines do not rely on computationally expensive pruning-training iterations but can be drawn initially without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity 1/log(1/sparsity).
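Drawing such a random mask requires no training at all, as the sketch below illustrates. Each weight is kept independently with a fixed probability, which is one simple way to sample an Erdös-Rényi mask; the parameter name `density`, the layer shape, and the Gaussian weights are assumptions made for this illustration.

```python
import random

def er_mask(rows, cols, density, rng):
    """Keep each weight independently with probability `density`."""
    return [[1 if rng.random() < density else 0 for _ in range(cols)] for _ in range(rows)]

def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]

rng = random.Random(0)
rows, cols, density = 64, 128, 0.1

weights = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
mask = er_mask(rows, cols, density, rng)
sparse_weights = apply_mask(weights, mask)

kept = sum(map(sum, mask)) / (rows * cols)
print(f"kept {100 * kept:.1f}% of the weights")
```

The mask is fixed before training; in the setting described above, only the surviving weights of the (wider) sparse network would then be trained.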
SparCL: Sparse Continual Learning on the Edge. [paper]
- Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, Jennifer Dy. NeurIPS 2022
- Key Word: Continual Learning; Sparse Training.
Digest
We propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates.
One-shot Network Pruning at Initialization with Discriminative Image Patches. [paper]
- Yinan Yang, Ying Ji, Yu Wang, Heng Qi, Jien Kato.
- Key Word: One-Shot Network Pruning.
Digest
We propose two novel methods, Discriminative One-shot Network Pruning (DOP) and Super Stitching, to prune the network by high-level visual discriminative image patches. Our contributions are as follows. (1) Extensive experiments reveal that One-shot Network Pruning at Initialization (OPaI) is data-dependent. (2) Super Stitching performs significantly better than the original OPaI method on benchmark ImageNet, especially in a highly compressed model.
SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning. [paper] [code]
- Haoran You, Baopu Li, Zhanyi Sun, Xu Ouyang, Yingyan Lin. ECCV 2022
- Key Word: Lottery Ticket Hypothesis; Neural Architecture Search.
Digest
We discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term as SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training.
Lottery Ticket Hypothesis for Spiking Neural Networks. [paper]
- Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, Priyadarshini Panda. ECCV 2022
- Key Word: Lottery Ticket Hypothesis; Spiking Neural Networks.
Digest
Spiking Neural Networks (SNNs) have recently emerged as a new generation of low-power deep neural networks where binary spikes convey information across multiple timesteps. Pruning for SNNs is highly important as they are deployed on resource-constrained mobile/edge devices. Previous SNN pruning works focus on shallow SNNs (2~6 layers); however, state-of-the-art SNN works propose deeper SNNs (>16 layers), which are difficult to handle with current pruning methods. To scale up a pruning technique toward deep SNNs, we investigate the Lottery Ticket Hypothesis (LTH), which states that dense networks contain smaller subnetworks (i.e., winning tickets) that achieve comparable performance to the dense networks. Our studies on LTH reveal that the winning tickets consistently exist in deep SNNs across various datasets and architectures, providing up to 97% sparsity without huge performance degradation.
Winning the Lottery Ahead of Time: Efficient Early Network Pruning. [paper]
- John Rachwan, Daniel Zügner, Bertrand Charpentier, Simon Geisler, Morgane Ayle, Stephan Günnemann. ICML 2022
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
Digest
Although state-of-the-art pruning methods extract highly sparse models, they neglect two main challenges: (1) the process of finding these sparse models is often very expensive; (2) unstructured pruning does not provide benefits in terms of GPU memory, training time, or carbon emissions. We propose Early Compression via Gradient Flow Preservation (EarlyCroP), which efficiently extracts state-of-the-art sparse models before or early in training addressing challenge (1), and can be applied in a structured manner addressing challenge (2). This enables us to train sparse networks on commodity GPUs whose dense versions would be too large, thereby saving costs and reducing hardware requirements.
“Understanding Robustness Lottery”: A Comparative Visual Analysis of Neural Network Pruning Approaches. [paper]
- Zhimin Li, Shusen Liu, Xin Yu, Bhavya Kailkhura, Jie Cao, James Daniel Diffenderfer, Peer-Timo Bremer, Valerio Pascucci.
- Key Word: Lottery Ticket Hypothesis; Out-of-Distribution Generalization; Visualization.
Digest
This work aims to shed light on how different pruning methods alter the network’s internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding.
Data-Efficient Double-Win Lottery Tickets from Robust Pre-training. [paper] [code]
- Tianlong Chen, Zhenyu Zhang, Sijia Liu, Yang Zhang, Shiyu Chang, Zhangyang Wang. ICML 2022
- Key Word: Lottery Ticket Hypothesis; Adversarial Training; Robust Pre-training.
Digest
We formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts.
HideNseek: Federated Lottery Ticket via Server-side Pruning and Sign Supermask. [paper]
- Anish K. Vallapuram, Pengyuan Zhou, Young D. Kwon, Lik Hang Lee, Hengwei Xu, Pan Hui.
- Key Word: Lottery Ticket Hypothesis; Federated Learning.
Digest
We propose HideNseek which employs one-shot data-agnostic pruning at initialization to get a subnetwork based on weights’ synaptic saliency. Each client then optimizes a sign supermask multiplied by the unpruned weights to allow faster convergence with the same compression rates as state-of-the-art.
Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. [paper] [code]
- Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; Pre-training.
Digest
We seek to understand how this early phase of pre-training leads to a good initialization for IMP both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on “easy” training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP.
Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. [paper]
- Keitaro Sakamoto, Issei Sato. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; PAC-Bayes.
Digest
We confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer the PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets.
Dual Lottery Ticket Hypothesis. [paper] [code]
- Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, Yun Fu. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
Digest
This paper articulates a Dual Lottery Ticket Hypothesis (DLTH) as a dual format of the original Lottery Ticket Hypothesis (LTH). Correspondingly, a simple regularization-based sparse network training strategy, Random Sparse Network Transformation (RST), is proposed to validate DLTH and enhance sparse network training.
Rare Gems: Finding Lottery Tickets at Initialization. [paper]
- Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; Sanity Checks; Pruning at Initialization.
Digest
Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner, which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive with or better than Iterative Magnitude Pruning (IMP), and does so up to 19× faster.
Reconstruction Task Finds Universal Winning Tickets. [paper]
- Ruichen Li, Binghui Li, Qi Qian, Liwei Wang.
- Key Word: Lottery Ticket Hypothesis; Self-Supervision.
Digest
We show that the image-level pretrain task is not capable of pruning models for diverse downstream tasks. To mitigate this problem, we introduce image reconstruction, a pixel-level task, into the traditional pruning framework. Concretely, an autoencoder is trained based on the original model, and then the pruning process is optimized with both autoencoder and classification losses.
Finding Dynamics Preserving Adversarial Winning Tickets. [paper] [code]
- Xupeng Shi, Pengfei Zheng, A. Adam Ding, Yuan Gao, Weizhong Zhang. AISTATS 2022
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
Digest
Based on recent works on the Neural Tangent Kernel (NTK), we systematically study the dynamics of adversarial training and prove the existence of a trainable sparse sub-network at initialization which can be trained to be adversarially robust from scratch. This theoretically verifies the lottery ticket hypothesis in the adversarial context, and we refer to such a sub-network structure as an Adversarial Winning Ticket (AWT). We also show empirical evidence that AWT preserves the dynamics of adversarial training and achieves performance equal to that of dense adversarial training.
PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data. [paper] [code]
- Shreyas Malakarjun Patil, Constantine Dovrolis. ICLR 2021
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
Digest
Our work is based on a recently proposed decomposition of the Neural Tangent Kernel (NTK) that has decoupled the dynamics of the training process into a data-dependent component and an architecture-dependent kernel - the latter referred to as Path Kernel. That work has shown how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm. We first show that even though Synflow-L2 is optimal in terms of convergence, for a given network density, it results in sub-networks with “bottleneck” (narrow) layers - leading to poor performance as compared to other data-agnostic methods that use the same number of parameters.
A Gradient Flow Framework For Analyzing Network Pruning. [paper] [code]
- Ekdeep Singh Lubana, Robert P. Dick. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters.
Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. [paper] [code]
- Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee. NeurIPS 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
We conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call “initial tickets”), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.
Pruning Neural Networks at Initialization: Why are We Missing the Mark? [paper]
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
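The layer-wise shuffling ablation described in this digest can be sketched as follows. This is a minimal illustration, not the authors' code: the layer names and the flat 0/1 mask layout are assumptions made here for the example. Each layer's keep/prune flags are permuted, so the per-layer sparsity is unchanged while the per-weight decisions are discarded.

```python
import random


def shuffle_mask_within_layers(masks, seed=0):
    """Permute each layer's 0/1 mask independently.

    `masks` maps a (hypothetical) layer name to a flat list of flags,
    where 1 means the weight is kept. The number of kept weights per
    layer is preserved exactly; only their positions change.
    """
    rng = random.Random(seed)
    shuffled = {}
    for name, mask in masks.items():
        permuted = list(mask)  # copy so the input is not modified
        rng.shuffle(permuted)
        shuffled[name] = permuted
    return shuffled


# Toy masks for two made-up layers.
masks = {"conv1": [1, 0, 0, 1, 1, 0], "fc": [0, 0, 0, 1]}
shuffled = shuffle_mask_within_layers(masks)
```

If accuracy after training is unchanged under this transformation, the method's useful output is effectively the per-layer pruning fraction, which is the conclusion the digest reports.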
ESPN: Extremely Sparse Pruned Networks. [paper] [code]
- Minsu Cho, Ameya Joshi, Chinmay Hegde.
- Key Word: Lottery Ticket Hypothesis.
Digest
Deep neural networks are often highly overparameterized, prohibiting their use in compute-limited systems. However, a line of recent works has shown that the size of deep networks can be considerably reduced by identifying a subset of neuron indicators (or mask) that correspond to significant weights prior to training. We demonstrate that a simple iterative mask discovery method can achieve state-of-the-art compression of very deep networks. Our algorithm represents a hybrid approach between single shot network pruning methods (such as SNIP) and Lottery-Ticket type approaches. We validate our approach on several datasets and outperform several existing pruning approaches in both test accuracy and compression ratio.
Logarithmic Pruning is All You Need. [paper]
- Laurent Orseau, Marcus Hutter, Omar Rivasplata. NeurIPS 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.
Exploring Weight Importance and Hessian Bias in Model Pruning. [paper]
- Mingchen Li, Yahya Sattar, Christos Thrampoulidis, Samet Oymak.
- Key Word: Lottery Ticket Hypothesis.
Digest
Model pruning is an essential procedure for building compact and computationally-efficient machine learning models. A key feature of a good pruning algorithm is that it accurately quantifies the relative importance of the model weights. While model pruning has a rich history, we still don’t have a full grasp of the pruning mechanics even for relatively simple problems involving linear models or shallow neural nets. In this work, we provide a principled exploration of pruning by building on a natural notion of importance.
Progressive Skeletonization: Trimming more fat from a network at initialization. [paper] [code]
- Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H.S. Torr, Gregory Rogez, Puneet K. Dokania. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx 95%), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration.
Pruning neural networks without any data by iteratively conserving synaptic flow. [paper] [code]
- Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli.
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design.
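A data-free saliency of the kind this digest refers to can be sketched for a small fully connected network. The sketch assumes the synaptic-flow objective R = 1ᵀ|W_L|…|W_1|1 and scores each weight by (∂R/∂w)·w, computed on an all-ones input; the list-of-matrices layout and function names are illustrative choices made here. Only a single scoring pass is shown, whereas the paper's title indicates the scoring and pruning are applied iteratively.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def synaptic_flow_scores(layers):
    """Score every weight of an MLP without using any data.

    `layers` is a list of weight matrices W_1..W_L, each stored as
    rows = output units, columns = input units. Returns per-layer score
    matrices and the total flow R = 1^T |W_L| ... |W_1| 1.
    """
    abs_layers = [[[abs(w) for w in row] for row in W] for W in layers]

    # Forward pass of all-ones through |W|: the input seen by each layer.
    forward = [[1.0] * len(abs_layers[0][0])]
    for W in abs_layers:
        forward.append(matvec(W, forward[-1]))

    # Backward pass: gradient of R with respect to each layer's output.
    backward = [[1.0] * len(abs_layers[-1])]
    for W in reversed(abs_layers[1:]):
        transposed = [list(col) for col in zip(*W)]
        backward.append(matvec(transposed, backward[-1]))
    backward.reverse()

    scores = [
        [[g * w * x for w, x in zip(row, x_in)] for row, g in zip(W, g_out)]
        for W, x_in, g_out in zip(abs_layers, forward, backward)
    ]
    return scores, sum(forward[-1])


# Toy two-layer network: 2 inputs -> 2 hidden units -> 1 output.
layers = [[[1.0, -2.0], [0.5, 0.0]], [[-1.0, 3.0]]]
scores, total_flow = synaptic_flow_scores(layers)
layer_sums = [sum(sum(row) for row in layer) for layer in scores]
```

In this toy case every entry of `layer_sums` equals `total_flow`, which illustrates the layer-wise conservation that the "conserving synaptic flow" in the title alludes to.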
Finding trainable sparse networks through Neural Tangent Transfer. [paper] [code]
- Tianlin Liu, Friedemann Zenke. ICML 2020
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
Digest
We introduce Neural Tangent Transfer, a method that instead finds trainable sparse networks in a label-free manner. Specifically, we find sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks in function space. Finally, we evaluate our label-agnostic approach on several standard classification tasks and show that the resulting sparse networks achieve higher classification performance while converging faster.
What is the State of Neural Network Pruning? [paper] [code]
- Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag. MLSys 2020
- Key Word: Lottery Ticket Hypothesis; Survey.
Digest
Neural network pruning (the task of reducing the size of a network by removing parameters) has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods.
Comparing Rewinding and Fine-tuning in Neural Network Pruning. [paper] [code]
- Alex Renda, Jonathan Frankle, Michael Carbin. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
We compare fine-tuning to alternative retraining techniques. Weight rewinding (as proposed by Frankle et al. (2019)) rewinds unpruned weights to their values from earlier in training and retrains them from there using the original training schedule. Learning rate rewinding (which we propose) trains the unpruned weights from their final values using the same learning rate schedule as weight rewinding. Both rewinding techniques outperform fine-tuning, forming the basis of a network-agnostic pruning algorithm that matches the accuracy and compression ratios of several more network-specific state-of-the-art techniques.
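The difference between the three retraining strategies comes down to which weights training restarts from and which learning rates it uses. The sketch below encodes that as a small planning function. The function and key names are hypothetical, and treating fine-tuning as "continue at the final learning rate" is an assumption made here, since the digest does not define it.

```python
def retraining_plan(method, lr_schedule, rewind_epoch):
    """Describe how to retrain a pruned network.

    `lr_schedule[e]` is the learning rate used in epoch `e` of the
    original run. The result says which checkpoint the surviving weights
    are taken from and which learning rates are used for retraining.
    """
    final_epoch = len(lr_schedule)
    remaining = lr_schedule[rewind_epoch:]
    if method == "weight_rewinding":
        # Weights AND schedule both go back to `rewind_epoch`.
        return {"start_weights_epoch": rewind_epoch, "lrs": remaining}
    if method == "lr_rewinding":
        # Keep the final weights, replay only the learning rate schedule.
        return {"start_weights_epoch": final_epoch, "lrs": remaining}
    if method == "fine_tuning":
        # Assumption: keep the final weights and the final (small) rate.
        return {
            "start_weights_epoch": final_epoch,
            "lrs": [lr_schedule[-1]] * len(remaining),
        }
    raise ValueError(f"unknown method: {method}")


schedule = [0.1, 0.1, 0.01, 0.001]
plans = {
    m: retraining_plan(m, schedule, rewind_epoch=1)
    for m in ("weight_rewinding", "lr_rewinding", "fine_tuning")
}
```

All three plans retrain for the same number of epochs here, so the comparison isolates the starting weights and the learning rates.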
Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. [paper] [code]
- Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
Recent empirical works show that large deep neural networks are often highly redundant and one can find much smaller subnetworks without a significant drop of accuracy. However, most existing methods of network pruning are empirical and heuristic, leaving it open whether good subnetworks provably exist, how to find them efficiently, and if network pruning can be provably better than direct training using gradient descent. We answer these problems positively by proposing a simple greedy selection approach for finding good subnetworks, which starts from an empty network and greedily adds important neurons from the large network.
The Early Phase of Neural Network Training. [paper] [code]
- Jonathan Frankle, David J. Schwab, Ari S. Morcos. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations.
Robust Pruning at Initialization. [paper]
- Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, Yee Whye Teh.
- Key Word: Lottery Ticket Hypothesis.
Digest
We provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.
Picking Winning Tickets Before Training by Preserving Gradient Flow. [paper] [code]
- Chaoqi Wang, Guodong Zhang, Roger Grosse. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP).
Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning. [paper] [code]
- Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants have demonstrated remarkable performance in pruning modern architectures. Based on the observation that magnitude-based pruning indeed minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single layer optimization to a multi-layer optimization.
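The single-layer observation can be made concrete with a short sketch of plain magnitude pruning and the Frobenius distortion it induces. This shows the baseline only, not lookahead pruning itself, and the function names are illustrative.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with smallest |w|."""
    entries = sorted(
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    )
    n_prune = int(sparsity * len(entries))
    pruned = [list(row) for row in weights]  # copy, leave input intact
    for _, i, j in entries[:n_prune]:
        pruned[i][j] = 0.0
    return pruned


def frobenius_distortion(original, pruned):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(
        sum(
            (w - p) ** 2
            for row_w, row_p in zip(original, pruned)
            for w, p in zip(row_w, row_p)
        )
    )


W = [[0.5, -0.1, 2.0], [-0.05, 1.5, -0.3]]
W_pruned = magnitude_prune(W, sparsity=0.5)
distortion = frobenius_distortion(W, W_pruned)
```

Because the distortion is the root of the sum of squared removed weights, removing the smallest magnitudes minimizes it for a fixed number of removals in a single layer.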
Feature learning in neural networks and kernel machines that recursively learn features. [paper] [code]
- Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin.
- Key Word: Feature Learning; Kernel Machines; Grokking; Lottery Ticket Hypothesis.
Digest
We isolate the key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to the average gradient outer product. We subsequently leverage this mechanism to design Recursive Feature Machines (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, (2) close the gap between kernel machines and fully connected networks, and (3) surpass a broad spectrum of models including neural networks on tabular data.
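The average gradient outer product (AGOP) mentioned in this digest is the matrix (1/n) Σᵢ ∇f(xᵢ) ∇f(xᵢ)ᵀ. A minimal sketch follows, using a toy function with a hand-written gradient; the toy function is an assumption made here, and this is not the RFM training procedure.

```python
def average_gradient_outer_product(grad_fn, inputs):
    """Return (1/n) * sum_i grad f(x_i) grad f(x_i)^T as a list of rows."""
    dim = len(inputs[0])
    agop = [[0.0] * dim for _ in range(dim)]
    for x in inputs:
        g = grad_fn(x)
        for i in range(dim):
            for j in range(dim):
                agop[i][j] += g[i] * g[j] / len(inputs)
    return agop


def grad_f(x):
    """Gradient of the toy function f(x) = x[0] ** 2."""
    return [2.0 * x[0], 0.0]


M = average_gradient_outer_product(grad_f, [[1.0, 5.0], [2.0, -3.0]])
```

The toy function ignores its second coordinate, and the resulting matrix has mass only in the top-left entry. That is the sense in which the AGOP picks out the input directions a predictor actually depends on.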
Grokking phase transitions in learning local rules with gradient descent. [paper]
- Bojan Žunkovič, Enej Ilievski.
- Key Word: Tensor Network; Grokking; Many-Body Quantum Mechanics; Neural Collapse.
Digest
We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular automata learning task, numerically determine the critical exponent and the grokking time distributions and compare them with the prediction of the proposed grokking model. Finally, we numerically analyse the connection between structure formation and grokking.
Broken Neural Scaling Laws. [paper] [code]
- Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger.
- Key Word: Neural Scaling Laws.
Digest
We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, diffusion generative modeling of images, arithmetic, and reinforcement learning.
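For orientation, a smoothly broken power law of the kind described can be written as y = a + b·x^(−c₀) · Πᵢ (1 + (x/dᵢ)^(1/fᵢ))^(−cᵢ·fᵢ). The parameterization below follows the form reported in the paper as far as I can reconstruct it, so the symbol names should be checked against the paper before use.

```python
def broken_neural_scaling_law(x, a, b, c0, breaks=()):
    """Evaluate a smoothly broken power law at scale `x`.

    `a` is the limiting value, `b * x ** -c0` the initial power law, and
    each break is a tuple (c_i, d_i, f_i): the change in slope, the
    location of the break, and the sharpness of the transition.
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


# With no breaks the form reduces to an ordinary power law plus offset.
plain = broken_neural_scaling_law(4.0, a=1.0, b=2.0, c0=0.5)
# One break located exactly at x = 4 halves the power-law term here.
one_break = broken_neural_scaling_law(
    4.0, a=1.0, b=2.0, c0=0.5, breaks=[(1.0, 4.0, 1.0)]
)
```

Well below a break (x much smaller than dᵢ) the corresponding factor is close to 1, and well above it the log-log slope steepens by cᵢ.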
How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. [paper] [code]
- Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, Andrew Gordon Wilson.
- Key Word: Data Augmentation; Neural Scaling Laws; Implicit Regularization.
Digest
Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data.
Omnigrok: Grokking Beyond Algorithmic Data. [paper]
- Ziming Liu, Eric J. Michaud, Max Tegmark.
- Key Word: Grokking Dynamics.
Digest
Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the “LU mechanism” because training and test losses (against model weight norm) typically resemble “L” and “U”, respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc.
Revisiting Neural Scaling Laws in Language and Vision. [paper]
- Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai.
- Key Word: Neural Scaling Laws; Multi-modal Learning.
Digest
The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark.
On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence. [paper]
- Yi Ma, Doris Tsao, Heung-Yeung Shum.
- Key Word: Intelligence; Parsimony; Self-Consistency; Rate Reduction.
Digest
Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that we believe to be cornerstones for the emergence of Intelligence, artificial or natural. While these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable ways.
Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm. [paper]
- Lechao Xiao, Jeffrey Pennington. ICML 2022
- Key Word: Synergy; Symmetry; Implicit Bias; Neural Tangent Kernel; Neural Scaling Laws.
Digest
Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper, we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality.
How Much More Data Do I Need? Estimating Requirements for Downstream Tasks. [paper]
- Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law. CVPR 2022
- Key Word: Neural Scaling Laws; Active Learning.
Digest
Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements.
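The baseline this digest starts from can be sketched as follows: fit error ≈ a·n^(−b) by least squares in log-log space, then invert the fit to estimate the data set size needed for a target error. The data points are synthetic and noise-free, chosen here for illustration. The paper's generalized function family is not reproduced.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def required_size(a, b, target_error):
    """Invert error = a * n ** -b to get the n that reaches the target."""
    return (a / target_error) ** (1.0 / b)


# Synthetic learning curve that follows error = 2 * n ** -0.5 exactly.
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=0.05)
```

The inversion raises the fitted quantities to the power 1/b, so small errors in the fitted exponent become large errors in the estimated data requirement. This is one plausible reading of why a good curve fit does not immediately give a good size estimate.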
Beyond neural scaling laws: beating power law scaling via data pruning. [paper]
- Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos.
- Key Word: Dataset Pruning; Ensemble Active Learning.
Digest
Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet.
Exact Phase Transitions in Deep Learning. [paper]
- Liu Ziyin, Masahito Ueda.
- Key Word: Phase Transitions; Symmetry Breaking; Mean-Field Analysis; Statistical Physics.
Digest
The paper presents a theory that demonstrates the existence of first-order and second-order phase transitions in deep learning, similar to those observed in statistical physics, by analyzing the interplay between prediction error and model complexity in the training loss. The findings have implications for neural network optimization and shed light on the origin of the posterior collapse problem in Bayesian deep learning.
Towards Understanding Grokking: An Effective Theory of Representation Learning. [paper]
- Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams.
- Key Word: Grokking; Physics of Learning; Deep Double Descent.
Digest
We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. [paper] [code]
- Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra.
- Key Word: Grokking; Overfitting.
Digest
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting.
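A small algorithmically generated data set of the kind described can be built from a binary operation table. The sketch below uses addition modulo a prime and a random train/validation split. The choice of operation, the value of `p`, and the split fraction are illustrative assumptions, not the paper's exact configuration.

```python
import random


def modular_addition_dataset(p, train_fraction, seed=0):
    """Build all equations a + b = c (mod p) and split them at random."""
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, val = modular_addition_dataset(p=7, train_fraction=0.5)
```

The full table has only p² entries, so the model sees a fixed fraction of all possible equations, and validation accuracy directly measures whether it has learned the rule rather than memorized the training entries.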
Theoretical limitations of multi-layer Transformer. [paper]
- Lijie Chen, Binghui Peng, Hongxun Wu.
- Key Word: Transformer; Chain-of-Thought.
Digest
This paper establishes the first unconditional lower bound on the expressive power of multi-layer decoder-only transformers, demonstrating that they require polynomially large dimensions to perform sequential composition of L functions over n tokens for any constant L. It reveals that multi-layer transformers face an exponential depth-width trade-off, where fewer layers make tasks exponentially harder, and highlights an advantage of encoders over decoders for certain tasks, as well as the exponential simplification of tasks using chain-of-thought reasoning. The authors introduce a novel multi-party autoregressive communication model and a new proof technique for deriving lower bounds, providing foundational tools for understanding the computational power of transformers.
Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers. [paper]
- Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou.
- Key Word: Benign Overfitting; Feature Learning.
Digest
This paper studies benign overfitting in over-parameterized neural networks, specifically focusing on two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers. It finds that the initialization scale of the output layer significantly affects training dynamics. Large initialization scales make training similar to fixed-output scenarios, with the hidden layer growing while the output layer remains stable. Small scales lead to complex interactions where both layers grow proportionally. The paper also provides bounds on test errors, identifying conditions on initialization scale and signal-to-noise ratio (SNR) that determine whether benign overfitting occurs. Numerical experiments support these findings.
Simplicity Bias via Global Convergence of Sharpness Minimization. [paper]
- Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka.
- Key Word: Simplicity Bias; Sharpness Minimization.
Digest
This paper investigates the connection between the generalization ability of neural networks, typically attributed to the implicit bias of stochastic gradient descent (SGD), and the simplicity of the final trained model, particularly in relation to low-rank features. The authors focus on label noise SGD, which tends to converge to flatter regions of the loss landscape. They demonstrate that for two-layer neural networks, label noise SGD converges to a solution where all neurons replicate a single linear feature, leading to a rank-one feature matrix. Their key contribution is showing that label noise SGD minimizes sharpness on the zero-loss manifold and discovering a novel property of local geodesic convexity in the trace of the Hessian.
Loss Landscape Characterization of Neural Networks without Over-Parametrization. [paper]
- Rustem Islamov, Niccolò Ajroldi, Antonio Orvieto, Aurelien Lucchi.
- Key Word: Loss Landscape; Over-Parameterization; Invexity.
Digest
This paper addresses the challenge of ensuring convergence in optimization methods for deep learning, where the loss landscapes are non-convex. While the Polyak-Lojasiewicz (PL) inequality offers a common structural condition for convergence, it requires impractical over-parametrization in deep networks. The authors propose a new class of functions that describe the loss landscape of modern deep models without needing extensive over-parametrization and can account for saddle points. They prove that gradient-based optimizers converge under this new assumption and support it with theoretical analysis and empirical experiments.
Leveraging free energy in pretraining model selection for improved fine-tuning. [paper]
- Michael Munn, Susan Wei.
- Key Word: Model Selection; Free Energy.
Digest
This paper explores the success of the pretrain-then-adapt paradigm in artificial intelligence models, like BERT and GPT, and introduces a Bayesian model selection criterion called downstream free energy. This criterion evaluates a model’s adaptability to downstream tasks by measuring the concentration of favorable parameters near the pretrained checkpoint, without needing access to the downstream data. The authors show that this criterion correlates with improved fine-tuning performance, providing a way to predict how well pretrained models will adapt to new tasks.
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective. [paper]
- Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma.
- Key Word: Gradient Descent Dynamics; Loss Landscape.
Digest
The paper studies the Warmup-Stable-Decay (WSD) learning rate schedule, which allows training language models without a pre-fixed compute budget. WSD holds the learning rate constant for most of training (stable phase) and then applies a rapidly decaying learning rate (decay phase) to produce a strong model. Unlike traditional schedules, WSD yields a loss curve that stays elevated during the stable phase and drops sharply in the decay phase. The authors explain this with a “river valley” landscape analogy: large oscillations during the stable phase drive fast progress along the river, and the decay phase damps those oscillations to reveal that progress. They propose WSD-S, a variant that reuses decayed checkpoints and outperforms WSD and Cyclic-Cosine at producing language models across different compute budgets.
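The three phases described above can be sketched as a step-indexed schedule. This is a minimal illustration, not the authors' implementation: the function name, the linear warmup, the linear decay shape, and all parameter names are assumptions chosen for readability.

```python
def wsd_learning_rate(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay schedule (hypothetical helper; linear warmup and decay assumed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the learning rate constant for as long as training continues.
        return peak_lr
    # Decay: anneal linearly from peak_lr to min_lr over decay_steps, then hold min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress
```

Because `decay_start` can be chosen at any point of the stable phase, the same constant-rate run can be branched into a decayed checkpoint for any compute budget.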
The Optimization Landscape of SGD Across the Feature Learning Strength. [paper]
- Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan.
- Key Word: Loss Landscape; Feature Learning.
Digest
This paper investigates the effect of scaling a neural network’s final layer by a hyperparameter γ, which controls the strength of feature learning. The study explores how γ interacts with the learning rate η across various models and datasets in online training. The authors identify scaling regimes for the optimal learning rate: η* scales with γ² when γ is small and with γ^(2/L) for a network of depth L when γ is large. In the under-explored “ultra-rich” γ≫1 regime, networks exhibit distinctive loss curves with plateaus and steps, and optimize similarly across large γ values. The study highlights the importance of tuning γ for optimal performance and suggests further analytical exploration of the large-γ limit.
Provable Weak-to-Strong Generalization via Benign Overfitting. [paper]
- David X. Wu, Anant Sahai.
- Key Word: Weak-to-Strong Generalization; Benign Overfitting.
Digest
This paper explores weak-to-strong generalization, where a weak teacher supervises a strong student using imperfect pseudolabels, as introduced by Burns et al. (2023). The authors theoretically analyze this paradigm for binary and multilabel classification in an overparameterized Gaussian model, where the weak teacher’s pseudolabels are nearly random. They identify two outcomes for the student: successful generalization or random guessing. Their results highlight the importance of logits for weak supervision and include a new tight lower bound for the maximum of correlated Gaussians, potentially useful for extending to multiclass classification.
Autoregressive Large Language Models are Computationally Universal. [paper]
- Dale Schuurmans, Hanjun Dai, Francesco Zanini.
- Key Word: Autoregressive Large Language Model; Universal Turing Machine.
Digest
This paper demonstrates that autoregressive decoding of a transformer-based language model can achieve universal computation without modifying the model’s weights. The authors introduce a generalization of autoregressive decoding, where emitted tokens extend the context window for processing long inputs. They show this system corresponds to a Lag system, a known computationally universal model. By proving a universal Turing machine can be simulated with 2027 production rules, they test whether a large language model can mimic this behavior. They confirm that gemini-1.5-pro-001, with a specific prompt and greedy decoding, can function as a general-purpose computer under the Church-Turing thesis.
On the Geometry of Deep Learning. [paper]
- Randall Balestriero, Ahmed Imtiaz Humayun, Richard Baraniuk.
- Key Word: Geometry.
Digest
This paper explores the mathematical foundations of deep learning, focusing on the connection between deep networks and function approximation using affine splines (piecewise linear functions). It reviews recent work on the geometrical properties of deep networks, particularly how they tessellate input space, demonstrating how this perspective can enhance our understanding and optimization of deep networks.
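The affine-spline view can be illustrated on a toy network. The sketch below assumes a one-hidden-layer ReLU network with a scalar input; the function names and the example weights are hypothetical. It recovers the slope and intercept of the linear piece that contains a given input from the pattern of active units.

```python
def relu_net(x, w, b, a, c):
    """One-hidden-layer ReLU network on a scalar input: c + sum_j a_j * relu(w_j * x + b_j)."""
    return c + sum(aj * max(0.0, wj * x + bj) for wj, bj, aj in zip(w, b, a))


def local_affine_map(x, w, b, a, c):
    """Slope and intercept of the affine piece whose region contains x."""
    # The activation pattern at x determines which units contribute linearly.
    active = [wj * x + bj > 0 for wj, bj in zip(w, b)]
    slope = sum(aj * wj for wj, aj, on in zip(w, a, active) if on)
    intercept = c + sum(aj * bj for bj, aj, on in zip(b, a, active) if on)
    return slope, intercept


# Hypothetical weights; the kinks of this network sit at x = 0, 1 and 2.
w = [1.0, -1.0, 2.0]
b = [0.0, 1.0, -4.0]
a = [1.0, 0.5, -1.0]
c = 0.25
```

Within each interval between consecutive kinks the activation pattern is fixed, so `slope * x + intercept` reproduces `relu_net(x, ...)` exactly; those intervals are the one-dimensional analogue of the cells in the input-space tessellation.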
Understanding Finetuning for Factual Knowledge Extraction. [paper]
- Gaurav Ghosal, Tatsunori Hashimoto, Aditi Raghunathan.
- Key Word: Fine-Tuning; Factual Knowledge.
Digest
Fine-tuning question-answering models on lesser-known facts results in worse factuality compared to using well-known facts, as models may produce generic responses rather than accurate ones. This study shows that fine-tuning with well-known data improves performance, highlighting the need to consider the storage of facts in pretrained models for effective fine-tuning.
Hardness of Learning Neural Networks under the Manifold Hypothesis. [paper]
- Bobak T. Kiani, Jason Wang, Melanie Weber.
- Key Word: Manifold Hypothesis; Hardness of Learning Neural Networks.
Digest
The paper investigates the difficulty of learning neural networks under the manifold hypothesis, which posits that high-dimensional data lies on or near a low-dimensional manifold. It demonstrates that learning is hard under manifolds of bounded curvature but becomes feasible with additional assumptions on the manifold’s volume, suggesting that certain geometric properties can significantly impact the learnability of neural networks.
Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective. [paper]
- Fabian Falck, Ziyu Wang, Chris Holmes.
- Key Word: In-Context Learning; Bayesian Inference.
Digest
The paper examines the hypothesis that in-context learning (ICL) in large language models (LLMs) functions as Bayesian inference by analyzing the martingale property, a key requirement for Bayesian learning with exchangeable data. The authors find that while the martingale property is necessary for unambiguous predictions and principled uncertainty, their experiments reveal violations of this property and deviations from expected Bayesian behavior, thus challenging the hypothesis that ICL is inherently Bayesian.
Why Larger Language Models Do In-context Learning Differently? [paper]
- Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang.
- Key Word: Large Language Model; In-Context Learning.
Digest
Large language models (LLMs) exhibit in-context learning (ICL), performing well on new tasks using brief examples without parameter adjustments. This study theoretically explores why larger models are more sensitive to noise, finding that smaller models focus on key features and are more robust, while larger models cover more features and are more easily distracted, supported by preliminary experimental results.
Towards a Theoretical Understanding of the ‘Reversal Curse’ via Training Dynamics. [paper]
- Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell.
- Key Word: Large Language Model; Reasoning; Training Dynamics; Reversal Curse.
Digest
Auto-regressive large language models struggle with simple logical reasoning tasks like inverse search, known as the “reversal curse.” Through analyzing the training dynamics of two auto-regressive models, this paper reveals that the asymmetry in the weights is a core reason for the reversal curse and shows the necessity of chain-of-thought for one-layer transformers.
Understanding LLMs Requires More Than Statistical Generalization. [paper]
- Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár.
- Key Word: Large Language Model; Generalization Measure; Transferability; Inductive Biases.
Digest
Observing that the last decade of deep learning theory has largely tried to answer “Why does deep learning generalize?”, this paper argues for a shift in perspective in understanding the desirable qualities of Large Language Models (LLMs). The authors highlight the non-identifiability of autoregressive (AR) probabilistic models, where models with zero or near-zero KL divergence can exhibit different behaviors. They provide mathematical examples and empirical observations to support their argument and discuss the practical relevance of non-identifiability through three case studies. The paper concludes by reviewing research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
Categorical Deep Learning: An Algebraic Theory of Architectures. [paper]
- Bruno Gavranović, Paul Lessard, Andrew Dudzik, Tamara von Glehn, João G. M. Araújo, Petar Veličković.
- Key Word: Category Theory.
Digest
The paper addresses the challenge of creating a universal framework for defining and analyzing deep learning architectures. It criticizes previous efforts for failing to link the theoretical constraints of models with their practical implementations. The authors propose category theory, specifically the universal algebra of monads within a 2-category of parametric maps, as a comprehensive theory that covers both theoretical and practical aspects of neural network design. They argue that this approach can represent the constraints found in geometric deep learning as well as implementations of various neural network architectures, including Recurrent Neural Networks (RNNs). They also demonstrate how the theory naturally expresses standard concepts in computer science and automata theory.
A PAC-Bayesian Link Between Generalisation and Flat Minima. [paper]
- Maxime Haddouche, Paul Viallard, Umut Simsekli, Benjamin Guedj.
- Key Word: PAC-Bayes Generalization; Flat Minima.
Digest
The paper presents new generalization bounds for machine learning predictors in overparameterized settings, where the number of parameters exceeds dataset size. These bounds, which focus on gradient terms, are derived using the PAC-Bayes framework along with Poincaré and Log-Sobolev inequalities, circumventing the need for explicit consideration of predictor space dimension. The findings emphasize the beneficial impact of flat minima on generalization performance, underscoring the advantage of the optimization phase in enhancing model generalizability without directly depending on the model’s complexity or dataset size.
Tighter Generalisation Bounds via Interpolation. [paper]
- Paul Viallard, Maxime Haddouche, Umut Şimşekli, Benjamin Guedj.
- Key Word: PAC-Bayes Generalization Bounds.
Digest
This paper introduces a method for creating new PAC-Bayes generalization bounds using the (f,Γ)-divergence. It also presents interpolated PAC-Bayes bounds across various probability divergences, such as KL, Wasserstein, and total variation, tailored to the properties of posterior distributions. The study evaluates the tightness of these bounds and links them to established results in statistical learning, identifying them as specific instances. Furthermore, by implementing these bounds as training objectives, the paper demonstrates their effectiveness in providing significant guarantees and practical performance improvements in machine learning models.
Provably learning a multi-head attention layer. [paper]
- Sitan Chen, Yuanzhi Li.
- Key Word: Multi-Head Attention; Learning Theory.
Digest
The paper discusses the multi-head attention layer, a crucial feature of the transformer architecture that differentiates it from conventional feed-forward models. It explores the theoretical aspects of learning a multi-head attention layer through random examples, presenting the first significant upper and lower bounds for this challenge. The findings include a method that can learn the function of the multi-head attention layer with small error under specific conditions, using random labeled examples from a defined set. The study also indicates that an exponential dependency on the number of attention heads (m) is inevitable in the worst-case scenarios. This research uses Boolean inputs to reflect the discrete nature of tokens in large language models but notes that the approach can be adapted to continuous settings. The proposed algorithm, which diverges from traditional methods by focusing on shaping a convex body around unknown parameters, offers a new direction in provable learning algorithms beyond the common reliance on the Gaussian distribution’s properties.
Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling. [paper]
- Mingze Wang, Weinan E.
- Key Word: Expressivity; Transformer; Self-Attention Mechanism; Positional Encoding.
Digest
We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.
Residual Alignment: Uncovering the Mechanisms of Residual Networks. [paper]
- Jianing Li, Vardan Papyan. NeurIPS 2023
- Key Word: ResNet; Neural Collapse; Neural ODE; Optimal Transport.
Digest
This paper examines the ResNet architecture in deep learning, focusing on understanding its effectiveness through an analysis of its residual blocks. The study uncovers a phenomenon called Residual Alignment (RA), characterized by four properties: (RA1) even distribution of intermediate representations in high-dimensional space; (RA2) alignment of singular vectors in Residual Jacobians across different network depths; (RA3) limitation of Residual Jacobians’ rank by the number of classes in fully-connected ResNets; (RA4) inverse scaling of top singular values with network depth. Residual Alignment is found to be crucial for the model’s performance, occurring in well-generalizing models across various architectures and datasets. The absence of RA when skip connections are removed highlights their importance. The paper also proposes a mathematical model supporting these findings.
A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models. [paper]
- Namjoon Suh, Guang Cheng.
- Key Word: Survey; Learning Theory; Neural Tangent Kernel; Mean-Field Theory; Approximation Theory; Generative Modeling.
Digest
The paper reviews statistical theories of neural networks, focusing on three areas: first, it analyzes neural network risks and construction within nonparametric frameworks, noting limitations in the analysis of deep networks. Second, it discusses training dynamics, especially how networks trained via gradient-based methods generalize. This section reviews two key paradigms: Neural Tangent Kernel (NTK) and Mean-Field (MF). Finally, it examines advances in generative models, notably Generative Adversarial Networks (GANs), diffusion models, and in-context learning in Large Language Models. The paper concludes with future directions for deep learning theory.
Learning Theory from First Principles. [paper]
- Francis Bach.
- Key Word: Learning Theory; Book.
Digest
The goal of the class (and thus of this textbook) is to present old and recent results in learning theory for the most widely-used learning architectures. This class is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are significant users of learning methods such as computer vision or natural language processing.
Understanding the Regularity of Self-Attention with Optimal Transport. [paper]
- Valérie Castin, Pierre Ablin, Gabriel Peyré.
- Key Word: Self-Attention; Optimal Transport.
Digest
This paper analyzes the robustness of self-attention mechanisms in Transformers from a theoretical perspective. It studies the local Lipschitz constant of self-attention as a way to measure robustness agnostic to specific attacks. Using a measure-theoretic framework with the Wasserstein distance, it derives bounds on the Lipschitz constant on compact input spaces, showing it grows exponentially with input radius. It also finds measures with high Lipschitz constants typically have unbalanced mass concentrated in a few locations. Finally, it examines self-attention stability under perturbations changing token numbers, identifying a “mass splitting” phenomenon where duplicating tokens before perturbation can be a more effective attack.
Challenges with unsupervised LLM knowledge discovery. [paper]
- Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah.
- Key Word: Large Language Model; Unsupervised Knowledge Discovery.
Digest
The paper demonstrates that current unsupervised methods for large language models do not effectively uncover knowledge, but rather emphasize prominent features of the model’s activations. It introduces the concept of consistency structure for knowledge elicitation and presents experiments revealing that unsupervised methods may prioritize a different prominent feature over knowledge. The paper concludes that existing unsupervised methods are inadequate for discovering latent knowledge and suggests sanity checks for evaluating future knowledge elicitation methods. Additionally, it hypothesizes that identification issues, such as distinguishing a model’s knowledge from that of a simulated character’s, will persist in future unsupervised methods.
Proving Linear Mode Connectivity of Neural Networks via Optimal Transport. [paper]
- Damien Ferbach, Baptiste Goujaud, Gauthier Gidel, Aymeric Dieuleveut.
- Key Word: Linear Mode Connectivity; Optimal Transport.
Digest
This paper explores the energy landscape of high-dimensional non-convex optimization problems in deep neural networks. It theoretically explains the empirical observation that different solutions found in stochastic training are often connected by simple continuous paths, such as linear ones. The framework is based on convergence rates in Wasserstein distance, showing that wide two-layer neural networks trained with stochastic gradient descent are linearly connected with high probability. The paper also provides upper and lower bounds on the layer width for linear connectivity in deep neural networks. Empirical evidence supports the approach, linking the dimension of weight distribution support with Wasserstein convergence rates and linear mode connectivity.
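Linear connectivity between two solutions is commonly checked by measuring how much the loss rises along the straight line between them. The sketch below uses one common definition of that loss barrier; it is an illustration rather than the exact quantity bounded in the paper, it does not attempt any neuron matching or permutation alignment that such analyses may apply first, and `loss_fn`, `interpolate`, and `loss_barrier` are hypothetical names.

```python
def interpolate(theta_a, theta_b, t):
    """Point at fraction t on the segment between two flat parameter lists."""
    return [(1.0 - t) * pa + t * pb for pa, pb in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, num_points=21):
    """Largest excess of the path loss over the linear interpolation of the endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for k in range(num_points):
        t = k / (num_points - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, t))
        baseline = (1.0 - t) * loss_a + t * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier
```

A barrier close to zero indicates that the two solutions are linearly connected at the resolution of the grid; a double-well loss such as `(theta[0] ** 2 - 1) ** 2` evaluated between its two minima gives a clearly positive barrier.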
Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. [paper]
- Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou.
- Key Word: Stochastic Gradient Descent; Large Learning Rate; Feature Learning.
Digest
This paper investigates the generalization of neural networks trained using a stochastic gradient descent (SGD) algorithm with large learning rates. The key finding is that the weight oscillations caused by this training regime, termed “benign oscillation,” can improve generalization compared to networks trained with smaller learning rates that converge more smoothly. The theory is based on a feature learning perspective and demonstrates that large learning rate SGD allows networks to effectively learn weak features in the presence of strong features. Experimental results support the concept of “benign oscillation.”
It’s an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models. [paper]
- Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar.
- Key Word: Bias and Variance.
Digest
This paper challenges the conventional idea that bias and variance in machine learning trade off against each other. Instead, it demonstrates that, in deep learning ensemble models, bias and variance are closely related for correctly classified samples. The paper provides empirical evidence across different deep learning models and datasets. It also explores this phenomenon theoretically from two perspectives: calibration and neural collapse. The findings suggest a connection between bias and variance in these models.
Why Does Sharpness-Aware Minimization Generalize Better Than SGD? [paper]
- Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, Quanquan Gu. NeurIPS 2023
- Key Word: Sharpness-Aware Minimization.
Digest
This paper addresses the problem of overfitting in large neural networks and introduces Sharpness-Aware Minimization (SAM) as a method to improve generalization, even in the presence of label noise. It specifically investigates why SAM outperforms Stochastic Gradient Descent (SGD) in certain scenarios, using two-layer convolutional ReLU networks and a nonsmooth loss landscape. The paper’s findings suggest that SAM prevents early noise learning, making feature learning more effective. Experimental results on synthetic and real data support these theoretical insights.
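For readers unfamiliar with SAM, a single update can be sketched as follows. This is a minimal version of the standard two-gradient SAM step on flat parameter lists, not the specific setting analyzed in the paper; `grad_fn`, `lr`, and `rho` are illustrative names and defaults.

```python
import math


def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to a nearby worst-case point, then descend with its gradient."""
    grad = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Already at a stationary point; the normalized perturbation is undefined.
        return list(params)
    # Ascent step of fixed length rho along the normalized gradient.
    perturbed = [p + rho * g / norm for p, g in zip(params, grad)]
    # Descent step from the original point, using the gradient at the perturbed point.
    sam_grad = grad_fn(perturbed)
    return [p - lr * g for p, g in zip(params, sam_grad)]
```

Plain gradient descent would use `grad` directly in the last line; SAM differs only in evaluating the gradient at the perturbed parameters.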
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP. [paper]
- Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu.
- Key Word: Zero-Shot Transfer; Multi-Modal Foundation Models.
Digest
This paper focuses on multi-modal learning, which combines information from different data sources like text and images to enhance model performance. It highlights the success of CLIP, a method that learns joint image and text representations through contrastive pretraining. While CLIP has shown practical success, this paper aims to provide a formal theoretical understanding of its representation learning and how it aligns features from different modalities. The paper also analyzes CLIP’s performance in zero-shot transfer tasks and introduces a new CLIP-type approach inspired by their analysis, which outperforms CLIP and other state-of-the-art methods on benchmark datasets.
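The contrastive pretraining objective mentioned above can be sketched in plain Python. This shows the standard symmetric CLIP-style loss over a batch of paired embeddings, not the new approach proposed in the paper; the function names and the default temperature are assumptions.

```python
import math


def _normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image/text pairs share the same batch index."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # Cosine similarities scaled by the temperature; rows index images, columns index texts.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
```

The loss is small when each image embedding is most similar to its own caption and large when pairs are mismatched, which is what drives the alignment of features across modalities.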
Physics of Language Models: Part 3.2, Knowledge Manipulation. [paper]
- Zeyuan Allen-Zhu, Yuanzhi Li.
- Key Word: Large Language Models.
Digest
This paper investigates a language model’s ability to use its stored knowledge for various types of logical reasoning, including retrieval, classification, comparison, and inverse search. The study finds that pre-trained language models like GPT2/3/4 perform well in knowledge retrieval but struggle with classification and comparison tasks unless Chain of Thoughts (CoTs) are used during both training and inference. They also perform poorly in inverse knowledge search, regardless of the prompts. The paper’s main contribution is a synthetic dataset that confirms these limitations, showing that language models cannot efficiently manipulate their stored knowledge from pre-training data, even when the knowledge is perfectly stored and extractable in the models, and despite fine-tuning with appropriate instructions.
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. [paper]
- Zeyuan Allen-Zhu, Yuanzhi Li.
- Key Word: Large Language Models.
Digest
This paper investigates how large language models answer questions, specifically whether they rely on memorization or genuine knowledge extraction. Using controlled semi-synthetic biography data, the study reveals a connection between the model’s knowledge extraction ability and diversity measures of the training data. The paper employs linear probing techniques, showing a strong correlation between this relationship and whether the model encodes knowledge attributes in a linear fashion within the entity names’ hidden embeddings or across other tokens in the training text.
Fantastic Generalization Measures are Nowhere to be Found. [paper]
- Michael Gastpar, Ido Nachum, Jonathan Shafer, Thomas Weinberger.
- Key Word: Generalization Bound; Overparameterization.
Digest
The paper discusses generalization bounds for neural networks in the overparameterized setting. It highlights that existing generalization bounds are not tight enough to explain neural network performance. The paper examines two common types of generalization bounds: those depending on training data and output and those considering the learning algorithm. It mathematically proves that no bound of the first type can be uniformly tight in the overparameterized setting. For the second type, it shows a trade-off between algorithm performance and bound tightness, suggesting that tight generalization bounds are not possible without suitable assumptions on the population distribution.
Implicit regularization of deep residual networks towards neural ODEs. [paper]
- Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau.
- Key Word: Implicit Regularization; Neural Ordinary Differential Equations.
Digest
This paper establishes an implicit regularization connection between deep residual networks and neural ordinary differential equations (ODEs) when trained with gradient flow. It proves that if the network is initially set up as a discretization of a neural ODE, this relationship persists during training. These findings are valid for both finite training times and in the limit of infinite training time under certain conditions. The paper also demonstrates this connection in the context of specific residual network architectures and shows numerical experiments to support the results.
On the Implicit Bias of Adam. [paper]
- Matias D. Cattaneo, Jason M. Klusowski, Boris Shigida.
- Key Word: Adam; Implicit Bias; Ordinary Differential Equations.
Digest
This paper explores the concept of implicit regularization in optimization algorithms like RMSProp and Adam, comparing it to previous work on gradient descent trajectories. It demonstrates that these algorithms exhibit implicit regularization effects influenced by their hyperparameters and training stage. Specifically, they either penalize the one-norm of loss gradients or hinder its decrease. The paper supports these findings with numerical experiments and discusses their potential impact on generalization in machine learning.
Transformers as Support Vector Machines. [paper]
- Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak.
- Key Word: Transformer; Implicit Regularization.
Digest
This paper establishes a formal equivalence between self-attention in transformers and a hard-margin SVM problem. It characterizes the convergence behavior of 1-layer transformers optimized with gradient descent, showing that it can converge towards locally-optimal directions. The paper also demonstrates that over-parameterization facilitates global convergence and introduces a more general SVM equivalence for nonlinear heads. These findings suggest interpreting transformers as a hierarchy of SVMs for token selection and separation.
Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization. [paper]
- Kaiyue Wen, Tengyu Ma, Zhiyuan Li.
- Key Word: Sharpness-Aware Minimization.
Digest
The paper investigates the relationship between flatness and generalization in overparameterized neural networks. It identifies three scenarios for two-layer ReLU networks: (1) flatness implies generalization, (2) non-generalizing flattest models exist, and sharpness minimization algorithms fail to generalize, and (3) non-generalizing flattest models exist, but sharpness minimization algorithms still generalize. These findings indicate that the connection between sharpness and generalization depends on data distributions and model architectures, prompting the need to explore alternative explanations for the generalization of overparameterized neural networks.
On the curvature of the loss landscape. [paper]
- Alison Pouplin, Hrittik Roy, Sidak Pal Singh, Georgios Arvanitidis.
- Key Word: Loss Landscape; Scalar Curvature; Riemannian Manifold.
Digest
The paper investigates the challenge of understanding the excellent performance of over-parameterized deep learning models when trained on limited data. It proposes analyzing the generalization abilities of deep neural networks by treating the loss landscape as an embedded Riemannian manifold, focusing on the computable scalar curvature and its connections to potential generalization.
Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. [paper]
- Minhak Song, Chulhee Yun.
- Key Word: Edge of Stability; Bifurcation Theory.
Digest
The paper explores the Edge of Stability (EoS) phenomenon observed in the evolution of the largest eigenvalue of the loss Hessian during gradient descent (GD) training. It demonstrates that GD trajectories, when EoS occurs, align on a specific bifurcation diagram, independent of initialization, and provides rigorous proofs for this trajectory alignment in specific network architectures.
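The Edge of Stability is usually described relative to the classical stability threshold of gradient descent: on a quadratic with curvature s, GD with step size η contracts only when s < 2/η. The toy sketch below demonstrates just that threshold, not the paper's bifurcation analysis; the function name and the defaults are illustrative.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on L(x) = 0.5 * sharpness * x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        # The gradient is sharpness * x, so each step multiplies x by (1 - lr * sharpness).
        x -= lr * sharpness * x
    return x
```

With `lr=0.1` the threshold is 20: a sharpness slightly below it yields iterates that oscillate in sign while shrinking, and a sharpness slightly above it yields oscillations that grow.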
How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model. [paper]
- Leonardo Petrini, Francesco Cagnetta, Umberto M. Tomasini, Alessandro Favero, Matthieu Wyart.
- Key Word: Synonymic Invariance; Random Hierarchy Model.
Digest
This paper explores how deep convolutional neural networks (CNNs) learn compositional data by investigating the Random Hierarchy Model, demonstrating that the number of training data required by deep CNNs grows asymptotically as a polynomial function of the input dimensionality.
Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space. [paper]
- Zhengdao Chen.
- Key Word: Reproducing Kernel Hilbert Spaces; Mean-Field Theory.
Digest
The paper introduces the concept of Neural Hilbert Ladders (NHL), which views a multi-layer neural network as a hierarchy of reproducing kernel Hilbert spaces (RKHSs). It provides a generalized function space and complexity measure for deep neural networks (DNNs) and explores their theoretical properties and implications. The paper establishes a correspondence between L-layer neural networks and L-level NHLs, proves generalization guarantees for learning an NHL, analyzes the dynamics of NHLs in the infinite-width mean-field limit, demonstrates depth separation in NHLs under different activation functions, and supports the theory with numerical results.
Sparsity aware generalization theory for deep neural networks. [paper]
- Ramchandran Muthukumar, Jeremias Sulam.
- Key Word: Sparse Activation; Sensitivity Analysis; PAC-Bayes Bounds.
Digest
The paper presents a novel approach to analyzing the generalization capabilities of deep feed-forward ReLU networks by considering the degree of sparsity in the hidden layer activations, revealing trade-offs between sparsity and generalization without strong assumptions about sparsity levels.
Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows. [paper]
- Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré.
- Key Word: Gradient Flow; Conservation Laws; Lie Algebra.
Digest
This paper explores the concept of “conservation laws” in gradient flows and their relevance to understanding the implicit bias and generalization properties of over-parameterized machine learning models, presenting a rigorous definition of conservation laws, methods to determine the number of these quantities, and algorithms to compute polynomial and non-polynomial conservation laws.
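As a small worked example of what such a conserved quantity looks like, consider the well-known case of a two-layer linear network under gradient flow. The symbols $W_1$, $W_2$, $L$ and $G$ below are notation introduced here for illustration, not taken from the paper.

```latex
% Two-layer linear network f(x) = W_2 W_1 x with loss L(W_2 W_1), and G := \partial L / \partial (W_2 W_1).
% Gradient flow:
\dot W_1 = -W_2^\top G, \qquad \dot W_2 = -G W_1^\top .
% Differentiate the two Gram matrices:
\frac{d}{dt}\big(W_1 W_1^\top\big) = -W_2^\top G W_1^\top - W_1 G^\top W_2, \qquad
\frac{d}{dt}\big(W_2^\top W_2\big) = -W_1 G^\top W_2 - W_2^\top G W_1^\top .
% The right-hand sides coincide, so the difference is conserved along the flow:
\frac{d}{dt}\big(W_1 W_1^\top - W_2^\top W_2\big) = 0 .
```

The conserved matrix depends only on the parameters, not on the data or the loss, which is the kind of quantity the digest refers to.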
Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima. [paper]
- Dongkuk Si, Chulhee Yun.
- Key Word: Sharpness-Aware Minimization; Convergence.
Digest
This paper explores the convergence properties of Sharpness-Aware Minimization (SAM) optimizer when used with practical configurations, such as a constant perturbation size and gradient normalization, and finds that SAM has limited capability to converge to global minima or stationary points in many scenarios.
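A minimal sketch of one SAM step with a constant perturbation size and gradient normalization may help make the setting concrete. The function name `sam_step`, the toy loss $f(w) = w^2/2$, and the values of `lr` and `rho` are assumptions chosen for illustration; this is not the paper's analysis.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05, normalize=True, eps=1e-12):
    """One SAM update: ascend to a perturbed point, then descend with its gradient."""
    g = grad_fn(w)
    if normalize:
        # Practical SAM: perturbation of fixed length rho along the gradient direction.
        perturb = rho * g / (np.linalg.norm(g) + eps)
    else:
        # Unnormalized variant: perturbation length scales with the gradient norm.
        perturb = rho * g
    g_adv = grad_fn(w + perturb)
    return w - lr * g_adv

# Toy quadratic f(w) = 0.5 * w**2, whose gradient is w and whose minimum is w = 0.
quad_grad = lambda w: w

w = np.array([1.0])
for _ in range(500):
    w = sam_step(w, quad_grad)
final_distance = float(abs(w[0]))
```

On this toy problem the normalized iterate ends up oscillating at a distance of about `lr * rho / (2 - lr)` from the minimum instead of reaching it, which is one simple way to see why a constant perturbation size can prevent exact convergence.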
Transformers learn through gradual rank increase. [paper]
- Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind.
- Key Word: Transformer; Gradual Rank.
Digest
The paper identifies incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank, supported by theoretical proofs and experimental results.
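One way to track this quantity in practice is to compute a numerical rank of the difference between current and initial weights. The sketch below is an assumption about how such a measurement could be done; the helper `effective_rank`, the relative tolerance, and the synthetic rank-3 update are hypothetical.

```python
import numpy as np

def effective_rank(delta, rel_tol=1e-3):
    """Count singular values of `delta` above rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
w_init = rng.normal(size=(32, 32))
# Hypothetical "trained" weights: the initial weights plus a rank-3 update.
u = rng.normal(size=(32, 3))
v = rng.normal(size=(3, 32))
w_trained = w_init + u @ v
rank_of_update = effective_rank(w_trained - w_init)
```

Logging `effective_rank` for each weight matrix at several checkpoints would show whether the update rank grows over training.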
Learning via Wasserstein-Based High Probability Generalisation Bounds. [paper]
- Paul Viallard, Maxime Haddouche, Umut Simsekli, Benjamin Guedj.
- Key Word: Generalization Bounds; PAC-Bayes; Wasserstein Distance.
Digest
This work addresses the limitations of the PAC-Bayesian framework and introduces novel Wasserstein distance-based PAC-Bayesian generalization bounds. Previous bounds relying on the Kullback-Leibler (KL) divergence were limited in capturing the geometric structure of learning problems. The proposed bounds overcome this by utilizing the Wasserstein distance, which offers stronger guarantees in terms of high probability, applicability to unbounded losses, and optimizable training objectives. The derived Wasserstein-based PAC-Bayesian learning algorithms demonstrate empirical advantages in various experiments.
Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training. [paper]
- Rie Johnson, Tong Zhang.
- Key Word: Generalization Gap; Inconsistency; Instability.
Digest
The authors study how the stochasticity of training deep neural networks affects their generalization gap. They propose two measures, inconsistency and instability, that can be computed on unlabeled data and show that they are correlated with the generalization gap. They also suggest ways to reduce inconsistency and improve performance. They claim that inconsistency is more informative than the loss sharpness for predicting generalization gap.
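A disagreement measure of this kind can be estimated from model outputs alone. The sketch below assumes a KL-based comparison between the predicted class probabilities of two training runs on the same unlabeled inputs; the paper's exact definition may differ, and `mean_kl` and the example arrays are hypothetical.

```python
import numpy as np

def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over examples; p and q are (n_examples, n_classes) probabilities."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

# Hypothetical softmax outputs of two training runs on the same unlabeled inputs.
probs_run_a = np.array([[0.5, 0.5], [0.9, 0.1]])
probs_run_b = np.array([[0.9, 0.1], [0.9, 0.1]])
inconsistency_estimate = mean_kl(probs_run_a, probs_run_b)
```

No labels are needed, which is what makes such a measure usable on unlabeled data.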
What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization. [paper]
- Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, Zhaoran Wang.
- Key Word: In-Context Learning; Transformer; Bayesian Model Averaging.
Digest
This paper studies In-Context Learning (ICL), the ability of large language models to learn new tasks from a few examples in the context. It answers three questions: (a) How do language models perform ICL? (b) How to measure ICL performance and error rates? (c) What makes the transformer architecture suitable for ICL? The paper shows that ICL can be seen as an implicit Bayesian inference process that leverages the attention mechanism. It also analyzes the ICL regret, approximation and generalization bounds from an online learning perspective, and provides a comprehensive understanding of the transformer and its ICL ability with theoretical and empirical evidence.
Benign Overfitting in Deep Neural Networks under Lazy Training. [paper]
- Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, Francesco Locatello, Volkan Cevher. ICML 2023
- Key Word: Benign Overfitting; Lazy Training; Neural Tangent Kernel.
Digest
The paper studies how gradient descent trains over-parameterized deep ReLU networks to achieve optimal classification performance under certain conditions. The paper connects over-parameterization, benign overfitting, and Lipschitz constant of the networks. The paper also shows that smoother functions and Neural Tangent Kernel regime improve generalization. The paper gives lower bounds on margin and eigenvalue for non-smooth activation functions.
Most Neural Networks Are Almost Learnable. [paper]
- Amit Daniely, Nathan Srebro, Gal Vardi.
- Key Word: Neural Network Learnability.
Digest
The authors assume that the network’s weights are initialized randomly using a standard scheme and that the input distribution is uniform on a sphere. They show that random networks with Lipschitz activation functions can be approximated by low-degree polynomials, and use this to derive a polynomial-time approximation scheme (PTAS) for learning them. They also show that for sigmoid and ReLU-like activation functions, the PTAS can be improved to a quasi-polynomial-time algorithm. They support their theory with experiments on three network architectures and three datasets.
The Crucial Role of Normalization in Sharpness-Aware Minimization. [paper]
- Yan Dai, Kwangjun Ahn, Suvrit Sra.
- Key Word: Sharpness-Aware Minimization; Normalization.
Digest
Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer that greatly improves the prediction performance of deep neural networks. There has been a surge of interest in explaining its empirical success. We focus on understanding the role played by normalization, a key component of the SAM updates. We study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization. These two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM.
From Tempered to Benign Overfitting in ReLU Neural Networks. [paper]
- Guy Kornowski, Gilad Yehudai, Ohad Shamir.
- Key Word: Overparameterized neural networks; Benign overfitting; Tempered overfitting.
Digest
Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on “benign overfitting”, where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as “tempered overfitting”. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions.
When are ensembles really effective? [paper]
- Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, Michael W. Mahoney.
- Key Word: Ensemble; Disagreement-Error Ratio.
Digest
Ensembling is a machine learning technique that combines multiple models to improve the overall performance. Ensembling has a long history in statistical data analysis, but its benefits are not always obvious in modern machine learning settings. We study the fundamental question of when ensembling yields significant performance improvements in classification tasks. We prove new results relating the ensemble improvement rate to the disagreement-error ratio. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate.
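To make the disagreement-error ratio concrete, the following sketch computes it from the predictions of several classifiers. It assumes the ratio is the mean pairwise disagreement rate divided by the mean individual error rate, which is my reading of the digest; the function name and the toy predictions are hypothetical.

```python
import numpy as np
from itertools import combinations

def disagreement_error_ratio(preds, labels):
    """preds: (n_models, n_examples) predicted classes; labels: (n_examples,) true classes."""
    # Average error rate over all models and examples (must be nonzero).
    avg_error = float(np.mean(preds != labels))
    # Average fraction of examples on which two distinct models disagree.
    pair_rates = [np.mean(preds[i] != preds[j])
                  for i, j in combinations(range(len(preds)), 2)]
    disagreement = float(np.mean(pair_rates))
    return disagreement / avg_error, disagreement, avg_error

labels = np.array([0, 0, 0, 0])
# Three hypothetical classifiers, each wrong on a different example.
preds = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0]])
der, disagreement, avg_error = disagreement_error_ratio(preds, labels)
```

Here each model errs on a different example, so disagreement is high relative to error and a majority vote corrects every mistake.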
The Hessian perspective into the Nature of Convolutional Neural Networks. [paper]
- Sidak Pal Singh, Thomas Hofmann, Bernhard Schölkopf. ICML 2023
- Key Word: Hessian Maps; Convolutional Neural Networks.
Digest
We provide a novel perspective on Convolutional Neural Networks (CNNs) by studying their Hessian maps, which capture parameter interactions. Using a Toeplitz representation framework, we reveal the Hessian structure and establish tight upper bounds on its rank. Our findings show that the Hessian rank in CNNs grows as the square root of the number of parameters, challenging previous assumptions.
Model-agnostic Measure of Generalization Difficulty. [paper] [code]
- Akhilan Boopathy, Kevin Liu, Jaedong Hwang, Shu Ge, Asaad Mohammedsaleh, Ila Fiete. ICML 2023
- Key Word: Generalization Difficulty; Information Content of Inductive Biases.
Digest
The measure of a machine learning algorithm is the difficulty of the tasks it can perform, and sufficiently difficult tasks are critical drivers of strong machine learning models. However, quantifying the generalization difficulty of machine learning benchmarks has remained challenging. We propose what is to our knowledge the first model-agnostic measure of the inherent generalization difficulty of tasks. Our inductive bias complexity measure quantifies the total information required to generalize well on a task minus the information provided by the data.
Wasserstein PAC-Bayes Learning: A Bridge Between Generalisation and Optimisation. [paper]
- Maxime Haddouche, Benjamin Guedj.
- Key Word: PAC-Bayes Bound; Wasserstein Distances.
Digest
PAC-Bayes learning is an established framework to assess the generalisation ability of learning algorithms during the training phase. However, it remains challenging to know whether PAC-Bayes is useful to understand, before training, why the output of well-known algorithms generalises well. We positively answer this question by expanding the Wasserstein PAC-Bayes framework, briefly introduced in Amit et al. (2022). We provide new generalisation bounds exploiting geometric assumptions on the loss function. Using our framework, we prove, before any training, that the output of an algorithm from Lambert et al. (2022) has a strong asymptotic generalisation ability. More precisely, we show that it is possible to incorporate optimisation results within a generalisation framework, building a bridge between PAC-Bayes and optimisation algorithms.
Do deep neural networks have an inbuilt Occam’s razor? [paper]
- Chris Mingard, Henry Rees, Guillermo Valle-Pérez, Ard A. Louis.
- Key Word: Kolmogorov Complexity; Inductive Bias; Occam’s Razor; No Free Lunch Theorems.
Digest
The remarkable performance of overparameterized deep neural networks (DNNs) must arise from an interplay between network architecture, training algorithms, and structure in the data. To disentangle these three components, we apply a Bayesian picture, based on the functions expressed by a DNN, to supervised learning. The prior over functions is determined by the network, and is varied by exploiting a transition between ordered and chaotic regimes. For Boolean function classification, we approximate the likelihood using the error spectrum of functions on data. When combined with the prior, this accurately predicts the posterior, measured for DNNs trained with stochastic gradient descent. This analysis reveals that structured data, combined with an intrinsic Occam’s razor-like inductive bias towards (Kolmogorov) simple functions that is strong enough to counteract the exponential growth of the number of functions with complexity, is a key to the success of DNNs.
The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning. [paper]
- Micah Goldblum, Marc Finzi, Keefer Rowan, Andrew Gordon Wilson.
- Key Word: No Free Lunch Theorem; Kolmogorov Complexity; Model Selection.
Digest
No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains.
The Benefits of Mixup for Feature Learning. [paper]
- Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu.
- Key Word: Mixup; Data Augmentation; Feature Learning.
Digest
We first show that Mixup using different linear interpolation parameters for features and labels can still achieve similar performance to the standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully explain the success of Mixup. Then we perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data). In contrast, standard training can only learn the common features but fails to learn the rare features, thus suffering from bad generalization performance. Moreover, our theoretical analysis also shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose to apply early stopping in Mixup.
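The variant with separate interpolation parameters can be sketched as follows. The function `mixup_batch` and its arguments `lam_x` and `lam_y` are hypothetical names; setting `lam_x == lam_y` recovers standard Mixup.

```python
import numpy as np

def mixup_batch(x, y_onehot, lam_x, lam_y, rng):
    """Mix each example with a randomly chosen partner from the same batch."""
    perm = rng.permutation(len(x))
    # Features and labels are interpolated with (possibly different) weights.
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam_y * y_onehot + (1.0 - lam_y) * y_onehot[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # toy batch of 4 examples with 3 features
y = np.eye(2)[[0, 1, 0, 1]]          # one-hot labels for 2 classes
x_mix, y_mix = mixup_batch(x, y, lam_x=0.7, lam_y=0.5, rng=rng)
```

In a typical training loop the interpolation weight is resampled per batch (commonly from a Beta distribution); fixed values are used here only to keep the sketch short.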
Bayes Complexity of Learners vs Overfitting. [paper]
- Grzegorz Głuch, Rudiger Urbanke.
- Key Word: PAC-Bayes; Bayes Complexity; Overfitting.
Digest
We introduce a new notion of complexity of functions and we show that it has the following properties: (i) it governs a PAC Bayes-like generalization bound, (ii) for neural networks it relates to natural notions of complexity of functions (such as the variation), and (iii) it explains the generalization gap between neural networks and linear schemes.
Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization. [paper]
- Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro.
- Key Word: Benign Overfitting; Implicit Bias.
Digest
Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush-Kuhn-Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data.
The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks. [paper]
- Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro.
- Key Word: Implicit Bias; Adversarial Robustness.
Digest
In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples.
Why (and When) does Local SGD Generalize Better than SGD? [paper]
- Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora. ICLR 2023
- Key Word: Local Stochastic Gradient Descent; Stochastic Differential Equations.
Digest
This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small learning rate regime, showing how noise drives the iterate to drift and diffuse after it has reached close to the manifold of local minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having a small learning rate and long enough training time enables the generalization improvement over SGD but removing either of the two conditions leads to no improvement.
Hiding Data Helps: On the Benefits of Masking for Sparse Coding. [paper]
- Muthu Chidambaram, Chenwei Wu, Yu Cheng, Rong Ge.
- Key Word: Sparse Coding; Self-Supervised Learning.
Digest
We show that for over-realized sparse coding in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the ground-truth dictionary, regardless of the magnitude of the signal in the data-generating process. Furthermore, drawing from the growing body of work on self-supervised learning, we propose a novel masking objective and we prove that minimizing this new objective can recover the ground-truth dictionary.
Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width. [paper]
- Dayal Singh Kalra, Maissam Barkeshli.
- Key Word: Sharpness; Neural Tangent Kernel.
Digest
By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late time “edge of stability” regime.
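The sharpness measure used here, the largest Hessian eigenvalue, can be estimated without forming the Hessian by running power iteration on Hessian-vector products. The sketch below is an illustration under assumptions: `top_hessian_eigenvalue` is a hypothetical helper, the Hessian-vector product uses central finite differences of a user-supplied gradient function, and power iteration returns the eigenvalue of largest magnitude, which matches the maximum eigenvalue only when that one dominates.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, n_iters=100, fd_eps=1e-4, seed=0):
    """Estimate the dominant Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iters):
        # Central finite-difference approximation of the Hessian-vector product H v.
        hv = (grad_fn(w + fd_eps * v) - grad_fn(w - fd_eps * v)) / (2.0 * fd_eps)
        eig = float(v @ hv)              # Rayleigh quotient with the unit vector v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eig

# Toy quadratic loss 0.5 * w^T A w with known Hessian A = diag(3, 1, 0.5).
A = np.diag([3.0, 1.0, 0.5])
sharpness = top_hessian_eigenvalue(lambda w: A @ w, np.ones(3))
```

For gradient descent on a quadratic with learning rate `lr`, the iterates are stable only when the sharpness stays below `2 / lr`, which is the threshold that the "edge of stability" regime refers to.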
Sharpness-Aware Minimization: An Implicit Regularization Perspective. [paper]
- Kayhan Behdin, Rahul Mazumder.
- Key Word: Sharpness-Aware Minimization; Implicit Regularization.
Digest
We study SAM through an implicit regularization lens, and present a new theoretical explanation of why SAM generalizes well. To this end, we study the least-squares linear regression problem and show a bias-variance trade-off for SAM’s error over the course of the algorithm. We show SAM has lower bias compared to Gradient Descent (GD), while having higher variance.
Modular Deep Learning. [paper]
- Jonas Pfeiffer, Sebastian Ruder, Ivan Vulić, Edoardo Maria Ponti.
- Key Word: Parameter-Efficient Fine-Tuning; Mixture-of-Experts; Routing; Model Aggregation.
Digest
Modular deep learning has emerged as a promising solution to two challenges of transfer learning: specialising models towards multiple tasks without negative interference, and generalising systematically to non-identically distributed tasks. In this framework, units of computation are often implemented as autonomous parameter-efficient modules. Information is conditionally routed to a subset of modules and subsequently aggregated. These properties enable positive transfer and systematic generalisation by separating computation from routing and updating modules locally. We offer a survey of modular architectures, providing a unified view over several threads of research that evolved independently in the scientific literature.
mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization. [paper]
- Kayhan Behdin, Qingquan Song, Aman Gupta, Sathiya Keerthi, Ayan Acharya, Borja Ocejo, Gregory Dexter, Rajiv Khanna, David Durfee, Rahul Mazumder.
- Key Word: Sharpness-Aware Minimization.
Digest
We focus on a variant of SAM known as micro-batch SAM (mSAM), which, during training, averages the updates generated by adversarial perturbations across several disjoint shards (micro batches) of a mini-batch. We extend a recently developed and well-studied general framework for flatness analysis to show that distributed gradient computation for sharpness-aware minimization theoretically achieves even flatter minima.
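The micro-batch averaging described above can be sketched as follows. This is a rough illustration, assuming a normalized perturbation of radius `rho` per shard and a least-squares loss as a stand-in model; `msam_gradient` and `lsq_grad` are hypothetical names.

```python
import numpy as np

def lsq_grad(w, x, y):
    """Gradient of the mean squared error 0.5 * mean((x @ w - y)**2)."""
    return x.T @ (x @ w - y) / len(x)

def msam_gradient(w, x, y, grad_fn, n_micro=4, rho=0.05, eps=1e-12):
    """Average the SAM gradients computed separately on disjoint micro-batches."""
    grads = []
    for xs, ys in zip(np.array_split(x, n_micro), np.array_split(y, n_micro)):
        g = grad_fn(w, xs, ys)
        # Each shard gets its own adversarial perturbation of length rho.
        w_adv = w + rho * g / (np.linalg.norm(g) + eps)
        grads.append(grad_fn(w_adv, xs, ys))
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
update_direction = msam_gradient(w, x, y, lsq_grad)
```

With `rho=0` and equal-sized shards the result reduces to the ordinary mini-batch gradient, so the only difference from plain SGD comes from the per-shard perturbations.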
Machine Love. [paper]
- Joel Lehman.
- Key Word: Maslow’s Gridworld; Psychology.
Digest
While ML generates much economic value, many of us have problematic relationships with social media and other ML-powered applications. One reason is that ML often optimizes for what we want in the moment, which is easy to quantify but at odds with what is known scientifically about human flourishing. Thus, through its impoverished models of us, ML currently falls far short of its exciting potential, which is for it to help us to reach ours. While there is no consensus on defining human flourishing, from diverse perspectives across psychology, philosophy, and spiritual traditions, love is understood to be one of its primary catalysts. Motivated by this view, this paper explores whether there is a useful conception of love fitting for machines to embody, as historically it has been generative to explore whether a nebulous concept, such as life or intelligence, can be thoughtfully abstracted and reimagined, as in the fields of machine intelligence or artificial life.
PAC-Bayesian Generalization Bounds for Adversarial Generative Models. [paper]
- Sokhna Diarra Mbacke, Florence Clerc, Pascal Germain.
- Key Word: PAC-Bayes; Generative Model Generalization Bound.
Digest
We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two.
SAM operates far from home: eigenvalue regularization as a dynamical phenomenon. [paper]
- Atish Agarwala, Yann N. Dauphin.
- Key Word: Sharpness-Aware Minimization.
Digest
Our work reveals that SAM provides a strong regularization of the eigenvalues throughout the learning trajectory. We show that in a simplified setting, SAM dynamically induces a stabilization related to the edge of stability (EOS) phenomenon observed in large learning rate gradient descent. Our theory predicts the largest eigenvalue as a function of the learning rate and SAM radius parameters.
Interpolation Learning With Minimum Description Length. [paper]
- Naren Sarayu Manoj, Nathan Srebro.
- Key Word: Minimum Description Length; Benign Overfitting; Tempered Overfitting.
Digest
We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.
A modern look at the relationship between sharpness and generalization. [paper]
- Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, Nicolas Flammarion.
- Key Word: Sharpness; Generalization.
Digest
We comprehensively explore whether adaptive sharpness really captures generalization in modern practical settings, through a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup.
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. [paper]
- Hongkang Li, Meng Wang, Sijia Liu, Pin-Yu Chen. ICLR 2023
- Key Word: Vision Transformer; Token Sparsification; Sample Complexity Bound.
Digest
Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error.
Tighter PAC-Bayes Bounds Through Coin-Betting. [paper]
- Kyoungseok Jang, Kwang-Sung Jun, Ilja Kuzborskij, Francesco Orabona.
- Key Word: PAC-Bayes Bounds.
Digest
Recently, the PAC-Bayes framework has been proposed as a better alternative to classical uniform concentration inequalities for estimating quantities such as the generalization error, owing to its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve even tighter guarantees. Our approach is based on the coin-betting framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms.
A unified recipe for deriving (time-uniform) PAC-Bayes bounds. [paper]
- Ben Chugg, Hongjian Wang, Aaditya Ramdas.
- Key Word: PAC-Bayes Bounds.
Digest
We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville’s inequality.
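For readers unfamiliar with tools (c) and (d), their standard statements are given below. The notation ($\pi$ for the prior, $\rho$ for the posterior, $f$ for a measurable function, $M_t$ for the process, $\delta$ for the confidence level) is introduced here and is not taken from the paper.

```latex
% (c) Donsker-Varadhan variational formula, for f with \mathbb{E}_{\theta \sim \pi}[e^{f(\theta)}] < \infty:
\log \mathbb{E}_{\theta \sim \pi}\!\left[ e^{f(\theta)} \right]
  = \sup_{\rho} \Big\{ \mathbb{E}_{\theta \sim \rho}[f(\theta)] - \mathrm{KL}(\rho \,\|\, \pi) \Big\}.

% (d) Ville's inequality, for a nonnegative supermartingale (M_t)_{t \ge 0} and \delta \in (0, 1):
\mathbb{P}\!\left( \exists\, t \ge 0 : M_t \ge \tfrac{1}{\delta} \right) \le \delta\, \mathbb{E}[M_0].
```

Roughly, (c) moves from a fixed prior to an arbitrary data-dependent posterior at the cost of a KL term, and (d) is what makes the resulting bound hold simultaneously over all times.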
The SSL Interplay: Augmentations, Inductive Bias, and Generalization. [paper]
- Vivien Cabannes, Bobak T. Kiani, Randall Balestriero, Yann LeCun, Alberto Bietti.
- Key Word: Self-Supervised Learning; Data Augmentation; Inductive Bias.
Digest
Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.
A Stability Analysis of Fine-Tuning a Pre-Trained Model. [paper]
- Zihao Fu, Anthony Man-Cho So, Nigel Collier.
- Key Word: Fine-Tuning; Stability Analysis.
Digest
We propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure.
Strong inductive biases provably prevent harmless interpolation. [paper] [code]
- Michael Aerni, Marco Milanta, Konstantin Donhauser, Fanny Yang.
- Key Word: Benign Overfitting; Inductive Bias.
Digest
This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator’s inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well.
PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. [paper] [code]
- Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson.
- Key Word: PAC-Bayes; Model Compression.
Digest
We develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning.
Instance-Dependent Generalization Bounds via Optimal Transport. [paper]
- Songyan Hou, Parnian Kassraie, Anastasis Kratsios, Jonas Rothfuss, Andreas Krause.
- Key Word: Generalization Bounds; Optimal Transport; Distribution Shifts.
Digest
We propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
How Does Sharpness-Aware Minimization Minimize Sharpness? [paper]
- Kaiyue Wen, Tengyu Ma, Zhiyuan Li.
- Key Word: Sharpness-Aware Minimization.
Digest
This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes a third notion of sharpness, the one used for proving SAM’s generalization guarantees, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of Hessian when SAM is applied.
Augmentation Invariant Manifold Learning. [paper]
- Shulei Wang.
- Key Word: Manifold Learning; Data Augmentation.
Digest
We develop a statistical framework on a low-dimension product manifold to theoretically understand why the unlabeled augmented data can lead to useful data representation. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations.
The Curious Case of Benign Memorization. [paper]
- Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann.
- Key Word: Memorization; Data Augmentation.
Digest
We show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers.
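Nearest neighbour probing evaluates learned embeddings by classifying each test embedding with the label of its closest training embedding. A minimal sketch, assuming Euclidean distance and a single neighbour (the paper's exact probing setup may differ; all names and data here are hypothetical):

```python
import numpy as np

def nearest_neighbour_probe(train_emb, train_labels, test_emb):
    """Predict each test label as the label of the closest training embedding."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(axis=-1)
    return train_labels[np.argmin(d2, axis=1)]

train_emb = np.array([[0.0, 0.0], [10.0, 10.0]])
train_labels = np.array([0, 1])            # true labels, used only for probing
test_emb = np.array([[1.0, 1.0], [9.0, 8.0]])
predicted = nearest_neighbour_probe(train_emb, train_labels, test_emb)
```

The embeddings would come from a network trained on random labels, while the probe itself uses the true labels, so probe accuracy above chance indicates that useful features were learned despite the memorization task.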
Symmetries, flat minima, and the conserved quantities of gradient flow. [paper]
- Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, Nima Dehmamy.
- Key Word: Conserved quantities; Mode Connectivity; Flat Minima; Parameter Space Symmetry.
Digest
The paper presents a general framework that identifies continuous symmetries in the parameter space of deep neural networks, which create low-loss valleys and connect local minima. The framework utilizes equivariances of activation functions and introduces nonlinear, data-dependent symmetries for nonlinear neural networks. The authors demonstrate that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. Additionally, they relate these conserved quantities to convergence rate and sharpness of the minimum, shedding light on the limitations of gradient flow exploration.
Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup. [paper]
- Muthu Chidambaram, Xiang Wang, Chenwei Wu, Rong Ge.
- Key Word: Mixup; Feature Learning.
Digest
We try to explain some of the success of Mixup from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class.
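As a rough illustration of the "specific instantiation of Mixup" mentioned in this digest, the sketch below mixes two examples with a fixed coefficient of 1/2, which is our reading of "midpoint" in the paper title. The helper name `midpoint_mixup` and the toy vectors are hypothetical and are not taken from the paper.

```python
def midpoint_mixup(x_a, y_a, x_b, y_b):
    """Average two inputs and their one-hot labels with a fixed weight of 1/2.

    Assumption: "midpoint" means the Mixup coefficient is always 0.5 instead of
    being sampled from a Beta distribution as in standard Mixup.
    """
    x_mix = [0.5 * a + 0.5 * b for a, b in zip(x_a, x_b)]
    y_mix = [0.5 * a + 0.5 * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


# Toy example: two 3-dimensional inputs from two different classes.
x_mix, y_mix = midpoint_mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                              [3.0, 4.0, 0.0], [0.0, 1.0])
```

The mixed label puts equal mass on both classes, so a model fit to it cannot rely on a feature of one class alone.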
A PAC-Bayesian Generalization Bound for Equivariant Networks. [paper]
- Arash Behboodi, Gabriele Cesa, Taco Cohen. NeurIPS 2022
- Key Word: PAC-Bayes; Equivariant Networks.
Digest
We study how equivariance relates to generalization error utilizing PAC Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and multiplicity and degree of irreducible representations on the generalization error and thereby provide a guideline for selecting them.
Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty. [paper]
- Felix Biggs, Benjamin Guedj.
- Key Word: PAC-Bayes.
Digest
We introduce a modified version of the excess risk, which can be used to obtain tighter, fast-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for [−1,1]-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around 0. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.
How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders. [paper] [code]
- Qi Zhang, Yifei Wang, Yisen Wang. NeurIPS 2022
- Key Word: Masked Autoencoders.
Digest
We propose a theoretical understanding of how masking matters for MAE to learn meaningful features. We establish a close connection between MAE and contrastive learning, which shows that MAE implicitly aligns the mask-induced positive pairs. Built upon this connection, we develop the first downstream guarantees for MAE methods, and analyze the effect of mask ratio. Besides, as a result of the implicit alignment, we also point out the dimensional collapse issue of MAE, and propose a Uniformity-enhanced MAE (U-MAE) loss that can effectively address this issue and bring significant improvements on real-world datasets, including CIFAR-10, ImageNet-100, and ImageNet-1K.
On the Importance of Gradient Norm in PAC-Bayesian Bounds. [paper]
- Itai Gat, Yossi Adi, Alexander Schwing, Tamir Hazan. NeurIPS 2022
- Key Word: PAC-Bayes.
Digest
Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform bounds assumptions by using on-average bounded loss and on-average bounded gradient norm assumptions. Following this relaxation, we propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities.
SGD with large step sizes learns sparse features. [paper]
- Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion.
- Key Word: Stochastic Gradient Descent; Sparse Features.
Digest
We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics orthogonal to the bouncing directions that biases it implicitly toward simple predictors.
The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. [paper]
- Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar.
- Key Word: Data Augmentation; Spectral Regularization.
Digest
We develop a new theoretical framework to characterize the impact of a general class of data augmentation (DA) on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression.
Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. [paper]
- Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa.
- Key Word: Gradient Regularization; Implicit Bias.
Digest
We first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of gradient regularization (GR). In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with the finite-difference GR chooses better minima as the ascent step size becomes larger.
The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima. [paper]
- Peter L. Bartlett, Philip M. Long, Olivier Bousquet.
- Key Word: Sharpness-Aware Minimization.
Digest
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence.
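The oscillation described in this digest can be seen in a one-dimensional toy case. The sketch below is not the authors' setup: it assumes the objective f(x) = 0.5 * curvature * x**2, a normalized ascent step of radius `rho`, and hypothetical values for the step size and radius. Under these assumptions the iterates settle into a two-point cycle at plus and minus `lr * curvature * rho / (2 - lr * curvature)`.

```python
import math


def sam_step(x, lr, rho, curvature):
    """One SAM step on f(x) = 0.5 * curvature * x**2 with a normalized ascent step."""
    grad = curvature * x
    if grad == 0.0:
        return x  # ascent direction is undefined exactly at the minimum; stay put
    x_adv = x + rho * math.copysign(1.0, grad)  # move distance rho uphill
    return x - lr * curvature * x_adv           # descend using the gradient at x_adv


def run_sam(x0, lr=0.1, rho=0.5, curvature=1.0, steps=500):
    xs = [x0]
    for _ in range(steps):
        xs.append(sam_step(xs[-1], lr, rho, curvature))
    return xs


trajectory = run_sam(1.0)
# Radius of the limiting two-point cycle for the default (hypothetical) settings.
cycle_radius = 0.1 * 1.0 * 0.5 / (2.0 - 0.1 * 1.0)
```

In this toy case the last iterates alternate in sign around the minimum at 0 rather than converging to it, which matches the "either side of the minimum" behaviour the digest reports for the direction of largest curvature.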
SAM as an Optimal Relaxation of Bayes. [paper]
- Thomas Möllenhoff, Mohammad Emtiyaz Khan.
- Key Word: Sharpness-Aware Minimization; Bayesian Methods.
Digest
Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
Understanding Influence Functions and Datamodels via Harmonic Analysis. [paper]
- Nikunj Saunshi, Arushi Gupta, Mark Braverman, Sanjeev Arora.
- Key Word: Influence Functions; Harmonic Analysis.
Digest
The current paper seeks to provide a better theoretical understanding of the empirical phenomena surrounding influence functions and datamodels. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.
Plateau in Monotonic Linear Interpolation — A “Biased” View of Loss Landscape for Deep Networks. [paper]
- Xiang Wang, Annie N. Wang, Mo Zhou, Rong Ge.
- Key Word: Monotonic Linear Interpolation; Loss Landscapes.
Digest
We show that the monotonic linear interpolation (MLI) property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain).
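For readers unfamiliar with the interpolation experiment itself, here is a minimal sketch of how a loss curve along the straight line between two parameter vectors is computed. The convex `toy_loss` and the endpoint values are hypothetical stand-ins; on a real network the endpoints would be the initial and trained parameters, and the curve need not be monotone.

```python
def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line between two parameter vectors, alpha in [0, 1]."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def toy_loss(theta):
    # Hypothetical convex stand-in for a training loss, minimized at (1, -2).
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2


def is_monotone_decreasing(values):
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


alphas = [i / 10 for i in range(11)]
losses = [toy_loss(interpolate([4.0, 1.0], [1.0, -2.0], a)) for a in alphas]
```

A plateau of the kind discussed in the digest would show up as a long run of nearly equal values in `losses` before the final drop.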
Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability. [paper]
- Alex Damian, Eshaan Nichani, Jason D. Lee.
- Key Word: Implicit Bias; Edge of Stability.
Digest
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is “stable” and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored.
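The 2/η threshold is easiest to see on a one-dimensional quadratic, where the sharpness is a constant. The sketch below is only that classical picture, with a hypothetical sharpness value; it does not reproduce progressive sharpening or the cubic-term mechanism studied in the paper, both of which need a loss whose curvature changes along the trajectory.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the losses."""
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        x = x - lr * sharpness * x  # each step multiplies x by (1 - lr * sharpness)
    return losses


sharpness = 10.0                 # hypothetical constant curvature S
threshold = 2.0 / sharpness      # "S < 2/η" rewritten as "η < 2/S"
stable = gd_on_quadratic(sharpness, lr=0.9 * threshold)
unstable = gd_on_quadratic(sharpness, lr=1.1 * threshold)
```

Just below the threshold the loss shrinks by a constant factor per step; just above it the loss grows by a constant factor per step.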
Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions. [paper]
- Arthur Jacot.
- Key Word: Non-Linear Rank; Implicit Bias.
Digest
We show that the representation cost of fully connected neural networks with homogeneous nonlinearities - which describes the implicit bias in function space of networks with L2-regularization or with losses such as the cross-entropy - converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the “true” rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths which grows with the number of datapoints where the true rank is recovered.
Scaling Laws For Deep Learning Based Image Reconstruction. [paper]
- Tobit Klug, Reinhard Heckel.
- Key Word: Scaling Laws; Inverse Problems.
Digest
We study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution and empirically determine the reconstruction quality as a function of training set size, while optimally scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Interpolating those scaling laws suggests that even training on millions of images would not significantly improve performance.
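A power-law scaling curve of the kind mentioned in this digest is usually estimated by a straight-line fit in log-log space. The sketch below assumes a generic error metric of the form a * N**(-b) and uses synthetic numbers generated from that formula; neither the metric nor the values come from the paper, which reports reconstruction quality on real imaging tasks.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    return math.exp(y_mean - slope * x_mean), -slope


# Synthetic curve (not data from the paper): error = 3 * N ** -0.25.
sizes = [100, 1_000, 10_000, 100_000]
errors = [3.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

Fitting the exponent separately on small and on moderate training set sizes is one simple way to check whether the scaling slows down, as the digest reports.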
Why neural networks find simple solutions: the many regularizers of geometric complexity. [paper]
- Benoit Dherin, Michael Munn, Mihaela C. Rosca, David G.T. Barrett. NeurIPS 2022
- Key Word: Regularization; Geometric Complexity; Dirichlet Energy.
Digest
In many contexts, simpler models are preferable to more complex models and the control of this model complexity is the goal for many methods in machine learning such as regularization, hyperparameter tuning and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, which is a measure of the variability of the model function, computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization and the choice of parameter initialization all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
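To make "discrete Dirichlet energy" concrete, the sketch below assumes it means the squared Frobenius norm of the model's input-output Jacobian, averaged over the data points. That reading, the finite-difference approximation, and the helper names are assumptions for illustration rather than the paper's exact definition or code.

```python
def jacobian_sq_norm(f, x, eps=1e-5):
    """Squared Frobenius norm of the Jacobian of f at x, via central differences."""
    total = 0.0
    for j in range(len(x)):
        x_plus = list(x)
        x_plus[j] += eps
        x_minus = list(x)
        x_minus[j] -= eps
        for p, m in zip(f(x_plus), f(x_minus)):
            total += ((p - m) / (2.0 * eps)) ** 2
    return total


def geometric_complexity(f, data):
    """Average of the squared Jacobian norm over the data points (assumed definition)."""
    return sum(jacobian_sq_norm(f, x) for x in data) / len(data)


def linear_model(x):
    # Hypothetical model f(x) = W x with W = [[1, 2], [3, 4]].
    return [1.0 * x[0] + 2.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]


gc_linear = geometric_complexity(linear_model, [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]])
```

For a linear model the Jacobian is the weight matrix at every point, so the value reduces to the squared Frobenius norm of W (here 1 + 4 + 9 + 16 = 30), which is consistent with the digest's claim that parameter norm regularization controls this quantity.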
Variational Inference for Infinitely Deep Neural Networks. [paper]
- Achille Nazaret, David Blei. ICML 2022
- Key Word: Unbounded Depth Neural Networks; Variational Inference.
Digest
We develop a novel variational inference algorithm to approximate the posterior of an unbounded-depth neural network, optimizing a distribution of the neural network weights and of the truncation depth L, and without any upper limit on L. To this end, the variational family has a special structure: it models neural network weights of arbitrary depth, and it dynamically creates or removes free variational parameters as its distribution of the truncation is optimized.
Deep Linear Networks can Benignly Overfit when Shallow Ones Do. [paper]
- Niladri S. Chatterji, Philip M. Long.
- Key Word: Benign Overfitting; Double Descent; Implicit Bias.
Digest
We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum ℓ2-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum ℓ2-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum ℓ2-norm solution.
Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). [paper]
- Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher. NeurIPS 2022
- Key Word: Lazy Training; Neural Tangent Kernel.
Digest
We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness in the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness.
Git Re-Basin: Merging Models modulo Permutation Symmetries. [paper]
- Samuel K. Ainsworth, Jonathan Hayase, Siddhartha Srinivasa.
- Key Word: Mode Connectivity.
Digest
We argue that neural network loss landscapes contain (nearly) a single basin, after accounting for all possible permutation symmetries of hidden units. We introduce three algorithms to permute the units of one model to bring them into alignment with units of a reference model. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100.
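The permutation symmetry this digest builds on can be checked directly on a tiny network: reordering the hidden units, together with the matching columns of the next layer, leaves the function unchanged. The sketch below uses a hypothetical one-hidden-layer ReLU network with made-up weights; it shows only the symmetry, not the paper's three alignment algorithms.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def mlp(x, W1, b1, W2):
    """One-hidden-layer ReLU network: W2 @ relu(W1 @ x + b1)."""
    hidden = relu([h + b for h, b in zip(matvec(W1, x), b1)])
    return matvec(W2, hidden)


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1_p = [W1[i] for i in perm]
    b1_p = [b1[i] for i in perm]
    W2_p = [[row[i] for i in perm] for row in W2]
    return W1_p, b1_p, W2_p


# Hypothetical weights: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, 2.0, -1.0], [0.0, 1.0, 1.0]]
x = [0.7, -0.4]

original_out = mlp(x, W1, b1, W2)
permuted_out = mlp(x, *permute_hidden(W1, b1, W2, [2, 0, 1]))
```

The two weight sets are different points in parameter space that compute the same function, which is why two independently trained models can be brought into alignment by searching over such permutations.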
Normalization effects on deep neural networks. [paper]
- Jiahui Yu, Konstantinos Spiliopoulos.
- Key Word: Normalization.
Digest
We find that in terms of variance of the neural network’s output and test accuracy the best choice is to choose the γi’s to be equal to one, which is the mean-field scaling. We also find that this is particularly true for the outer layer, in that the neural network’s behavior is more sensitive in the scaling of the outer layer as opposed to the scaling of the inner layers. The mechanism for the mathematical analysis is an asymptotic expansion for the neural network’s output.
Do Quantum Circuit Born Machines Generalize? [paper]
- Kaitlin Gili, Mohamed Hibat-Allah, Marta Mauri, Chris Ballance, Alejandro Perdomo-Ortiz.
- Key Word: Quantum Machine Learning; Quantum Circuit Born Machines; Unsupervised Generative Models.
Digest
There has been little understanding of a model’s generalization performance and the relation between such capability and the resource requirements, e.g., the circuit depth and the amount of training data. In this work, we leverage a recently proposed generalization evaluation framework to begin addressing this knowledge gap. We first investigate the QCBM’s learning process of a cardinality-constrained distribution and see an increase in generalization performance while increasing the circuit depth.
Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. [paper]
- Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran.
- Key Word: Overfitting; Kernel Regression.
Digest
The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study.
Towards understanding how momentum improves generalization in deep learning. [paper]
- Samy Jelassi, Yuanzhi Li. ICML 2022
- Key Word: Gradient Descent with Momentum.
Digest
We adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized.
Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent. [paper]
- Zhiyuan Li, Tianhao Wang, Jason D. Lee, Sanjeev Arora.
- Key Word: Implicit Bias; Mirror Descent.
Digest
As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related Legendre function.
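A standard one-dimensional instance of this equivalence is the parametrization w = u**2. Gradient flow on u gives dw/dt = -4 * w * L'(w), which is also the mirror flow for a potential whose derivative is (1/4) * log(w). The sketch below compares small-step discretizations of both; the quadratic loss, target, and step size are hypothetical choices, and the example is meant as an illustration of the phenomenon rather than the paper's general construction.

```python
import math


def loss_grad(w, target=2.0):
    # Gradient of the hypothetical loss L(w) = 0.5 * (w - target)**2.
    return w - target


def gd_on_reparam(u0, lr, steps):
    """Gradient descent on u for the parametrization w = u**2; returns the final w."""
    u = u0
    for _ in range(steps):
        u -= lr * loss_grad(u * u) * 2.0 * u  # chain rule: dL/du = L'(w) * 2u
    return u * u


def mirror_descent(w0, lr, steps):
    """Exponentiated-gradient update, a discretization of dw/dt = -4 * w * L'(w)."""
    w = w0
    for _ in range(steps):
        w *= math.exp(-4.0 * lr * loss_grad(w))
    return w


w_gd = gd_on_reparam(u0=1.0, lr=1e-3, steps=100)
w_md = mirror_descent(w0=1.0, lr=1e-3, steps=100)
# Closed-form solution of dw/dt = -4 * w * (w - 2) from w(0) = 1 at time t = 0.1.
w_flow = 2.0 / (1.0 + math.exp(-0.8))
```

With a small step size both discrete trajectories stay close to the shared continuous-time flow, so `w_gd`, `w_md`, and `w_flow` agree up to discretization error.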
A law of adversarial risk, interpolation, and label noise. [paper]
- Daniel Paleka, Amartya Sanyal. ICLR 2023
- Key Word: Benign Overfitting; Adversarial Robustness.
Digest
We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm.
Integral Probability Metrics PAC-Bayes Bounds. [paper]
- Ron Amit, Baruch Epstein, Shay Moran, Ron Meir. NeurIPS 2022
- Key Word: PAC-Bayes Bound.
Digest
We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and improved bounds in favorable cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
Robustness Implies Generalization via Data-Dependent Generalization Bounds. [paper]
- Kenji Kawaguchi, Zhun Deng, Kyle Luh, Jiaoyang Huang. ICML 2022
- Key Word: Algorithmic Robustness Bound.
Digest
This paper proves that robustness implies generalization via data-dependent generalization bounds. As a result, robustness and generalization are shown to be connected closely in a data-dependent manner. Our bounds improve previous bounds in two directions, to solve an open problem that has seen little development since 2010. The first is to reduce the dependence on the covering number. The second is to remove the dependence on the hypothesis space. We present several examples, including ones for lasso and deep learning, in which our bounds are provably preferable.
Learning sparse features can lead to overfitting in neural networks. [paper] [code]
- Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, Matthieu Wyart.
- Key Word: Sparse Representation; Neural Tangent Kernel.
Digest
It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images.
Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. [paper]
- Jiachun Pan, Pan Zhou, Shuicheng Yan.
- Key Word: Mask-Reconstruction Pretraining; Self-Supervision.
Digest
Supervised fine-tuning of an encoder pretrained with mask-reconstruction pretraining (MRP) remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To answer these questions, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task.
Why do CNNs Learn Consistent Representations in their First Layer Independent of Labels and Architecture? [paper]
- Rhea Chowers, Yair Weiss.
- Key Word: Architecture Inductive Bias.
Digest
It has previously been observed that the filters learned in the first layer of a CNN are qualitatively similar for different networks and tasks. We extend this finding and show a high quantitative similarity between filters learned by different networks. We consider the CNN filters as a filter bank and measure the sensitivity of the filter bank to different frequencies. We show that the sensitivity profile of different networks is almost identical, yet far from initialization. Remarkably, we show that it remains the same even when the network is trained with random labels. To understand this effect, we derive an analytic formula for the sensitivity of the filters in the first layer of a linear CNN. We prove that when the average patch in images of the two classes is identical, the sensitivity profile of the filters in the first layer will be identical in expectation when using the true labels or random labels and will only depend on the second-order statistics of image patches.
A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features. [paper]
- Zhenmei Shi, Junyi Wei, Yingyu Liang. ICLR 2022
- Key Word: Linearization of Neural Networks; Neural Tangent Kernel.
Digest
To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution).
Realistic Deep Learning May Not Fit Benignly. [paper]
- Kaiyue Wen, Jiaye Teng, Jingzhao Zhang.
- Key Word: Benign Overfitting.
Digest
We examine the benign overfitting phenomenon in real-world settings. We find that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points.
A Model of One-Shot Generalization. [paper]
- Thomas Laurent, James H. von Brecht, Xavier Bresson.
- Key Word: One-Shot Generalization; PAC Learning; Neural Tangent Kernel.
Digest
We provide a theoretical framework to study a phenomenon that we call one-shot generalization. This phenomenon refers to the ability of an algorithm to perform transfer learning within a single task, meaning that it correctly classifies a test point that has a single exemplar in the training set. We propose a simple data model and use it to study this phenomenon in two ways. First, we prove a non-asymptotic baseline: kernel methods based on nearest-neighbor classification cannot perform one-shot generalization, independently of the choice of the kernel and the size of the training set. Second, we empirically show that the most direct neural network architecture for our data model performs one-shot generalization almost perfectly. This stark differential leads us to believe that the one-shot generalization mechanism is partially responsible for the empirical success of neural networks.
Empirical Evaluation and Theoretical Analysis for Representation Learning: A Survey. [paper]
- Kento Nozawa, Issei Sato. IJCAI 2022
- Key Word: Representation Learning; Pre-training; Regularization.
Digest
Representation learning enables us to automatically extract generic feature representations from a dataset to solve another machine learning task. Recently, feature representations extracted by a representation learning algorithm, combined with a simple predictor, have exhibited state-of-the-art performance on several machine learning tasks. Despite this remarkable progress, there exist various ways to evaluate representation learning algorithms depending on the application because of the flexibility of representation learning. To understand the current state of representation learning, we review evaluation methods of representation learning algorithms and theoretical analyses.
The Effects of Regularization and Data Augmentation are Class Dependent. [paper]
- Randall Balestriero, Leon Bottou, Yann LeCun. NeurIPS 2022
- Key Word: Data Augmentation.
Digest
We demonstrate that techniques such as data augmentation (DA) or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes; for example, on ImageNet with a ResNet-50, the “barn spider” classification test accuracy falls from 68% to 46% only by introducing random crop DA during training. Even more surprisingly, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay.
Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum. [paper]
- Kirby Banman, Liam Peet-Pare, Nidhi Hegde, Alona Fyshe, Martha White. ICLR 2022
- Key Word: Stochastic Gradient Descent; Covariate Shift.
Digest
We show that SGD with momentum (SGDm) under covariate shift with a fixed step-size can be unstable and diverge. In particular, we show SGDm under covariate shift is a parametric oscillator, and so can suffer from a phenomenon known as resonance. We approximate the learning system as a time varying system of ordinary differential equations, and leverage existing theory to characterize the system’s divergence/convergence as resonant/nonresonant modes.
Data Augmentation as Feature Manipulation. [paper]
- Ruoqi Shen, Sébastien Bubeck, Suriya Gunasekar.
- Key Word: Data Augmentation; Feature Learning.
Digest
In this work we consider another angle, and we study the effect of data augmentation on the dynamic of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard to learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of data augmentation on the learning dynamic for a two layer convolutional neural network in the recently proposed multi-view data model by Allen-Zhu and Li [2020].
How Many Data Are Needed for Robust Learning? [paper]
- Hongyang Zhang, Yihan Wu, Heng Huang.
- Key Word: Robustness.
Digest
In this work, we study the sample complexity of the robust interpolation problem when the data are in a unit ball. We show that both too much and too little data hurt robustness.
A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments. [paper]
- Randall Balestriero, Ishan Misra, Yann LeCun. NeurIPS 2022
- Key Word: Data Augmentation.
Digest
We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model’s output under a given data augmentation (DA) distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge.
Discovering and Explaining the Representation Bottleneck of DNNs. [paper]
- Huiqi Deng, Qihan Ren, Hao Zhang, Quanshi Zhang. ICLR 2022
- Key Word: Representation Bottleneck; Explanation.
Digest
This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and humans, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck.
Generalization in quantum machine learning from few training data. [paper]
- Matthias C. Caro, Hsin-Yuan Huang, M. Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, Patrick J. Coles. Nature Communications
- Key Word: Quantum Machine Learning; Generalization Bounds.
Digest
We provide a comprehensive study of generalization performance in QML after training on a limited number N of training data points. We also show that classification of quantum states across a phase transition with a quantum convolutional neural network requires only a very small training data set. Other potential applications include learning quantum error correcting codes or quantum dynamical simulation. Our work injects new hope into the field of QML, as good generalization is guaranteed from few training data.
The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks. [paper]
- Yizhang Lou, Chris Mingard, Soufiane Hayou.
- Key Word: Implicit Regularization.
Digest
We provide the first explanation for this alignment hierarchy. We introduce and empirically validate the Equilibrium Hypothesis which states that the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.
Understanding Dimensional Collapse in Contrastive Self-supervised Learning. [paper] [code]
- Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. ICLR 2022
- Key Word: Self-Supervision; Contrastive Learning; Implicit Regularization; Dimensional Collapse.
Digest
We show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.
Implicit Sparse Regularization: The Impact of Depth and Early Stopping. [paper] [code]
- Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, Raymond K. W. Wong. NeurIPS 2021
- Key Word: Implicit Regularization.
Digest
In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases.
The Benefits of Implicit Regularization from SGD in Least Squares Problems. [paper]
- Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade. NeurIPS 2021
- Key Word: Implicit Regularization.
Digest
We show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance.
Neural Controlled Differential Equations for Online Prediction Tasks. [paper] [code]
- James Morrill, Patrick Kidger, Lingyi Yang, Terry Lyons.
- Key Word: Ordinary Differential Equations.
Digest
Neural controlled differential equations (Neural CDEs) are state-of-the-art models for irregular time series. However, due to current implementations relying on non-causal interpolation schemes, Neural CDEs cannot currently be used in online prediction tasks; that is, in real-time as data arrives. This is in contrast to similar ODE models such as the ODE-RNN which can already operate in continuous time. Here we introduce and benchmark new interpolation schemes, most notably, rectilinear interpolation, which allows for an online everywhere causal solution to be defined.
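A minimal sketch of the idea behind rectilinear interpolation, assuming a scalar series: the path first advances in time while holding the last observed value, then updates the value at fixed time, so each segment depends only on data seen so far. The function name `rectilinear_knots` and the sample data are hypothetical and not taken from the paper's code.

```python
def rectilinear_knots(times, values):
    """Return the (time, value) knots of a rectilinear path through the observations."""
    knots = [(times[0], values[0])]
    for t_next, v_prev, v_next in zip(times[1:], values[:-1], values[1:]):
        knots.append((t_next, v_prev))   # time advances, value held at the last observation
        knots.append((t_next, v_next))   # value jumps to the new observation
    return knots

# Hypothetical irregularly sampled series with three observations.
knots = rectilinear_knots([0.0, 1.0, 3.0], [5.0, 7.0, 6.0])
```

Because no knot uses a value observed later than its own time stamp, a path built this way can be extended online as new data arrives.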
The Principles of Deep Learning Theory. [paper]
- Daniel A. Roberts, Sho Yaida, Boris Hanin.
- Key Word: Bayesian Learning; Neural Tangent Kernel; Statistical Physics; Information Theory; Residual Learning; Book.
Digest
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics.
Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning. [paper] [code]
- Colin Wei, Sang Michael Xie, Tengyu Ma. NeurIPS 2021
- Key Word: Natural Language Processing; Pre-training; Prompting.
Digest
We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text — the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language.
Differentiable Multiple Shooting Layers. [paper] [code]
- Stefano Massaroli, Michael Poli, Sho Sonoda, Taji Suzuki, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2021
- Key Word: Ordinary Differential Equations.
Digest
We detail a novel class of implicit neural models. Leveraging time-parallel methods for differential equations, Multiple Shooting Layers (MSLs) seek solutions of initial value problems via parallelizable root-finding algorithms. MSLs broadly serve as drop-in replacements for neural ordinary differential equations (Neural ODEs) with improved efficiency in number of function evaluations (NFEs) and wall-clock inference time.
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning. [paper] [code]
- Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal. NeurIPS 2021
- Key Word: Sample-Wise Self-Attention; Meta Learning; Metric Learning.
Digest
We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. [paper]
- Mikhail Belkin.
- Key Word: Interpolation; Over-parameterization.
Digest
In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model.
A Universal Law of Robustness via Isoperimetry. [paper]
- Sébastien Bubeck, Mark Sellke.
- Key Word: Overparameterized Memorization; Lipschitz Neural Network.
Digest
A puzzling phenomenon in deep learning is that models are trained with many more parameters than classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension.
Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks. [paper]
- Hidenori Tanaka, Daniel Kunin.
- Key Word: Geometry of Learning Dynamics; Symmetry Breaking.
Digest
The paper develops a theoretical framework to investigate the “geometry of learning dynamics” in neural networks and uncovers the significance of explicit symmetry breaking in achieving efficiency and stability. It introduces “kinetic symmetry breaking” (KSB) as a condition where the kinetic energy breaks the symmetry of the potential function and applies Noether’s theorem to derive “Noether’s Learning Dynamics” (NLD) as a result.
Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes. [paper]
- James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse.
- Key Word: Monotonic Linear Interpolation; Loss Landscapes.
Digest
Monotonic linear interpolation (MLI) is the property that the training loss decreases monotonically along the straight line from the initial to the trained parameters. We evaluate several hypotheses for this property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network, providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g. network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization.
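To make the MLI check concrete, here is a small sketch that measures the loss along the straight line between initial and trained parameters. It assumes a toy linear regression trained with plain gradient descent, where the loss is convex and the property holds trivially; for a neural network one would swap in the network's loss and parameters. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5)          # noiseless targets from a hypothetical linear model

def loss(theta):
    """Mean squared error of the linear model X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))

# Train with plain gradient descent from a random initialization.
theta_init = rng.normal(size=5)
theta = theta_init.copy()
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta -= 0.05 * grad
theta_final = theta

# Evaluate the loss along the straight line between the two parameter vectors.
alphas = np.linspace(0.0, 1.0, 21)
path_losses = [loss((1 - a) * theta_init + a * theta_final) for a in alphas]

# MLI holds if the loss never increases as alpha goes from 0 to 1.
mli_holds = all(b <= a + 1e-12 for a, b in zip(path_losses, path_losses[1:]))
```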
On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs). [paper]
- Zhiyuan Li, Sadhika Malladi, Sanjeev Arora. NeurIPS 2021
- Key Word: Stochastic Gradient Descent Dynamics; Stochastic Differential Equations.
Digest
The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Ito SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
MALI: A memory efficient and reverse accurate integrator for Neural ODEs. [paper] [code]
- Juntang Zhuang, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan. ICLR 2021
- Key Word: Ordinary Differential Equations.
Digest
Based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost w.r.t. the number of solver steps in integration, similar to the adjoint method, and guarantees accuracy in the reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy.
Understanding the Failure Modes of Out-of-Distribution Generalization. [paper] [code]
- Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur. ICLR 2021
- Key Word: Out-of-Distribution Generalization.
Digest
We identify that spurious correlations during training can induce two distinct skews in the training set, one geometric and another statistical. These skews result in two complementary ways by which empirical risk minimization (ERM) via gradient descent is guaranteed to rely on those spurious correlations.
Deep Networks from the Principle of Rate Reduction. [paper] [code]
- Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, Yi Ma.
- Key Word: Maximal Coding Rate Reduction.
Digest
This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction of learned features naturally leads to a multi-layer deep network, one iteration per layer. The layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer in a forward propagation fashion by emulating the gradient scheme.
Sharpness-Aware Minimization for Efficiently Improving Generalization. [paper] [code]
- Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur. ICLR 2021
- Key Word: Flat Minima.
Digest
In today’s heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently.
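The min-max formulation can be sketched as a two-gradient update: step to an approximate worst-case point inside a small ball, then apply that point's gradient to the original weights. The sketch below assumes an L2 ball of radius `rho`, a first-order approximation of the inner maximization, and a hypothetical quadratic toy loss; it is not the authors' implementation.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM-style update: ascend to a nearby worst-case point, then descend."""
    g = grad_fn(w)
    # First-order approximation of the worst-case perturbation in an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Use the gradient at the perturbed point to update the original weights.
    return w - lr * grad_fn(w + eps)

# Hypothetical toy loss: an anisotropic quadratic 0.5 * w^T A w.
A = np.diag([10.0, 1.0])

def loss_fn(w):
    return float(0.5 * w @ A @ w)

def grad_fn(w):
    return A @ w

w_start = np.array([1.0, 1.0])
w = w_start.copy()
for _ in range(200):
    w = sam_step(w, grad_fn)
final_loss = loss_fn(w)
```

With a constant `rho` the iterates hover in a small neighborhood of the minimum rather than converging exactly, which is expected for this kind of update.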
Implicit Gradient Regularization. [paper]
- David G.T. Barrett, Benoit Dherin. ICLR 2021
- Key Word: Implicit Regularization.
Digest
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations.
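As a sketch of the result of the backward error analysis, gradient descent with learning rate h on a loss E follows, to leading order in h, the gradient flow of a modified loss with an added gradient-norm penalty. The symbols E, h and θ are notation assumed here for illustration; consult the paper for the exact statement and its conditions.

```latex
\tilde{E}(\theta) = E(\theta) + \frac{h}{4}\,\lVert \nabla_\theta E(\theta) \rVert^2
```

The penalty grows with the learning rate, which is consistent with the digest's claim that larger discrete steps bias training toward regions with small loss gradients.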
Neural Rough Differential Equations for Long Time Series. [paper] [code]
- James Morrill, Cristopher Salvi, Patrick Kidger, James Foster, Terry Lyons. ICML 2021
- Key Word: Ordinary Differential Equations.
Digest
Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN. However, as with RNNs, training can quickly become impractical for long time series. Here we use rough path theory to extend this formulation through application of a pre-existing mathematical tool from rough analysis - the log-ODE method - which allows us to take integration steps larger than the discretisation of the data, resulting in significantly faster training times, with retainment (and often even improvements) in model performance.
Optimizing Mode Connectivity via Neuron Alignment. [paper] [code]
- N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, Rongjie Lai. NeurIPS 2020
- Key Word: Mode Connectivity; Neuron Alignment; Adversarial Training.
Digest
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected. To approximate the optimal permutation, we introduce an inexpensive heuristic referred to as neuron alignment. Neuron alignment promotes similarity between the distribution of intermediate activations of models along the curve.
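The weight-permutation symmetry that this framework accounts for can be illustrated on a tiny network: reordering the hidden units, together with the matching columns of the next layer, leaves the function unchanged. This is only a sketch of the symmetry, assuming a two-layer ReLU network with made-up shapes; it does not implement the neuron alignment heuristic itself.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 3))   # hidden layer weights
b1 = rng.normal(size=8)        # hidden layer biases
W2 = rng.normal(size=(2, 8))   # output layer weights
x = rng.normal(size=3)

def net(x, W1, b1, W2):
    """Two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

# Permute the hidden units: rows of W1 and b1, and the matching columns of W2.
perm = rng.permutation(8)
out_original = net(x, W1, b1, W2)
out_permuted = net(x, W1[perm], b1[perm], W2[:, perm])
```

Two trained networks may therefore differ by such a permutation, which is why aligning neurons before connecting them can matter.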
Benign Overfitting and Noisy Features. [paper]
- Zhu Li, Weijie Su, Dino Sejdinovic.
- Key Word: Benign Overfitting; Random Feature Approximation; Deep Double Descent.
Digest
We examine the conditions under which Benign Overfitting occurs in the random feature (RF) models, i.e. in a two-layer neural network with fixed first layer weights. We adopt a new view of random feature and show that benign overfitting arises due to the noise which resides in such features (the noise may already be present in the data and propagate to the features or it may be added by the user to the features directly) and plays an important implicit regularization role in the phenomenon.
Expressivity of Deep Neural Networks. [paper]
- Ingo Gühring, Mones Raslan, Gitta Kutyniok.
- Key Word: Approximation; Expressivity; Function Classes.
Digest
In this review paper, we give a comprehensive overview of the large variety of approximation results for neural networks. Approximation rates for classical function spaces as well as benefits of deep neural networks over shallow ones for specifically structured function classes are discussed. While the main body of existing results is for general feedforward architectures, we also depict approximation results for convolutional, residual and recurrent neural networks.
How benign is benign overfitting? [paper]
- Amartya Sanyal, Puneet K Dokania, Varun Kanade, Philip H.S. Torr. ICLR 2021
- Key Word: Benign Overfitting; Adversarial Robustness.
Digest
We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don’t fit the noise.
On the Theory of Transfer Learning: The Importance of Task Diversity. [paper]
- Nilesh Tripuraneni, Michael I. Jordan, Chi Jin. NeurIPS 2020
- Key Word: Transfer Learning; Task Diversity; Generalization Bound.
Digest
We introduce a problem-agnostic definition of task diversity which can be integrated into a uniform convergence framework to provide generalization bounds for transfer learning problems with general losses, tasks, and features. Our framework puts this notion of diversity together with a common-design assumption across tasks to provide guarantees of a fast convergence rate, decaying with all of the samples for the transfer learning problem.
Neural Controlled Differential Equations for Irregular Time Series. [paper] [code]
- Patrick Kidger, James Morrill, James Foster, Terry Lyons. NeurIPS 2020
- Key Word: Ordinary Differential Equations.
Digest
A fundamental issue is that the solution to an ordinary differential equation is determined by its initial condition, and there is no mechanism for adjusting the trajectory based on subsequent observations. Here, we demonstrate how this may be resolved through the well-understood mathematics of controlled differential equations.
Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime. [paper]
- Niladri S. Chatterji, Philip M. Long. JMLR
- Key Word: Benign Overfitting; Finite-Sample Analysis.
Digest
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
Dissecting Neural ODEs. [paper] [code]
- Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2020
- Key Word: Ordinary Differential Equations.
Digest
Continuous deep learning architectures have recently re-emerged as Neural Ordinary Differential Equations (Neural ODEs). This infinite-depth approach theoretically bridges the gap between deep learning and dynamical systems, offering a novel perspective. However, deciphering the inner working of these models is still an open challenge, as most applications apply them as generic black-box modules. In this work we “open the box”, further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics.
Proving the Lottery Ticket Hypothesis: Pruning is All You Need. [paper]
- Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
Digest
The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.
Relative Flatness and Generalization. [paper] [code]
- Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, Mario Boley. NeurIPS 2021
- Key Word: Relative Flatness; Loss Landscape.
Digest
The paper investigates the connection between flatness, a property of the loss curve, and generalization ability in machine learning models, particularly neural networks, providing insights into the conditions under which this connection holds and introducing a novel relative flatness measure that correlates strongly with generalization and resolves the reparameterization issue.
Deep Learning via Dynamical Systems: An Approximation Perspective. [paper]
- Qianxiao Li, Ting Lin, Zuowei Shen.
- Key Word: Approximation Theory; Controllability.
Digest
We build on the dynamical systems approach to deep learning, where deep residual networks are idealized as continuous-time dynamical systems, from the approximation perspective. In particular, we establish general sufficient conditions for universal approximation using continuous-time deep residual networks, which can also be understood as approximation theories in Lp using flow maps of dynamical systems.
Why bigger is not always better: on finite and infinite neural networks. [paper]
- Laurence Aitchison. ICML 2020
- Key Word: Gradient Dynamics.
Digest
We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network.
Deep Learning Theory Review: An Optimal Control and Dynamical Systems Perspective. [paper] [code]
- Guan-Horng Liu, Evangelos A. Theodorou.
- Key Word: Mean Field Theory.
Digest
We provide one possible way to align existing branches of deep learning theory through the lens of dynamical system and optimal control. By viewing deep neural networks as discrete-time nonlinear dynamical systems, we can analyze how information propagates through layers using mean field theory.
Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. [paper] [code]
- Yuanzhi Li, Colin Wei, Tengyu Ma. NeurIPS 2019
- Key Word: Regularization.
Digest
The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart.
Are deep ResNets provably better than linear predictors? [paper]
- Chulhee Yun, Suvrit Sra, Ali Jadbabaie. NeurIPS 2019
- Key Word: ResNets; Local Minima.
Digest
We investigate whether local minima of the risk function of a deep ResNet are better than linear predictors. We give two motivating examples showing 1) the advantage of ResNets over fully-connected networks, and 2) the difficulty of analyzing deep ResNets.
Benign Overfitting in Linear Regression. [paper]
- Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. PNAS
- Key Word: Benign Overfitting.
Digest
The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
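A small sketch of the objects in this characterization: the minimum norm interpolating solution, computed with the pseudoinverse, and the two effective ranks of a covariance spectrum, taken here as r_k = (sum of eigenvalues beyond the k-th) / (the (k+1)-th eigenvalue) and R_k = (that sum squared) / (sum of squared eigenvalues beyond the k-th). The data are synthetic and the helper name `effective_ranks` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                      # far more parameters than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)              # synthetic labels, for illustration only

# Minimum-norm solution of X @ theta = y, via the Moore-Penrose pseudoinverse.
X_pinv = np.linalg.pinv(X)
theta_min_norm = X_pinv @ y

# Any other interpolator differs by a null-space component, which only adds norm.
null_dir = rng.normal(size=d)
null_dir -= X_pinv @ (X @ null_dir)   # remove the row-space part, keep the null-space part
theta_other = theta_min_norm + null_dir

def effective_ranks(eigvals, k):
    """Return (r_k, R_k) for eigenvalues sorted in decreasing order."""
    tail = eigvals[k:]
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()
```

Both `theta_min_norm` and `theta_other` fit the training labels exactly, but the first has the smaller norm; for an isotropic spectrum such as `np.ones(10)` both effective ranks at k = 0 equal the dimension.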
Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness. [paper]
- Fanny Yang, Zuowen Wang, Christina Heinze-Deml. NeurIPS 2019
- Key Word: Robustness; Regularization.
Digest
This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). Evaluated on these adversarially transformed examples, we demonstrate that adding regularization on top of standard or adversarial training reduces the relative error by 20% for CIFAR10 without increasing the computational cost.
Augmented Neural ODEs. [paper] [code]
- Emilien Dupont, Arnaud Doucet, Yee Whye Teh. NeurIPS 2019
- Key Word: Ordinary Differential Equations.
Digest
We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.
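One way to see the limitation and the fix: trajectories of a one-dimensional ODE cannot cross, so its flow cannot map x to -x, yet with one extra zero-initialized coordinate a simple rotation achieves it. The sketch below assumes a hand-picked (not learned) vector field and a forward Euler integrator; the function names are illustrative.

```python
import numpy as np

def euler_flow(z0, vector_field, t_end, steps=5000):
    """Integrate dz/dt = vector_field(z) from t = 0 to t_end with forward Euler."""
    z = np.array(z0, dtype=float)
    dt = t_end / steps
    for _ in range(steps):
        z = z + dt * vector_field(z)
    return z

def rotation_field(z):
    """Hand-picked 2-D vector field that rotates the plane at unit angular speed."""
    return np.array([-z[1], z[0]])

def augmented_map(x):
    """Append a zero coordinate, flow for time pi, and read off the first coordinate."""
    return euler_flow([x, 0.0], rotation_field, np.pi)[0]

flipped_pos = augmented_map(1.0)    # close to -1
flipped_neg = augmented_map(-1.0)   # close to +1
```

In the augmented plane the two trajectories pass around each other instead of crossing, which is the extra freedom the added dimension provides.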
On the Power and Limitations of Random Features for Understanding Neural Networks. [paper]
- Gilad Yehudai, Ohad Shamir.
- Key Word: Random Features.
Digest
Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks.
Mean Field Analysis of Deep Neural Networks. [paper]
- Justin Sirignano, Konstantinos Spiliopoulos.
- Key Word: Mean Field Theory.
Digest
We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limit procedure is valid for any number of hidden layers and it naturally also describes the limiting behavior of the training loss.
Machine learning meets quantum physics. [paper] [book]
- Sankar Das Sarma, Dong-Ling Deng, Lu-Ming Duan.
- Key Word: Physics-based Machine Learning; Quantum Physics; Quantum Chemistry.
Digest
The marriage of machine learning and quantum physics may give birth to a new research frontier that could transform both.
A Mean Field Theory of Batch Normalization. [paper]
- Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz. ICLR 2019
- Key Word: Mean Field Theory.
Digest
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. [paper] [code]
- Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington. NeurIPS 2019
- Key Word: Mean Field Theory.
Digest
We show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel.
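The linear model in question is the first-order Taylor expansion f_lin(θ) = f(θ0) + ∇f(θ0) · (θ − θ0). A minimal sketch, assuming a two-layer tanh network with 1/sqrt(width) output scaling, a single scalar input, and a small random parameter change standing in for training; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 4096
x = 0.7                                   # a single scalar input, for illustration

def f(params):
    """Two-layer tanh network with 1/sqrt(width) output scaling."""
    w, a = params
    return float(a @ np.tanh(w * x) / np.sqrt(width))

def grad_f(params):
    """Analytic gradient of f with respect to (w, a)."""
    w, a = params
    h = np.tanh(w * x)
    return a * (1.0 - h ** 2) * x / np.sqrt(width), h / np.sqrt(width)

def f_lin(params, params0):
    """First-order Taylor expansion of f around params0."""
    gw, ga = grad_f(params0)
    shift = gw @ (params[0] - params0[0]) + ga @ (params[1] - params0[1])
    return f(params0) + float(shift)

params0 = (rng.normal(size=width), rng.normal(size=width))
# A hypothetical small parameter change, standing in for a few training steps.
params1 = (params0[0] + 0.01 * rng.normal(size=width),
           params0[1] + 0.01 * rng.normal(size=width))

gap = abs(f(params1) - f_lin(params1, params0))
```

The quantity `gap` measures how far the network output departs from its linearization after the parameter change.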
Superposition of many models into one. [paper] [code]
- Brian Cheung, Alex Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen. NeurIPS 2019
- Key Word: Parameter Superposition; Catastrophic Forgetting.
Digest
We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.
On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points. [paper]
- Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan. ICML 2017
- Key Word: Gradient Descent; Saddle Points.
Digest
Traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient: their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.
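The contrast between plain and perturbed gradient descent can be seen on a toy problem. The sketch below is a simplified illustration, not the algorithm as specified in the paper: the test function, the step size, and the `g_thresh`, `radius`, and `cooldown` values are all arbitrary assumptions.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2/2 + y^4/4 - y^2/2.

    The origin is a strict saddle; the minima sit at (0, 1) and (0, -1).
    """
    x, y = p
    return np.array([x, y**3 - y])

def gradient_descent(p0, lr=0.1, steps=2000):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

def perturbed_gradient_descent(p0, lr=0.1, steps=2000, g_thresh=1e-3,
                               radius=1e-2, cooldown=50, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    last_perturbation = -cooldown
    for t in range(steps):
        small_gradient = np.linalg.norm(grad(p)) < g_thresh
        if small_gradient and t - last_perturbation >= cooldown:
            # Add noise drawn uniformly from a small ball around p.
            direction = rng.standard_normal(p.shape)
            direction /= np.linalg.norm(direction)
            p = p + radius * rng.uniform() ** (1.0 / p.size) * direction
            last_perturbation = t
        p = p - lr * grad(p)
    return p

start = (1.0, 0.0)  # on the axis that leads straight into the saddle
gd_end = gradient_descent(start)
pgd_end = perturbed_gradient_descent(start)
print("plain GD:", gd_end, "perturbed GD:", pgd_end)
```

Started on the x-axis, plain GD never leaves it and stalls at the saddle, whereas the occasional perturbation lets the iterate pick up a component along the negative-curvature direction and reach one of the minima.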
Escaping Saddle Points with Adaptive Gradient Methods. [paper]
- Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra. ICML 2019
- Key Word: Gradient Descent; Saddle Points.
Digest
Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. We seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points.
A Spline Theory of Deep Learning. [paper]
- Randall Balestriero, Richard G. Baraniuk. ICML 2018
- Key Word: Approximation Theory.
Digest
We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings.
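As a small illustration of the max-affine spline view in the digest above, a ReLU layer can be written as a maximum over two affine functions per output unit. The function names and array layout below are assumptions made for this sketch, not notation from the paper.

```python
import numpy as np

def maso(x, A, B):
    """Max-affine spline operator: output k is max over r of A[k, r] @ x + B[k, r]."""
    return np.max(A @ x + B, axis=1)

def relu_layer_as_maso(W, b):
    """ReLU(Wx + b) = max(0, Wx + b): two affine pieces per output unit."""
    A = np.stack([np.zeros_like(W), W], axis=1)  # shape (K, 2, D)
    B = np.stack([np.zeros_like(b), b], axis=1)  # shape (K, 2)
    return A, B

rng = np.random.default_rng(0)
W, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
x = rng.standard_normal(3)

A, B = relu_layer_as_maso(W, b)
relu_out = np.maximum(W @ x + b, 0.0)
maso_out = maso(x, A, B)
print(np.allclose(relu_out, maso_out))
```

Stacking such layers gives a composition of operators of this form, which is the structure the digest refers to.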
On Lazy Training in Differentiable Programming. [paper] [code]
- Lenaic Chizat, Edouard Oyallon, Francis Bach. NeurIPS 2019
- Key Word: Lazy Training.
Digest
In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this “lazy training” phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths.
Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. [paper] [code]
- Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf. CVPR 2019
- Key Word: ReLU; Adversarial Example.
Digest
Classifiers used in safety-critical systems should know when they don't know; in particular, they should make low-confidence predictions far away from the training data. We show that ReLU-type neural networks, which yield a piecewise linear classifier function, fail in this regard: they almost always produce high-confidence predictions far away from the training data. For bounded domains like images, we propose a new robust optimization technique, similar to adversarial training, which enforces low-confidence predictions far away from the training data.
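The effect described in the digest above can be observed even on an untrained network. The sketch below assumes a small random ReLU network and an arbitrary input direction (both hypothetical choices) and tracks the maximum softmax probability as the input is scaled away from the origin.

```python
import numpy as np

def relu_net_logits(x, layers):
    """Forward pass of a fully connected ReLU network with a linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W, b = layers[-1]
    return W @ h + b

def max_softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p.max()

rng = np.random.default_rng(0)
sizes = [2, 64, 64, 3]  # input dim, two hidden layers, three classes
layers = [(rng.standard_normal((m, n)) / np.sqrt(n), 0.1 * rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

direction = np.array([1.0, 0.5])
confidences = {alpha: max_softmax(relu_net_logits(alpha * direction, layers))
               for alpha in (1.0, 10.0, 100.0, 1e4)}
for alpha, conf in confidences.items():
    print(alpha, conf)
```

Because the network is piecewise linear, the logits grow linearly along the ray once the input is far enough out, so the softmax saturates. The mitigation proposed in the paper is not shown here.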
Gradient Descent Finds Global Minima of Deep Neural Networks. [paper]
- Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai. ICML 2019
- Key Word: Gradient Descent; Gradient Dynamics.
Digest
Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm.
Memorization in Overparameterized Autoencoders. [paper]
- Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler.
- Key Word: Autoencoders; Memorization.
Digest
We show that overparameterized autoencoders exhibit memorization, a form of inductive bias that constrains the functions learned through the optimization process to concentrate around the training examples, although the network could in principle represent a much larger function class. In particular, we prove that single-layer fully-connected autoencoders project data onto the (nonlinear) span of the training examples.
Information Geometry of Orthogonal Initializations and Training. [paper]
- Piotr A. Sokol, Il Memming Park. ICLR 2020
- Key Word: Mean Field Theory; Information Geometry.
Digest
We show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness) as measured by the Fisher information matrix (FIM) and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks. [paper]
- Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh. ICLR 2019
- Key Word: Gradient Descent; Gradient Dynamics.
Digest
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with m hidden nodes, ReLU activation, and n training data points, we show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function.
Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function. [paper]
- Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej A. Nowak. AISTATS 2019
- Key Word: Mean Field Theory.
Digest
We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit.
Mean Field Analysis of Neural Networks: A Central Limit Theorem. [paper]
- Justin Sirignano, Konstantinos Spiliopoulos.
- Key Word: Mean Field Theory.
Digest
We rigorously prove a central limit theorem for neural network models with a single hidden layer. The central limit theorem is proven in the asymptotic regime of simultaneously (A) large numbers of hidden units and (B) large numbers of stochastic gradient descent training iterations. Our result describes the neural network’s fluctuations around its mean-field limit. The fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation.
An elementary introduction to information geometry. [paper]
- Frank Nielsen.
- Key Word: Survey; Information Geometry.
Digest
In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity.
Deep Convolutional Networks as shallow Gaussian Processes. [paper] [code]
- Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison. ICLR 2019
- Key Word: Gaussian Process.
Digest
We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike “deep kernels”, has very few parameters: only the hyperparameters of the original CNN.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. [paper]
- Yuanzhi Li, Yingyu Liang. NeurIPS 2018
- Key Word: Stochastic Gradient Descent.
Digest
Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels.
Neural Ordinary Differential Equations. [paper] [code]
- Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. NeurIPS 2018
- Key Word: Ordinary Differential Equations; Normalizing Flow.
Digest
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
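The core idea in the digest above, a hidden state defined by an ODE whose right-hand side is a neural network, can be sketched with a fixed-step solver. This is only an illustration: the paper's memory-efficient backpropagation through the solver is not shown, and the names `odeint_rk4` and `mlp_dynamics` are assumptions made for this sketch rather than the API of the released code.

```python
import numpy as np

def odeint_rk4(f, h0, t0, t1, steps, params):
    """Fixed-step RK4 solve of dh/dt = f(h, t, params) from t0 to t1."""
    h, t = np.array(h0, dtype=float), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(h, t, params)
        k2 = f(h + 0.5 * dt * k1, t + 0.5 * dt, params)
        k3 = f(h + 0.5 * dt * k2, t + 0.5 * dt, params)
        k4 = f(h + dt * k3, t + dt, params)
        h = h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += dt
    return h

def mlp_dynamics(h, t, params):
    """A small tanh network that parameterizes the derivative of the hidden state."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ h + b1) + b2

def linear_dynamics(h, t, A):
    """Linear test dynamics dh/dt = A h, used only to sanity-check the solver."""
    return A @ h

rng = np.random.default_rng(0)
params = (rng.standard_normal((16, 2)), np.zeros(16),
          0.1 * rng.standard_normal((2, 16)), np.zeros(2))
h1 = odeint_rk4(mlp_dynamics, [1.0, 0.0], 0.0, 1.0, steps=50, params=params)

# Sanity check: a rotation generator turns (1, 0) into (0, 1) at t = pi / 2.
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])
h_rot = odeint_rk4(linear_dynamics, [1.0, 0.0], 0.0, np.pi / 2, steps=100, params=rotation)
print(h1, h_rot)
```

Replacing RK4 with a single Euler step per layer recovers the familiar residual update `h + f(h)`, which is the discrete counterpart of this continuous model.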
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. [paper] [code]
- Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington. ICML 2018
- Key Word: Mean Field Theory.
Digest
We demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix.
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. [paper]
- Ryo Karakida, Shotaro Akaho, Shun-ichi Amari. AISTATS 2019
- Key Word: Mean Field Theory; Fisher Information.
Digest
The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM’s eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value.
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. [paper] [code]
- Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro. ICLR 2019
- Key Word: Over-Parametrization.
Digest
We suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes (within the range reported in the experiments), and could partly explain the improvement in generalization with over-parametrization.
Understanding Generalization and Optimization Performance of Deep CNNs. [paper]
- Pan Zhou, Jiashi Feng. ICML 2018
- Key Word: Generalization of CNNs.
Digest
We make multiple contributions to the theoretical understanding of deep CNNs. To the best of our knowledge, this work presents the first theoretical guarantees on both a generalization error bound without exponential growth over network depth and the optimization performance of deep CNNs.
Geometric Understanding of Deep Learning. [paper]
- Na Lei, Zhongxuan Luo, Shing-Tung Yau, David Xianfeng Gu.
- Key Word: Manifold Representation; Learning Capability; Latent Probability Distribution Control.
Digest
In this work, we give a geometric view of deep learning: we show that the fundamental principle behind its success is the manifold structure of data. Natural high-dimensional data concentrates close to a low-dimensional manifold, and deep learning learns the manifold and the probability distribution on it.
Tropical Geometry of Deep Neural Networks. [paper]
- Liwen Zhang, Gregory Naitzat, Lek-Heng Lim.
- Key Word: Tropical Geometry; Geometric Complexity.
Digest
We establish a novel connection between feedforward neural networks with ReLU activation and tropical geometry: the family of such networks is equivalent to the family of tropical rational maps. This equivalence allows us to characterize these neural networks using zonotopes, relate decision boundaries to tropical hypersurfaces, and establish a correspondence between linear regions and vertices of polytopes associated with tropical rational functions. Our tropical formulation reveals that deeper networks are exponentially more expressive than shallow networks.
Gaussian Process Behaviour in Wide Deep Neural Networks. [paper] [code]
- Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, Zoubin Ghahramani. ICLR 2018
- Key Word: Gaussian Process.
Digest
We study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
How to Start Training: The Effect of Initialization and Architecture. [paper]
- Boris Hanin, David Rolnick. NeurIPS 2018
- Key Word: Neuron Activation; Weight Initialization.
Digest
We identify and study two common failure modes for early training in deep ReLU nets. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided.
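The first failure mode in the digest above is easy to reproduce numerically. The sketch below assumes a bias-free, constant-width, fully connected ReLU network with Gaussian weights (assumptions made for illustration) and compares weight variance 2/fan-in with 1/fan-in by tracking the mean squared activation through depth.

```python
import numpy as np

def length_ratio(var_times_fan_in, depth=20, width=512, seed=0):
    """Ratio of final to initial mean squared activation in a random ReLU net.

    Weights are i.i.d. Gaussian with variance var_times_fan_in / fan_in.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    initial = np.mean(h**2)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(var_times_fan_in / width)
        h = np.maximum(W @ h, 0.0)
    return np.mean(h**2) / initial

ratio_two_over_fan_in = length_ratio(2.0)
ratio_one_over_fan_in = length_ratio(1.0)
print(ratio_two_over_fan_in, ratio_one_over_fan_in)
```

With variance 2/fan-in the activation length stays on the same order across layers, while with 1/fan-in it roughly halves at every layer and vanishes with depth. The residual-network case discussed in the digest is not covered here.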
The Emergence of Spectral Universality in Deep Networks. [paper]
- Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli. AISTATS 2018
- Key Word: Mean Field Theory.
Digest
We leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network’s Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity.
Generalization in Machine Learning via Analytical Learning Theory. [paper] [code]
- Kenji Kawaguchi, Yoshua Bengio, Vikas Verma, Leslie Pack Kaelbling.
- Key Word: Regularization; Measure Theory.
Digest
This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.
Stronger generalization bounds for deep nets via a compression approach. [paper]
- Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang. ICML 2018
- Key Word: PAC-Bayes; Compression-Based Generalization Bound.
Digest
We present a simple compression framework for proving generalization bounds, perhaps a more explicit and intuitive form of the PAC-Bayes work. It also yields elementary short proofs of recent generalization results.
Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients? [paper]
- Boris Hanin. NeurIPS 2018
- Key Word: Network Architectures.
Digest
We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths.
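The architecture-dependent constant in the digest above is simple to compute. The helper below is a hypothetical illustration, and the two example architectures are made up; it only shows how layer widths enter the constant, not the variance bound itself.

```python
def beta(hidden_widths):
    """Sum of the reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in hidden_widths)

# Two networks with the same depth and the same total number of hidden units.
uniform = [100, 100, 100, 100]
bottlenecked = [10, 190, 10, 190]

beta_uniform = beta(uniform)
beta_bottlenecked = beta(bottlenecked)
print(beta_uniform, beta_bottlenecked)
```

Narrow layers dominate the sum, so for a fixed budget of hidden units the constant is smallest when the widths are equal.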
Recommendations of papers that you find interesting and that focus on deep learning phenomena are welcome. You can submit an issue or contact me via [email]. If there are any errors in the paper information, please feel free to correct them.
Formatting (papers are listed in reverse order of their initial submission time to arXiv)