The Lottery Ticket Hypothesis: Why Pruned Models Can Sometimes Learn Just as Well as Full Networks

Written by yashgupta1427 | Published 2025/10/24

TL;DR: The Lottery Ticket Hypothesis (LTH) proposes that within large neural networks exist smaller subnetworks ("winning tickets") that, when properly initialized and trained, can match or even outperform their full-sized counterparts. This article surveys key research exploring LTH's methods, extensions, and limitations, including its applications in vision, NLP, and reinforcement learning. It examines how pruning, initialization, and iterative retraining contribute to model efficiency, generalization, and theoretical understanding, while also addressing open challenges around scalability, domain transfer, and the deeper mechanisms behind why LTH works.

Pruning in neural networks has seen a great deal of research in recent years. It usually refers to reducing the size of a network by removing weights or parameters. A recent line of work by Frankle et al. [1] introduced the Lottery Ticket Hypothesis (LTH), which provides crucial insights into finding and training these pruned networks, or subnetworks. Since then, a large body of related work has followed. In this survey, we provide a meta-analysis of the literature around LTH: we first briefly discuss the setup and statement of LTH along with its merits and flaws as they stand, and then present interesting related work along several different dimensions.

Introduction

Techniques for pruning neural networks (NNs) have been shown to reduce parameter counts, sometimes by as much as 90%. Pruning can be as simple as multiplying the weights W by a binary mask of 0s and 1s.
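To make the masking idea concrete, here is a minimal sketch in PyTorch (the tensor shape and the 90% pruning rate are illustrative choices, not taken from any particular paper):

    import torch

    torch.manual_seed(0)
    W = torch.randn(256, 128)                  # weight matrix of some hypothetical layer
    threshold = W.abs().quantile(0.9)          # keep only the largest 10% of magnitudes
    mask = (W.abs() >= threshold).float()      # 1 = keep, 0 = prune
    W_pruned = W * mask                        # ~90% of the parameters are now zero

    print(f"sparsity: {(mask == 0).float().mean().item():.2%}")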


Some particular motivations for pruning NNs are as outlined below:

  • The memory requirements for handling and training these pruned models are consequently lower.
  • Considering that the subnetworks require much less computation and storage, deployment on smaller/cheaper devices can be made possible.
  • Pruning can serve as an implicit regularizer: smaller pruned networks often generalize better than their unpruned parents.


One of the seminal papers in this area was [1], which introduced the Lottery Ticket Hypothesis (LTH). It also serves as the foundation for our discussion. The body of work before [1] pointed out that even though subnetworks capable of matching or exceeding their unpruned parents existed, it was unclear if they could be trained from scratch. LTH provided a groundbreaking experimentally supported hypothesis in this direction. The hypothesis is as follows:

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.”
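Roughly, in the notation of [1] (paraphrased): suppose the dense network f(x; θ), with initialization θ = θ_0, reaches test accuracy a within j training iterations, and the masked subnetwork f(x; m ⊙ θ) reaches accuracy a' within j' iterations. The hypothesis asserts that

    \exists\, m \in \{0,1\}^{|\theta|} \;:\quad a' \ge a, \qquad j' \le j, \qquad \|m\|_{0} \ll |\theta|,

i.e., commensurate accuracy, commensurate training time, and far fewer parameters.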


Lottery Ticket Hypothesis - Methodology

We identify and discuss many variations on the core idea of LTH. To set the stage, we first introduce the overall idea through its core methodology, which is as follows:


  1. Randomly initialize a neural network.
  2. Train it for a fixed number of iterations (a hyperparameter) or until convergence.
  3. Prune p% of the parameters according to some masking criterion, i.e., multiply the parameter space by a binary mask m.
  4. Reset the remaining parameters to their original values before training (the values after step 1). These unmasked parameters are called the winning ticket.
  5. Retrain the smaller subnetwork for a comparable number of iterations until convergence, in the same setting as before.


Note: There are several variations of training, masking, etc., as we will soon discuss; even the kind of pruning has several new directions. Apart from the one-shot variant described above, [1] also describes iterative pruning, where the same process is repeated - pruning a fraction of the weights and resetting the remaining weights each round - usually over n rounds. We discuss the pros and cons of this approach versus one-shot pruning in the upcoming sections.
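A minimal sketch of this loop in PyTorch is given below; the model, the training routine, and the pruning fraction are placeholders, and the masking criterion (lowest magnitude among surviving weights, applied per parameter tensor) only loosely mirrors [1]:

    import copy
    import torch

    def find_winning_ticket(model, train_fn, prune_frac=0.2, rounds=5):
        """One-shot (rounds=1) or iterative (rounds>1) magnitude pruning with a
        reset to the original initialization. `train_fn(model)` is assumed to
        train the model in place; a full implementation would also re-apply the
        masks after every optimizer step to keep pruned weights at zero."""
        init_state = copy.deepcopy(model.state_dict())             # step 1: remember theta_0
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

        for _ in range(rounds):
            train_fn(model)                                        # step 2: train to convergence
            for name, param in model.named_parameters():
                alive = param.detach()[masks[name].bool()].abs()   # weights surviving so far
                if alive.numel() == 0:
                    continue
                thresh = alive.quantile(prune_frac)                # step 3: prune lowest-magnitude p%
                masks[name] *= (param.detach().abs() >= thresh).float()
            model.load_state_dict(init_state)                      # step 4: rewind to theta_0
            with torch.no_grad():
                for name, param in model.named_parameters():
                    param *= masks[name]                           # keep only the winning ticket

        train_fn(model)                                            # step 5: retrain the subnetwork
        return model, masks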

The authors of [1] design and perform several experiments on the MNIST and CIFAR-10 datasets to support this idea, and also present an algorithm to identify and train winning tickets. The tickets they find are less than 10-20% of the size of the original networks. These tickets are conjectured to have won the initialization lottery, i.e., their weight initialization makes training on the given dataset particularly effective. They are also found to achieve comparable or higher train and test accuracies, and to train faster, than the unpruned networks.


We briefly summarize the contributions of the paper below:

  1. By pruning, the authors find subnetworks that are able to reach test and train accuracy comparable to the original networks, in a comparable or smaller number of iterations. These subnetworks are termed winning tickets.
  2. To this end, the authors emphasize iterative pruning as a better but more computationally expensive variant of standard (one-shot) pruning.
  3. The winning tickets are found to train faster while achieving better generalization at the same time.
  4. The authors present the Lottery Ticket Hypothesis to encapsulate their findings and present experiments demonstrating the idea. This gives a deeper insight into how neural networks function.


The paper studies the proposed hypothesis only empirically, focusing more on the existence of tickets than on other aspects. This analysis opens up several new research directions; we focus on three key dimensions.

Experimental Setup

Background and Details

We first set up the usual framework adopted by this line of work, and some common findings across the literature.


Fully Connected Networks. The authors of [1] assess the hypothesis on the MNIST dataset using fully connected networks. The networks are randomly initialized, trained to convergence, and then pruned; the pruned connections are set to zero, and the remaining weights are reset to their original initialization values. The subnetworks are then trained again. The network they use is the LeNet-300-100 architecture from [3].


For pruning the networks, the authors use a layer-wise pruning technique: they remove a preset percentage of the lowest-magnitude weights from each layer, with connections to the output layer pruned at half that rate. The paper also focuses on iterative pruning, a slight modification of standard pruning in which we repeatedly train and prune the network over n rounds, keeping the weights that survive each round.
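A rough sketch of this layer-wise criterion on a LeNet-300-100-style fully connected network (PyTorch; the layer names, the pruning rate, and the restriction to nn.Linear layers are assumptions for illustration):

    import torch
    import torch.nn as nn

    def layerwise_masks(model, prune_frac=0.2, output_layer="fc_out"):
        """Prune the lowest-magnitude `prune_frac` of weights in each Linear layer,
        pruning the output layer at half that rate."""
        masks = {}
        for name, module in model.named_modules():
            if not isinstance(module, nn.Linear):
                continue
            frac = prune_frac / 2 if name == output_layer else prune_frac
            w = module.weight.detach().abs()
            masks[name] = (w >= w.flatten().quantile(frac)).float()
        return masks

    # Example: a LeNet-300-100-like network for flattened 28x28 MNIST images.
    model = nn.Sequential()
    model.add_module("fc1", nn.Linear(784, 300))
    model.add_module("fc2", nn.Linear(300, 100))
    model.add_module("fc_out", nn.Linear(100, 10))
    masks = layerwise_masks(model, prune_frac=0.2)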


The paper finds that iterative pruning works better in practice than standard (one-shot) pruning. The observations are as follows:

  1. The winning tickets train faster than the original networks.
  2. A winning ticket comprising 51.3% of the weights from the original network reaches higher test accuracy faster than the original network, but slower than when the fraction is 21.1%. At 3.6%, a winning ticket regresses to the performance of the original network.
  3. Initialization is crucial to the efficacy of a winning ticket. Whereas the winning tickets learn faster as they are pruned further, they learn progressively slower when randomly reinitialized.


Convolutional Networks. The authors apply a similar set of experiments to convolutional networks on CIFAR-10, which increases the size and complexity of the problem. The architectures considered are Conv-2, Conv-4, and Conv-6. The results largely mirror the earlier findings. Some additional insights are:

  1. The training accuracy rises for these subnetworks as well, along with the test accuracy. This raises the question of whether these networks generalize better than the original. The experiments reveal that even when all networks reach 100% training accuracy, the subnetworks finish with higher test accuracy, indicating that they do generalize better.
  2. Test accuracy at early stopping remains steady and improves for Conv-2 and Conv-4. This indicates that at moderate levels of pruning, the structure of the winning tickets alone may lead to better accuracy.


VGG, ResNet Style Networks. In this setting, the authors go one step further and test VGG and ResNet-style deep networks. Here, rather than pruning layer-wise, the authors prune globally, i.e., they prune the bottom quantile of weights over the whole network. This is done because layer sizes vary widely in these deep networks.
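A minimal sketch of global pruning (PyTorch; restricting to weight tensors and using kthvalue to find the global threshold are choices of this sketch, not details from [1]):

    import torch

    def global_mask(model, prune_frac=0.2):
        """Compute a single magnitude threshold over all weight tensors in the
        network and prune the bottom `prune_frac` fraction globally."""
        all_w = torch.cat([p.detach().abs().flatten()
                           for n, p in model.named_parameters() if "weight" in n])
        k = max(1, int(prune_frac * all_w.numel()))
        thresh = all_w.kthvalue(k).values                  # k-th smallest magnitude overall
        return {n: (p.detach().abs() > thresh).float()
                for n, p in model.named_parameters() if "weight" in n}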


The results here are a little more interesting and different from before, namely:


  1. Although winning tickets are found, they are sensitive to the learning rate: at higher learning rates, pruning fails to find winning tickets that outperform the original network, while at lower learning rates the winning tickets emerge.
  2. If trained with a learning-rate warmup, the pruned networks can close the performance gap with the original network even at the higher learning rate (a minimal warmup sketch follows this list).
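Warmup here refers to ramping the learning rate up, typically linearly, from near zero to its target value over the first iterations of training. A minimal, self-contained sketch in PyTorch (the stand-in model, batch, target learning rate, and warmup length are all placeholder choices):

    import torch
    import torch.nn as nn

    model = nn.Linear(32, 10)                                     # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # target lr = 0.1
    warmup_iters = 1_000                                          # placeholder warmup length

    # Linearly scale the learning rate from ~0 up to the target over warmup_iters steps.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_iters))

    for step in range(warmup_iters):
        x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))  # stand-in batch
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()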


Scaling the Lottery Ticket Hypothesis

Late Resetting. From the above discussion of the LTH, it is clear that the hypothesis does not scale too well, especially to deeper models. [5] presents follow-up work in this regard. In the original investigation [1], the weights were reset to their values at iteration 0, and a learning-rate warmup was found to be necessary to scale to larger models. [5] showed that resetting the tickets to their values at iteration k (where k is small relative to the total number of training iterations), called 'late resetting', produced better winning tickets and even removed the need for a learning-rate warmup. Later works such as [6] independently tested this, confirmed the finding, and subsequently used it in all their experiments.
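A minimal sketch of late resetting (the rewind iteration k, loss function, and batch iterator are placeholders; the only change relative to the procedure in [1] is snapshotting the weights at iteration k instead of iteration 0):

    import copy
    import torch

    def train_and_snapshot(model, batches, optimizer, loss_fn, k=1000):
        """Train as usual, but keep a copy of the weights at iteration k so the
        pruned ticket can later be rewound to theta_k ('late resetting', [5])."""
        rewind_state = copy.deepcopy(model.state_dict())          # fallback: theta_0
        for step, (x, y) in enumerate(batches):
            if step == k:
                rewind_state = copy.deepcopy(model.state_dict())  # theta_k
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        return rewind_state

    # After pruning: model.load_state_dict(rewind_state), re-apply the mask, retrain.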


Does this setup generalize? The process of finding these tickets requires many training cycles, especially in the case of iterative pruning. Training models from scratch, pruning, and then training again until convergence can be very computationally expensive, especially if it has to be repeated for every combination of dataset, optimizer, learning rate, and so on. Are there ways to generalize these subnetworks, reuse them, or train them from scratch without initializing from the original network? These questions are critical to the practical utility of winning tickets. For example, if the tickets carry inductive biases that generalize across datasets and training settings, their applicability would increase greatly, since the original networks would rarely be needed to exploit the advantages of these subnetworks. An even more exciting possibility is finding 'global tickets' that are agnostic to the usual hyperparameters of standard deep learning.


In [6], the authors tried to answer some of these questions. The experiments were performed with VGG19 and ResNet50. To evaluate ticket performance, the authors generated tickets in one configuration (the “source”) and tested them in another (the “target”). To test applicability across datasets, they used a variety of datasets, including Fashion-MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365.
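Conceptually, each transfer experiment boils down to reusing the mask (and late-reset weights) found in the source configuration when training on the target; a rough sketch with placeholder names:

    import torch

    def transfer_ticket(model, source_masks, source_rewind_state, train_on_target_fn):
        """Initialize a model with a ticket found on a *source* dataset and then
        train it on a *target* dataset, in the spirit of the experiments in [6]."""
        model.load_state_dict(source_rewind_state)        # late-reset weights from the source run
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in source_masks:
                    param *= source_masks[name]           # apply the source-derived mask
        train_on_target_fn(model)                         # standard training on the target dataset
        return model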


Their results are summarized as follows:


  1. Transfer within the same data distribution. The authors divided CIFAR-10 into two halves (CIFAR-10a and CIFAR-10b) and found that winning tickets generated using CIFAR-10a generalized well to CIFAR-10b for both VGG19 and ResNet50.
  2. Transfer across datasets. The goal of this suite of experiments was to test the LTH across different data distributions within the same domain. They find that tickets that generalize across all the datasets can be found; moreover, tickets generated on larger datasets generalized better than those generated on smaller datasets.
  3. Transfer across optimizers. It is plausible that the tickets overfit to the optimizer used and that a different choice would make the final desirable state unreachable. After testing this, the authors of [6] find that, at least for VGG, the winning tickets do not overfit to a specific optimizer, suggesting that they are optimizer-independent and rely on some other inductive bias to perform so well.


Data

Discussion and Pitfalls. Although the authors of [1] designed and performed several extensive experiments, there is no concrete treatment of whether the lottery ticket hypothesis is tied to deep neural networks used for computer vision or image classification. All the models, i.e., the Conv family, VGG, ResNet, and LeNet, as well as the datasets, i.e., MNIST and CIFAR, belong to this family. Another open question is how critical factors like architecture, task, learning regime, and optimizer artifacts are. In fact, the authors themselves observe that winning tickets are harder to find in larger models, and learning rate and warmup seem to matter as well, for largely unknown reasons. To further cement the effectiveness of the hypothesis and to establish whether it spans architectural and dataset paradigms, experiments across machine learning domains are needed. It would be interesting to see whether tickets survive architectural variations such as the complex gating mechanisms and recurrent dynamics of Natural Language Processing (NLP) models, or the unstable training regimes of Reinforcement Learning (RL).


What about other domains? [7] brings important findings in this regard. The paper investigates how the hypothesis holds up in both Natural Language Processing (NLP) and Reinforcement Learning (RL). For NLP, it examines both LSTMs and large Transformer models; for RL, it experiments with discrete-action-space tasks such as classical control and pixel-based control.


Their findings are outlined as follows:

  1. They confirm that winning initializations are indeed better than random initializations, even in these settings and under extreme pruning rates.
  2. They find tickets for Transformers that allow models as small as one third of the original size to match the performance of the large model.
  3. For RL, the tickets outperformed random initializations on various classical control problems and on some Atari games.

Overall, this indicates that the hypothesis is not restricted to image classification but is a more general phenomenon.


Interesting Directions in Pre-training. [8] examines pre-trained computer vision models through the lens of the lottery ticket hypothesis, investigating whether matching subnetworks can be found in these models that maintain their downstream transfer performance.


After conducting extensive experiments, the researchers find that such matching subnetworks do exist in pre-trained models, at sparsities ranging from 59.04% to 96.48%. These subnetworks transfer well to downstream tasks without degrading performance compared to using the full pre-trained models. They also find that subnetworks from different pre-training methods tend to have different mask structures and perturbation sensitivities.

Methods

Missing Reasoning Ground. The authors of [1] convincingly demonstrate the effectiveness of the lottery ticket hypothesis, but the work opens up several research questions about the underlying mechanisms, which it largely leaves untouched. Why does pruning, especially iterative pruning, work so well? Why is it better than standard pruning? What causes these winning tickets to train faster and achieve better performance than the original networks? More interestingly, how critical is the association between the mask and the weights? Why does reinitializing the network make it perform worse? Several other questions remain, particularly about weight selection: why discard weights based on their magnitudes, and do better criteria exist?


Underlying Mechanisms of the Lottery Ticket Hypothesis. [9] tries to break this reasoning ground and answer the questions posed above. The authors first argue that the tickets clearly perform well, but for unknown reasons. They investigate a few critical components and show that the tickets are largely agnostic to changes in some of these factors, while others, such as setting pruned weights to zero, the signs of the initial weights, and the correspondence between masking and training, matter a great deal.


The findings are summarized below:

  1. Mask criterion. The masking criterion is not set in stone: the paper identifies several masking techniques that work well within the context of the lottery ticket hypothesis.
  2. Kept weights during retraining. Only the signs of these weights seem to matter: if the signs are kept, the networks can match performance.
  3. Pruned weights during retraining. The paper argues that the benefit of pruning these weights comes from the fact that they were moving towards zero during training anyway.
  4. Importance of sign. The paper establishes that the important part of the original initialization is the sign, not the relative magnitude, of the weights (a sketch of this sign-preserving reinitialization follows the list).
  5. The paper also treats masking as a training operation and discovers Supermasks, which produce working networks without any conventional training of the weights.
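A sketch of the sign-preserving reinitialization referenced in point 4 (PyTorch; using each layer's initialization standard deviation as the constant magnitude is an illustrative choice, not necessarily the exact constant used in [9]):

    import torch

    def sign_constant_reinit(model, init_state, masks):
        """Set every surviving weight to sign(original init) * (per-layer constant),
        discarding the original magnitudes entirely."""
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                w0 = init_state[name]                             # original initialization
                const = w0.std()                                  # illustrative per-layer constant
                param.copy_(torch.sign(w0) * const * masks[name])
        return model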


Stronger provable hypotheses exist. [10] builds a theoretical understanding of the lottery ticket hypothesis by proving a stronger statement: for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.
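Written out informally (a paraphrase of the statement above, not the paper's exact theorem): for every bounded input distribution, every target network F with bounded weights, and every ε, δ > 0, a sufficiently over-parameterized network G with random weights contains, with probability at least 1 − δ, a subnetwork G̃ obtained by pruning alone such that

    \sup_{x}\, \bigl|\tilde{G}(x) - F(x)\bigr| \le \varepsilon .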


  1. They first show that pruned networks are indeed very expressive and can reach optimal performance.
  2. Their analysis also shows that, in general, there is no efficient algorithm for pruning the weights of a random network, making pruning as hard as weight optimization: in both cases, solutions exist, but finding them is computationally hard.
  3. This equivalence between optimization and pruning also suggests that heuristics previously used to tune weights can motivate analogous pruning algorithms, in the interest of computational tractability.

Conclusion

In this survey, we studied a recent line of work around the Lottery Ticket Hypothesis. Given its diverse use cases, it is important to thoroughly understand the merits and demerits of the approach. We summarized the original approach through a brief discussion of its methodology.


Then we presented three critical dimensions of this work:


  • Expensive Training. The training and retraining involved in LTH are computationally expensive. We surveyed recent works tackling this problem by attempting to generalize and reuse the winning tickets found.
  • Applicability to Other Domains. LTH was initially studied for image classification and computer vision in a supervised setting. We explored works extending the approach to domains like NLP and RL, and looked into the use of LTH in self-supervised pre-training as well.
  • Theoretical Grounds. [1] offers little reasoning about the inner workings of LTH, so we explored recent work identifying why LTH does and does not work, along with theoretical results and proofs.


Overall, the body of work around LTH is vast, and we have tried to study developments along the above lines. Given the enormous interest in this field, many more questions will arise; future work can examine where these aspects stand and what is needed to expand the use cases of LTH as broadly as possible.

References

[1] Jonathan Frankle and Michael Carbin. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” International Conference on Learning Representations, 2019. (Semantic Scholar: https://www.semanticscholar.org/paper/21937ecd9d66567184b83eca3d3e09eb4e6fbd60)

[2] Cameron Wolfe. “Saga of the Lottery Ticket Hypothesis.” Towards Data Science. (Article)

[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11):2278–2324, 1998. https://doi.org/10.1109/5.726791

[4] Robert Tjarko Lange. “The Lottery Ticket Hypothesis: A Survey.” 2020. https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/

[5] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. “Stabilizing the Lottery Ticket Hypothesis.” arXiv preprint, arXiv:1903.01611, 2019.

[6] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. “One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers.” Advances in Neural Information Processing Systems, 32, 2019.

[7] Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. “Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP.” International Conference on Learning Representations, 2020. (Semantic Scholar: https://www.semanticscholar.org/paper/387e0b95d56e9ecec60a1037ddf7cc57b2851835)

[8] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, and Zhangyang Wang. “The Lottery Tickets Hypothesis for Supervised and Self-Supervised Pre-Training in Computer Vision Models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16306–16316.

[9] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. “Deconstructing lottery tickets: Zeros, signs, and the supermask.” Advances in Neural Information Processing Systems, 32, 2019.

[10] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. “Proving the Lottery Ticket Hypothesis: Pruning is All You Need.” International Conference on Machine Learning, 2020. (Semantic Scholar: https://www.semanticscholar.org/paper/1bcf4553d841ad78cf51b4d3d48a61f9f3c71ebf)

