Backpropagation provides a general recipe for overcoming catastrophic forgetting: first-order optimizers such as SGD and Adam are the standard choice for weight updates in continual learning and continual pre-training. In practice, however, access to gradient information is not always granted (the "gradient ban"), for example with black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce ZeroFlow, the first benchmark for evaluating gradient-free optimization algorithms for overcoming forgetting. The benchmark examines a suite of forward-pass methods across multiple optimization methods, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward passes for mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward-pass methods to overcome forgetting. Code will be available upon publication.
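To make the "forward passes alone" claim concrete, here is a minimal sketch of a two-point, forward-only SGD step in the SPSA/MeZO style. The names `zo_sgd_step`, `loss_fn`, and `batch` are placeholders, and `eps`, `lr`, and the seed handling are illustrative rather than the benchmark's actual settings.

```python
import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-4, seed=0):
    """One forward-only (SPSA-style) SGD step, in the spirit of MeZO.

    The random perturbation is regenerated from `seed` instead of being
    stored, so memory stays at inference level. In practice a fresh seed
    is drawn at every step.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Re-create the same Gaussian direction z and move the weights by scale * eps * z.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)   # forward pass 1
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)  # forward pass 2
        perturb(+1.0)                       # restore theta

        # Scalar projection of the gradient onto the direction z.
        grad_proj = (loss_plus - loss_minus) / (2.0 * eps)

        # SGD-style update along the regenerated direction z.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_proj * z)

    return float(loss_plus)
```

Only two forward passes and one scalar are needed per step; no activations or gradients are stored, which is what keeps the memory footprint at inference level.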
In real-world scenarios, gradient information is not always available or computable, a situation often referred to as the "gradient ban." This makes traditional methods for overcoming forgetting infeasible, as backpropagation is restricted or impossible. ZeroFlow leverages zeroth-order (ZO) optimization to tackle catastrophic forgetting under dynamic data flow. By relying only on forward passes, it eliminates the need for backpropagation, providing a cost-effective solution with minimal computational overhead. Its flexibility, through a range of ZO methods, ensures adaptability across different forgetting scenarios and model types.
Figure (optimizer legend): FO-SGD, ZO-SGD (q=1), ZO-SGD (q=4), ZO-SGD-Sign, ZO-SGD-Conserve, FO-Adam, ZO-Adam (q=1), ZO-Adam (q=4), ZO-Adam-Sign, ZO-Adam-Conserve.
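The legend names above are small variations of the same forward-only estimator. The sketch below illustrates one reading of them, where q counts the averaged random directions, the Sign variants keep only the sign of the gradient estimate, and the Conserve variants accept a step only if it does not increase the loss; function names, arguments, and the exact rules are illustrative, not the benchmark's definitions or API.

```python
import numpy as np

def zo_update(theta, loss, q=1, eps=1e-3, lr=1e-4, variant="sgd", rng=None):
    """One forward-only update of a flat parameter vector `theta`.

    variant: "sgd"      - plain ZO-SGD step,
             "sign"     - keep only the sign of the gradient estimate,
             "conserve" - accept the step only if it does not raise the loss
                          (an illustrative reading of the *-Conserve variants).
    """
    if rng is None:
        rng = np.random.default_rng(0)

    grad_est = np.zeros_like(theta)
    for _ in range(q):                      # average q two-point estimates
        z = rng.standard_normal(theta.shape)
        proj = (loss(theta + eps * z) - loss(theta - eps * z)) / (2.0 * eps)
        grad_est += proj * z
    grad_est /= q

    step = -lr * (np.sign(grad_est) if variant == "sign" else grad_est)

    if variant == "conserve" and loss(theta + step) > loss(theta):
        return theta                        # conservative: keep the old weights
    return theta + step

# Toy usage: minimize ||theta - 1||^2 without ever computing a gradient.
rng = np.random.default_rng(0)
theta = np.zeros(8)
for _ in range(200):
    theta = zo_update(theta, lambda w: float(np.sum((w - 1.0) ** 2)),
                      q=4, lr=0.05, variant="sign", rng=rng)
```

Averaging more directions (larger q) reduces the variance of the estimate at the cost of extra forward passes, which is the main trade-off the benchmark probes.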
Avg, Last, and Fgt denote average accuracy, final accuracy, and forgetting, respectively (all in %).

| Strategy | CIFAR-100 Avg | CIFAR-100 Last | CIFAR-100 Fgt | CUB Avg | CUB Last | CUB Fgt | ImageNet-A Avg | ImageNet-A Last | ImageNet-A Fgt | OmniBenchmark Avg | OmniBenchmark Last | OmniBenchmark Fgt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FO | 91.23 | 85.96 | 7.32 | 89.31 | 83.76 | 9.61 | 61.24 | 51.02 | 10.84 | 74.73 | 67.40 | 15.11 |
| ZO | 78.62 | 68.40 | 15.64 | 88.94 | 82.91 | 8.08 | 57.87 | 48.32 | 11.08 | 73.50 | 66.60 | 17.78 |
| Sign | 83.21 | 75.88 | 10.58 | 89.81 | 84.61 | 8.10 | 59.15 | 49.31 | 11.77 | 73.81 | 66.75 | 17.21 |
| Conserve | 82.22 | 75.88 | 8.93 | 89.21 | 83.42 | 10.31 | 58.61 | 48.58 | 12.41 | 77.07 | 70.73 | 14.87 |
| Forward | 82.26 | 76.05 | 8.74 | 89.26 | 83.67 | 9.35 | 57.76 | 48.19 | 11.03 | 77.00 | 70.74 | 14.99 |
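Avg, Last, and Fgt are the usual continual-learning metrics. Below is a minimal sketch of how such metrics are commonly computed from a task-by-task accuracy matrix; the benchmark's exact definitions may differ.

```python
import numpy as np

def cl_metrics(acc):
    """Continual-learning metrics from acc[i, j] = accuracy (in %) on task j
    measured after training on task i (entries with j > i are unused).
    """
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # Avg: mean over stages of the accuracy averaged over tasks seen so far.
    avg = np.mean([acc[i, : i + 1].mean() for i in range(T)])
    # Last: accuracy averaged over all tasks after the final stage.
    last = acc[-1].mean()
    # Fgt: mean drop from each earlier task's best accuracy to its final accuracy.
    fgt = np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)])
    return avg, last, fgt

# Example with two tasks: task 0 drops from 92% to 85% after learning task 1.
avg, last, fgt = cl_metrics([[92.0, 0.0],
                             [85.0, 88.0]])
# avg = 89.25, last = 86.5, fgt = 7.0
```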
@article{feng2025zeroflow,
title={ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think},
author={Feng, Tao and Li, Wei and Zhu, DiDi and Yuan, Hangjie and Zheng, Wendi and Zhang, Dan and Tang, Jie},
journal={arXiv preprint arXiv:2501.01045},
year={2025}
}
This website is adapted from nerfies. We thank the authors of the PILOT repository for their implementation of a pre-trained model-based continual learning toolbox. We also acknowledge the previous efforts in zeroth-order optimization for large language models (LLMs) by MeZO and ZO-LLM, whose work inspired us to evaluate gradient-free optimization algorithms for overcoming catastrophic forgetting.