Backpropagation provides a general recipe for overcoming catastrophic forgetting: first-order optimizers such as SGD and Adam are the standard choice for weight updates in continual learning and continual pre-training. In practice, however, access to gradient information is not always granted (the "gradient ban"), for example with black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce ZeroFlow, the first benchmark for evaluating gradient-free optimization algorithms for overcoming forgetting. The benchmark examines a suite of forward-pass methods across multiple optimization methods, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward passes for mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward-pass methods to overcome forgetting. Code will be available upon publication.
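To make the "forward passes alone" claim concrete, here is a minimal sketch of a two-point, forward-only SGD step in the SPSA/MeZO style. The names `zo_sgd_step`, `loss_fn`, and `batch` are placeholders, and `eps`, `lr`, and the seed handling are illustrative rather than the benchmark's actual settings.

```python
import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-4, seed=0):
    """One forward-only (SPSA-style) SGD step, in the spirit of MeZO.

    The random perturbation is regenerated from `seed` instead of being
    stored, so memory stays at inference level. In practice a fresh seed
    is drawn at every step.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Re-create the same Gaussian direction z and move the weights by scale * eps * z.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)   # forward pass 1
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)  # forward pass 2
        perturb(+1.0)                       # restore theta

        # Scalar projection of the gradient onto the direction z.
        grad_proj = (loss_plus - loss_minus) / (2.0 * eps)

        # SGD-style update along the regenerated direction z.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_proj * z)

    return float(loss_plus)
```

Only two forward passes and one scalar are needed per step; no activations or gradients are stored, which is what keeps the memory footprint at inference level.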
In real-world scenarios, gradient information is not always available or computable, a situation often referred to as the "gradient ban." This makes traditional methods for overcoming forgetting infeasible, as backpropagation is restricted or impossible. ZeroFlow leverages zeroth-order (ZO) optimization to tackle catastrophic forgetting under dynamic data flow. By relying only on forward passes, it eliminates the need for backpropagation, providing a cost-effective solution with minimal computational overhead. Its flexibility, through a range of ZO methods, ensures adaptability across different forgetting scenarios and model types.
Figure (optimizer legend): FO-SGD, ZO-SGD (q=1), ZO-SGD (q=4), ZO-SGD-Sign, ZO-SGD-Conserve, FO-Adam, ZO-Adam (q=1), ZO-Adam (q=4), ZO-Adam-Sign, ZO-Adam-Conserve.
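The legend names above are small variations of the same forward-only estimator. The sketch below illustrates one reading of them, where q counts the averaged random directions, the Sign variants keep only the sign of the gradient estimate, and the Conserve variants accept a step only if it does not increase the loss; function names, arguments, and the exact rules are illustrative, not the benchmark's definitions or API.

```python
import numpy as np

def zo_update(theta, loss, q=1, eps=1e-3, lr=1e-4, variant="sgd", rng=None):
    """One forward-only update of a flat parameter vector `theta`.

    variant: "sgd"      - plain ZO-SGD step,
             "sign"     - keep only the sign of the gradient estimate,
             "conserve" - accept the step only if it does not raise the loss
                          (an illustrative reading of the *-Conserve variants).
    """
    if rng is None:
        rng = np.random.default_rng(0)

    grad_est = np.zeros_like(theta)
    for _ in range(q):                      # average q two-point estimates
        z = rng.standard_normal(theta.shape)
        proj = (loss(theta + eps * z) - loss(theta - eps * z)) / (2.0 * eps)
        grad_est += proj * z
    grad_est /= q

    step = -lr * (np.sign(grad_est) if variant == "sign" else grad_est)

    if variant == "conserve" and loss(theta + step) > loss(theta):
        return theta                        # conservative: keep the old weights
    return theta + step

# Toy usage: minimize ||theta - 1||^2 without ever computing a gradient.
rng = np.random.default_rng(0)
theta = np.zeros(8)
for _ in range(200):
    theta = zo_update(theta, lambda w: float(np.sum((w - 1.0) ** 2)),
                      q=4, lr=0.05, variant="sign", rng=rng)
```

Averaging more directions (larger q) reduces the variance of the estimate at the cost of extra forward passes, which is the main trade-off the benchmark probes.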
Avg, Last, and Fgt denote average accuracy, final accuracy, and forgetting, respectively (all in %).

| Strategy | CIFAR-100 Avg | CIFAR-100 Last | CIFAR-100 Fgt | CUB Avg | CUB Last | CUB Fgt | ImageNet-A Avg | ImageNet-A Last | ImageNet-A Fgt | OmniBenchmark Avg | OmniBenchmark Last | OmniBenchmark Fgt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FO | 91.23 | 85.96 | 7.32 | 89.31 | 83.76 | 9.61 | 61.24 | 51.02 | 10.84 | 74.73 | 67.40 | 15.11 |
| ZO | 78.62 | 68.40 | 15.64 | 88.94 | 82.91 | 8.08 | 57.87 | 48.32 | 11.08 | 73.50 | 66.60 | 17.78 |
| Sign | 83.21 | 75.88 | 10.58 | 89.81 | 84.61 | 8.10 | 59.15 | 49.31 | 11.77 | 73.81 | 66.75 | 17.21 |
| Conserve | 82.22 | 75.88 | 8.93 | 89.21 | 83.42 | 10.31 | 58.61 | 48.58 | 12.41 | 77.07 | 70.73 | 14.87 |
| Forward | 82.26 | 76.05 | 8.74 | 89.26 | 83.67 | 9.35 | 57.76 | 48.19 | 11.03 | 77.00 | 70.74 | 14.99 |
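Avg, Last, and Fgt are the usual continual-learning metrics. Below is a minimal sketch of how such metrics are commonly computed from a task-by-task accuracy matrix; the benchmark's exact definitions may differ.

```python
import numpy as np

def cl_metrics(acc):
    """Continual-learning metrics from acc[i, j] = accuracy (in %) on task j
    measured after training on task i (entries with j > i are unused).
    """
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # Avg: mean over stages of the accuracy averaged over tasks seen so far.
    avg = np.mean([acc[i, : i + 1].mean() for i in range(T)])
    # Last: accuracy averaged over all tasks after the final stage.
    last = acc[-1].mean()
    # Fgt: mean drop from each earlier task's best accuracy to its final accuracy.
    fgt = np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)])
    return avg, last, fgt

# Example with two tasks: task 0 drops from 92% to 85% after learning task 1.
avg, last, fgt = cl_metrics([[92.0, 0.0],
                             [85.0, 88.0]])
# avg = 89.25, last = 86.5, fgt = 7.0
```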
@article{feng2025zeroflow,
title={ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think},
author={Feng, Tao and Li, Wei and Zhu, DiDi and Yuan, Hangjie and Zheng, Wendi and Zhang, Dan and Tang, Jie},
journal={arXiv preprint arXiv:2501.01045},
year={2025}
}
This website is adapted from nerfies. We thank the authors of the PILOT repository for their implementation of a pre-trained model-based continual learning toolbox. We also acknowledge the previous efforts in zeroth-order optimization for large language models (LLMs) by MeZO and ZO-LLM, whose work inspired us to evaluate gradient-free optimization algorithms for overcoming catastrophic forgetting.