ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think

Tao Feng1, Wei Li1, Didi Zhu2, Hangjie Yuan2, Wendi Zheng1, Dan Zhang1, Jie Tang1
1Tsinghua University, 2DAMO Academy, Alibaba Group

Abstract

Backpropagation provides a generalized configuration for overcoming catastrophic forgetting; for example, SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. In practice, however, permission to access gradient information is not always granted (the "gradient ban"), as with black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce ZeroFlow, the first benchmark for evaluating gradient-free optimization algorithms for overcoming forgetting. The benchmark examines a suite of forward-pass methods across multiple optimization algorithms, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward passes for mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward-pass methods to overcome forgetting. Code will be available upon publication.

Overview

Overview Image

In real-world scenarios, gradient information is not always available or computable, a situation often referred to as the "gradient ban": backpropagation is restricted or infeasible, so traditional methods for overcoming forgetting cannot be applied. ZeroFlow leverages zeroth-order (ZO) optimization to tackle catastrophic forgetting, focusing on dynamic data flow. By relying only on forward passes, it eliminates the need for backpropagation and provides a cost-effective solution with minimal computational overhead. Its flexibility, realized through a range of ZO methods, ensures adaptability across different forgetting scenarios and model types.
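To make the forward-only idea concrete, the sketch below shows a single SPSA-style zeroth-order SGD step in the spirit of MeZO: the loss is evaluated at two perturbed copies of the weights, and their difference estimates the gradient along the perturbation direction. Function and argument names (`zo_sgd_step`, `loss_fn`, `q`, `eps`) are illustrative assumptions, not ZeroFlow's actual API.

```python
import torch

@torch.no_grad()
def zo_sgd_step(model, loss_fn, lr=1e-4, eps=1e-3, q=1):
    """One forward-only (zeroth-order) SGD step, SPSA/MeZO style.
    loss_fn(model) is assumed to return a scalar loss on the current batch."""
    params = [p for p in model.parameters() if p.requires_grad]
    update = [torch.zeros_like(p) for p in params]

    for _ in range(q):  # q random directions; averaging reduces estimator variance
        # Storing z is the simple version; MeZO instead regenerates z from a
        # fixed random seed so it never has to be kept in memory.
        z = [torch.randn_like(p) for p in params]

        for p, zi in zip(params, z):          # theta + eps * z
            p.add_(zi, alpha=eps)
        loss_plus = loss_fn(model).item()

        for p, zi in zip(params, z):          # theta - eps * z
            p.add_(zi, alpha=-2 * eps)
        loss_minus = loss_fn(model).item()

        for p, zi in zip(params, z):          # restore theta
            p.add_(zi, alpha=eps)

        proj_grad = (loss_plus - loss_minus) / (2 * eps)  # directional derivative
        for u, zi in zip(update, z):
            u.add_(zi, alpha=proj_grad / q)

    for p, u in zip(params, update):          # plain SGD with the estimated gradient
        p.add_(u, alpha=-lr)
```

Here `loss_fn` can be a closure over the current batch, e.g. `lambda m: torch.nn.functional.cross_entropy(m(x), y)`; since no backward pass is taken, no activations or per-parameter gradients need to be stored.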

Visual Trajectory

Optimization trajectories are visualized for FO-SGD, ZO-SGD (q=1), ZO-SGD (q=4), ZO-SGD-Sign, ZO-SGD-Conserve, FO-Adam, ZO-Adam (q=1), ZO-Adam (q=4), ZO-Adam-Sign, and ZO-Adam-Conserve.
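In the common ZO-benchmark reading, the Sign and Conserve variants modify how the SPSA estimate from the sketch above is applied; the snippet below spells out that reading and should be taken as an assumption rather than the paper's exact definitions.

```python
import torch

@torch.no_grad()
def apply_zo_variant(params, z, proj_grad, lr, variant="zo-sgd",
                     loss_fn=None, model=None):
    """Apply one update given a perturbation direction z and the SPSA projected
    gradient proj_grad = (L(theta + eps*z) - L(theta - eps*z)) / (2*eps).
    The 'sign' and 'conserve' rules follow common ZO-benchmark conventions
    and are assumptions, not ZeroFlow's official definitions."""
    if variant == "zo-sgd":
        # move against the estimated gradient proj_grad * z
        for p, zi in zip(params, z):
            p.add_(zi, alpha=-lr * proj_grad)
    elif variant == "sign":
        # keep only the sign of each coordinate of the estimated gradient
        for p, zi in zip(params, z):
            p.add_(torch.sign(proj_grad * zi), alpha=-lr)
    elif variant == "conserve":
        # try the step, keep it only if the loss does not increase
        # (requires loss_fn and model to re-evaluate the loss)
        before = loss_fn(model).item()
        for p, zi in zip(params, z):
            p.add_(zi, alpha=-lr * proj_grad)
        if loss_fn(model).item() > before:
            for p, zi in zip(params, z):
                p.add_(zi, alpha=lr * proj_grad)  # roll back
```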

Bench Results

Model: EASE / APER    Optimizer: SGD / Adam

Strategy | CIFAR-100          | CUB                | ImageNet-A         | OmniBenchmark
         | Avg   Last   Fgt   | Avg   Last   Fgt   | Avg   Last   Fgt   | Avg   Last   Fgt
FO       | 91.23 85.96  7.32  | 89.31 83.76  9.61  | 61.24 51.02 10.84  | 74.73 67.40 15.11
ZO       | 78.62 68.40 15.64  | 88.94 82.91  8.08  | 57.87 48.32 11.08  | 73.50 66.60 17.78
Sign     | 83.21 75.88 10.58  | 89.81 84.61  8.10  | 59.15 49.31 11.77  | 73.81 66.75 17.21
Conserve | 82.22 75.88  8.93  | 89.21 83.42 10.31  | 58.61 48.58 12.41  | 77.07 70.73 14.87
Forward  | 82.26 76.05  8.74  | 89.26 83.67  9.35  | 57.76 48.19 11.03  | 77.00 70.74 14.99

Avg and Last denote average and last-stage accuracy (%); Fgt denotes forgetting (%).

Memory Consumption

Memory consumption comparison figures.
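Forward-only optimization is memory-friendly because no activations are cached for a backward pass and no per-parameter gradients are materialized. The snippet below is an illustrative way to compare peak GPU memory of a first-order step against a forward-only probe; the helper names (`peak_memory_mb`, `fo_step`, `zo_probe`) are assumptions, not the measurement code behind the figures above.

```python
import torch
import torch.nn.functional as F

def peak_memory_mb(step_fn, model, batch):
    """Peak GPU memory (MB) consumed by one step_fn call (requires CUDA)."""
    torch.cuda.reset_peak_memory_stats()
    step_fn(model, batch)
    return torch.cuda.max_memory_allocated() / 2**20

def fo_step(model, batch):
    # First-order step: forward + backward, so activations are cached
    # and gradients are materialized for every parameter.
    x, y = batch
    F.cross_entropy(model(x), y).backward()

@torch.no_grad()
def zo_probe(model, batch):
    # Zeroth-order probe: a forward pass only, no activation or gradient storage.
    x, y = batch
    F.cross_entropy(model(x), y)
```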

BibTeX

@article{feng2025zeroflow,
  title={ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think},
  author={Feng, Tao and Li, Wei and Zhu, DiDi and Yuan, Hangjie and Zheng, Wendi and Zhang, Dan and Tang, Jie},
  journal={arXiv preprint arXiv:2501.01045},
  year={2025}
}

Acknowledgement

This website is adapted from nerfies. We thank the authors of the PILOT repository for their implementation of a pre-trained model-based continual learning toolbox. We also acknowledge the previous efforts in zeroth-order optimization for large language models (LLMs) by MeZO and ZO-LLM, whose work inspired us to evaluate gradient-free optimization algorithms for overcoming catastrophic forgetting.