THE FUTURE IS HERE

Title: RL’s Razor: Why Online Reinforcement Learning Forgets Less (Sep 2025)
Link: http://arxiv.org/abs/2509.04259v1
Date: September 2025

Summary:
This paper compares fine-tuning models with reinforcement learning (RL) and supervised fine-tuning (SFT). It finds that RL preserves prior knowledge better even when both methods reach similar performance on the new task. The degree of forgetting is predicted by the KL divergence between the fine-tuned and base policies, measured on the new task. Because it trains on on-policy samples, RL is implicitly biased towards KL-minimal solutions among those that solve the new task, a bias SFT lacks. This is validated through experiments with large language models and robotic foundation models. The authors term the principle RL's Razor.
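
The forgetting law above is stated in terms of the KL divergence between the fine-tuned and base policies, evaluated on new-task inputs. As a rough illustration only (not the authors' code), the Python sketch below estimates that quantity for two causal language models by sampling completions from the fine-tuned model on new-task prompts and accumulating the token-level log-probability gap against the base model; the model IDs and prompts are placeholders.

# Minimal sketch: Monte-Carlo estimate of KL(pi_finetuned || pi_base)
# on new-task prompts. Model IDs and prompts are placeholders, not from the paper.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "base-model-id"            # placeholder: checkpoint before fine-tuning
FINETUNED_ID = "finetuned-model-id"  # placeholder: RL- or SFT-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
tuned = AutoModelForCausalLM.from_pretrained(FINETUNED_ID).eval()

@torch.no_grad()
def forward_kl_on_prompt(prompt: str, max_new_tokens: int = 64) -> float:
    # Sample a completion from the fine-tuned policy (on-policy w.r.t. pi_tuned),
    # then sum log pi_tuned(y_t | prefix) - log pi_base(y_t | prefix) over the
    # sampled tokens: a single-sample estimate of the sequence-level KL.
    inputs = tokenizer(prompt, return_tensors="pt")
    full = tuned.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[1]

    def token_logprobs(model):
        logits = model(full).logits[:, :-1, :]       # predict token t+1 from prefix
        logp = F.log_softmax(logits, dim=-1)
        targets = full[:, 1:].unsqueeze(-1)
        return logp.gather(-1, targets).squeeze(-1)  # log-prob of each realized token

    lp_tuned = token_logprobs(tuned)[:, prompt_len - 1:]
    lp_base = token_logprobs(base)[:, prompt_len - 1:]
    return (lp_tuned - lp_base).sum().item()

# Average over a batch of new-task prompts (placeholders) to get the
# policy-level estimate the forgetting law is stated in terms of.
prompts = ["<new-task prompt 1>", "<new-task prompt 2>"]
kl = sum(forward_kl_on_prompt(p) for p in prompts) / len(prompts)
print(f"Estimated KL(fine-tuned || base) on new task: {kl:.3f}")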

Key Topics:
– Reinforcement Learning (RL)
– Supervised Fine-Tuning (SFT)
– Catastrophic Forgetting
– KL Divergence
– Online Learning
– Foundation Models
– Policy Gradient Methods

Chapters:
00:00 – Intro to Catastrophic Forgetting
00:05 – AI Catastrophic Forgetting Problem
00:19 – Catastrophic Forgetting Definition
00:36 – RL’s Razor: The Paper
00:52 – RL vs SFT
01:10 – RL’s Razor Discovery
01:27 – Learning New Tricks
01:38 – Memory Retention
01:53 – Long-lived Adaptable Agents
02:18 – Static vs Adaptable Models
02:38 – Catastrophic Forgetting Explained
03:17 – Focus on SFT and RL
03:42 – Core Empirical Finding
04:05 – Pareto Frontiers
04:46 – Tested on Actual Models
05:25 – Related Skills
05:40 – Why is RL Better?
06:18 – Systematic Approach
06:46 – Empirical Forgetting Law
07:04 – KL Divergence Defined
07:37 – Shifting Perspectives
07:46 – ParityMNIST
08:21 – KL Divergence Connection
09:02 – Oracle SFT
09:39 – RL’s Implicit Tendency
09:58 – Cause or Correlation?
10:16 – Training Objectives
10:39 – Target Outputs
11:06 – Critical Distinctions
11:36 – Negative Feedback Experiments
12:16 – Clear Cut Results
12:46 – On-Policy Sampling
13:25 – Theoretical Justification
14:06 – Landscape Leap
14:45 – Minimal Projection
15:05 – Ruling Things Out
15:35 – Weight Changes
16:14 – Representation
16:58 – Consequence of RL
17:16 – Sparsity or Lower Rank?
17:53 – Alternative Distances
18:37 – New Way to Think
19:07 – New Design Axis
19:35 – Actionable Principle
19:50 – Learning for Life
20:14 – Open Questions
20:47 – Scaling Questions
21:10 – Off-Policy Methods
21:39 – Critical New Perspective
22:00 – Recap: RL’s Razor
22:43 – Provocative Thought