How AI Learns to Reason with Reinforcement Learning

Reinforcement learning algorithms are the key driving force behind training reasoning LLMs (e.g., DeepSeek-R1, Google’s Gemini Pro, OpenAI’s o1/o3).

This video gives an overview of the key ideas behind these reinforcement learning algorithms, tracing their development from REINFORCE through value function estimation, actor-critic methods, Generalized Advantage Estimation, TRPO, and PPO to GRPO.

00:00 Introduction
00:43 Notation
02:41 Policy gradient
05:11 Decomposing trajectory into states and actions
07:05 Baseline subtraction
07:58 Value function estimation
08:31 Advantage estimation
11:11 Actor-critic methods
12:16 Trust region policy optimization
16:48 Proximal policy optimization
19:55 Group Relative Policy Optimization
21:58 Dr. GRPO

=== Resources ===
Three excellent resources I found particularly useful (if you are interested in learning more).
– Foundations of Deep RL (6-lecture series by Pieter Abbeel): https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9RdmB3Z9N5-0IlY0
– DeepMind x UCL | Introduction to Reinforcement Learning by David Silver: https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
– Reinforcement Learning: An Introduction: http://www.incompleteideas.net/book/the-book-2nd.html

=== References ===
– REINFORCE: https://link.springer.com/content/pdf/10.1007/BF00992696.pdf
– Actor-critic: https://arxiv.org/abs/1602.01783
– GAE: https://arxiv.org/abs/1506.02438
– TRPO: https://arxiv.org/abs/1502.05477
– PPO: https://arxiv.org/abs/1707.06347
– GRPO: https://arxiv.org/pdf/2402.03300
– DeepSeek-R1: https://arxiv.org/abs/2501.12948
– Dr. GRPO: https://arxiv.org/abs/2503.20783