Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), which is used to align models such as ChatGPT, among others. I will start by introducing how Language Models work and what we mean by AI alignment. In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles and explain the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing the baseline) and how Off-Policy learning can make the training tractable.
I will also describe how to build the reward model and explain the loss function of the reward model.
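As a rough illustration of the reward model loss, here is a minimal sketch of the pairwise preference loss described in the InstructGPT paper, -log(sigmoid(r_chosen - r_rejected)). It is written in plain Python rather than PyTorch so that it runs without dependencies; the function name and the toy reward values are assumptions made for this example, not code from the video.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs.

    reward_chosen / reward_rejected: lists of scalar rewards that a
    (hypothetical) reward model assigned to the preferred and the
    non-preferred answer for the same prompt.
    """
    losses = []
    for r_c, r_r in zip(reward_chosen, reward_rejected):
        diff = r_c - r_r
        # -log(sigmoid(diff)) == softplus(-diff), computed in a stable way
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)

# The loss is small when the chosen answer gets the higher reward,
# and large when the ordering is wrong.
good_ordering = pairwise_reward_loss([2.0], [-1.0])
bad_ordering = pairwise_reward_loss([-1.0], [2.0])
```

Minimizing this quantity pushes the reward of the preferred answer above the reward of the rejected one.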
To calculate the gradient of the policy, we need to calculate the log probabilities of the state-action pairs (the trajectories), the value function and the rewards, and the advantage terms (through Generalized Advantage Estimation). I will explain every step visually.
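For readers who prefer code, below is a minimal sketch of Generalized Advantage Estimation over a single finished trajectory, following the recursion from the GAE paper: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and A_t = delta_t + gamma * lambda * A_{t+1}. The function name, the default hyperparameters, and the assumption that the value after the last step is zero are choices made for this example only.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory, computed backwards in time.

    rewards[t]: reward received at step t
    values[t]:  value estimate V(s_t) from the value head
    Assumes the episode ends after the last step, so V after it is 0.
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages

# Toy trajectory: a single reward at the final step.
example = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0)
```

With lam=0 each advantage reduces to the one-step TD error, while lam=1 gives the full reward-to-go minus the value estimate; values in between trade bias for variance.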
After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization algorithm and its loss function, explaining all the details, including the loss of the value head and the entropy.
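To make the clipped objective concrete, here is a minimal sketch of the policy term of the PPO loss from the PPO paper: the probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps] and the more pessimistic of the clipped and unclipped terms is kept. It is plain Python rather than PyTorch, covers only the policy term (not the value head loss or the entropy), and the function name and eps=0.2 default are assumptions for this example.

```python
import math

def ppo_clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    logp_new[i]: log probability of action i under the policy being trained
    logp_old[i]: log probability of the same action under the sampling policy
    advantages[i]: advantage estimate for that action
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Keep the more pessimistic of the two surrogate objectives
        terms.append(min(ratio * adv, clipped_ratio * adv))
    # Negated because optimizers minimize, while the objective is maximized
    return -sum(terms) / len(terms)

# If the new policy doubles the probability of an action with positive
# advantage, the clipped ratio caps how much credit it can get.
loss = ppo_clipped_policy_loss([math.log(0.5)], [math.log(0.25)], [1.0])
```

The clipping removes the incentive to move the new policy far away from the policy that generated the trajectories in a single update.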
In the last part of the video, I go through the implementation of RLHF/PPO, explaining the entire process line by line.

For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Slides PDF and commented code:

00:00:00 – Introduction
00:03:52 – Intro to Language Models
00:05:53 – AI Alignment
00:06:48 – Intro to RL
00:09:44 – RL for Language Models
00:11:01 – Reward model
00:20:39 – Trajectories (RL)
00:29:33 – Trajectories (Language Models)
00:31:29 – Policy Gradient Optimization
00:41:36 – REINFORCE algorithm
00:44:08 – REINFORCE algorithm (Language Models)
00:45:15 – Calculating the log probabilities
00:49:15 – Calculating the rewards
00:50:42 – Problems with Policy Gradient Optimization: variance
00:56:00 – Rewards to go
00:59:19 – Baseline
01:02:49 – Value function estimation
01:04:30 – Advantage function
01:10:54 – Generalized Advantage Estimation
01:19:50 – Advantage function (Language Models)
01:21:59 – Problems with Policy Gradient Optimization: sampling
01:24:08 – Importance Sampling
01:27:56 – Off-Policy Learning
01:33:02 – Proximal Policy Optimization (loss)
01:40:59 – Reward hacking (KL divergence)
01:43:56 – Code walkthrough
02:13:26 – Conclusion