Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), which is used to align models such as ChatGPT. I will start by introducing how Language Models work and what we mean by AI alignment. In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles and explain the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing a baseline) and how Off-Policy learning can make the training tractable.
I will also describe how to build the reward model and explain its loss function.
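
As a rough illustration of the reward model loss, here is a minimal sketch of the pairwise preference loss described in the InstructGPT paper, -log(sigmoid(r_chosen - r_rejected)). It uses plain Python on scalar rewards rather than the PyTorch code from the video, and the function name `pairwise_reward_loss` is my own, chosen only for this example.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    answer higher than the rejected one, and large otherwise.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(x)) equals log(1 + exp(-x)); the two branches avoid
    # overflow in exp() for large positive or negative differences.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical scores for one (chosen, rejected) answer pair.
print(pairwise_reward_loss(2.0, 0.0))  # preferred answer scored higher: small loss
print(pairwise_reward_loss(0.0, 2.0))  # preferred answer scored lower: larger loss
```

In practice this loss would be averaged over a batch of comparison pairs, with the rewards produced by the reward model itself.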
To calculate the gradient of the policy, we need the log probabilities of the state-action pairs in the trajectories, the rewards, the value function, and the advantage terms (computed through Generalized Advantage Estimation). I will explain every step visually.
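
To make the advantage computation concrete, here is a minimal sketch of Generalized Advantage Estimation for a single trajectory, following the recursion in the GAE paper: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and A_t = delta_t + gamma * lambda * A_{t+1}. It is plain Python, not the PyTorch code from the video. The function name `compute_gae`, the default values of `gamma` and `lam`, and the assumption that the value after the last step is zero are mine.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards[t] is the reward received at step t and values[t] is the value
    estimate V(s_t). Assumption: the episode ends after the last step, so
    the value of the state that follows it is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


# Hypothetical trajectory: the reward arrives only at the last step.
print(compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))
```

With lam=1 the result reduces to the rewards to go minus the value estimate, while lam=0 keeps only the one-step term delta_t.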
After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization (PPO) algorithm and its loss function in detail, including the loss of the value head and the entropy term.
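
As a small illustration of the PPO loss, here is a sketch of the clipped surrogate objective from the PPO paper, written as a loss to minimize. It covers only the policy term; the value head loss and the entropy term are left out. It is plain Python rather than the PyTorch code from the video, and the function name `ppo_clipped_loss` and the default `clip_eps=0.2` are assumptions made for this example.

```python
import math


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Return the negated mean of the clipped surrogate objective.

    new_logprobs and old_logprobs hold the log probabilities of the same
    actions under the current policy and under the policy that sampled them.
    """
    terms = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        # Importance sampling ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes the incentive to move the ratio
        # outside the interval [1 - clip_eps, 1 + clip_eps].
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)


# Hypothetical values: the new policy doubles the probability of an action
# with positive advantage, so the objective is capped at 1.2 * advantage.
print(ppo_clipped_loss([math.log(2.0)], [0.0], [1.0]))
```

When the new and old log probabilities are equal, the ratio is 1 and the loss is just the negated mean advantage.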
In the last part of the video, I will go through the implementation of RLHF/PPO, explaining the entire process line by line.

For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. - https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744. - https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. - https://arxiv.org/abs/1506.02438

Slides PDF and commented code: https://github.com/hkproj/rlhf-ppo

Chapters
00:00:00 - Introduction
00:03:52 - Intro to Language Models
00:05:53 - AI Alignment
00:06:48 - Intro to RL
00:09:44 - RL for Language Models
00:11:01 - Reward model
00:20:39 - Trajectories (RL)
00:29:33 - Trajectories (Language Models)
00:31:29 - Policy Gradient Optimization
00:41:36 - REINFORCE algorithm
00:44:08 - REINFORCE algorithm (Language Models)
00:45:15 - Calculating the log probabilities
00:49:15 - Calculating the rewards
00:50:42 - Problems with Gradient Policy Optimization: variance
00:56:00 - Rewards to go
00:59:19 - Baseline
01:02:49 - Value function estimation
01:04:30 - Advantage function
01:10:54 - Generalized Advantage Estimation
01:19:50 - Advantage function (Language Models)
01:21:59 - Problems with Gradient Policy Optimization: sampling
01:24:08 - Importance Sampling
01:27:56 - Off-Policy Learning
01:33:02 - Proximal Policy Optimization (loss)
01:40:59 - Reward hacking (KL divergence)
01:43:56 - Code walkthrough
02:13:26 - Conclusion
