
In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), the technique used to align models such as ChatGPT. I will start by introducing how language models work and what we mean by AI alignment. In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles, also explaining the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing a baseline) and how off-policy learning makes the training tractable.
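As a taste of what the derivation covers, here is a minimal sketch of the score-function (REINFORCE-style) policy gradient estimator for a softmax policy; the logits, actions, and rewards are made-up illustrative values, not part of the video's code.

```python
import math
import random

random.seed(0)

# Score-function (REINFORCE) estimator for a softmax policy over 3 actions.
# For a softmax policy, grad_logits log pi(a) = onehot(a) - pi, so each sampled
# trajectory contributes (onehot(a) - pi) * R to the Monte-Carlo gradient estimate.
logits = [0.1, -0.2, 0.3]
m = max(logits)
exps = [math.exp(l - m) for l in logits]
pi = [e / sum(exps) for e in exps]

n = 2000
grad = [0.0, 0.0, 0.0]
for _ in range(n):
    a = random.choices(range(3), weights=pi)[0]   # sample an action from the policy
    r = 1.0 if a == 2 else 0.0                    # pretend action 2 always earns reward 1
    for i in range(3):
        grad[i] += ((1.0 if i == a else 0.0) - pi[i]) * r / n

# Ascending this estimate raises the probability of the rewarded action
# and lowers the probability of the others.
assert grad[2] > 0 and grad[0] < 0 and grad[1] < 0
```

The high variance of exactly this kind of Monte-Carlo estimate is what motivates the baseline and Generalized Advantage Estimation later in the video.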

I will also describe how to build the reward model and explain the loss function of the reward model.
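The reward model is trained on human preference pairs with the pairwise loss used in InstructGPT, -log sigmoid(r_chosen - r_rejected); a minimal sketch (the function name and the scalar rewards standing in for the reward head's outputs are illustrative):

```python
import math

# Pairwise reward-model loss on a preference pair:
#   L = -log sigmoid(r_chosen - r_rejected)
# The scalars stand in for the reward head's outputs on the two completions.
def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the model scores the preferred completion higher,
# and grows when the rejected completion is scored higher instead.
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0) < pairwise_loss(0.0, 0.5)
```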

To calculate the gradient of the policy, we need the log probabilities of the state-action pairs (the trajectories), the value function, the rewards, and the advantage terms (through Generalized Advantage Estimation): I will explain every step visually.
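The GAE step can be sketched as a backward recursion over the TD residuals; the rewards and values below are illustrative numbers, and gamma/lam are the usual discount and GAE parameters:

```python
# Generalized Advantage Estimation over one finite trajectory (a sketch).
# `values` has len(rewards) + 1 entries: V(s_0) ... V(s_T), bootstrapping at the end.
def gae(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

adv = gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.3, 0.0])
assert len(adv) == 3
```

Setting lam=0 recovers the one-step TD advantage, while lam=1 gives the Monte-Carlo return minus the value baseline, which is the bias-variance trade-off the video explains.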

After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization (PPO) algorithm and its loss function, explaining all the details, including the value head loss and the entropy bonus.
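The core of the PPO loss is the clipped surrogate objective from Schulman et al. (2017); a minimal per-sample sketch (the function name is illustrative, and the full loss also adds the value head term and the entropy bonus covered in the video):

```python
# PPO clipped surrogate objective (a sketch): the probability ratio
# pi_new(a|s) / pi_old(a|s) multiplies the advantage, but is clipped to
# [1 - eps, 1 + eps]; taking the minimum keeps the update pessimistic.
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, a large ratio gains nothing past the clip range...
assert abs(ppo_clip_objective(1.5, 1.0) - 1.2) < 1e-9
# ...but with a negative advantage the unclipped (worse) term is kept.
assert abs(ppo_clip_objective(1.5, -1.0) + 1.5) < 1e-9
```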

In the last part of the video, I will go through the implementation of RLHF with PPO, explaining the entire process line by line.

For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. – https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744. – https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. – https://arxiv.org/abs/1506.02438

Slides PDF and commented code: https://github.com/hkproj/rlhf-ppo

Chapters

00:00:00 – Introduction

00:03:52 – Intro to Language Models

00:05:53 – AI Alignment

00:06:48 – Intro to RL

00:09:44 – RL for Language Models

00:11:01 – Reward model

00:20:39 – Trajectories (RL)

00:29:33 – Trajectories (Language Models)

00:31:29 – Policy Gradient Optimization

00:41:36 – REINFORCE algorithm

00:44:08 – REINFORCE algorithm (Language Models)

00:45:15 – Calculating the log probabilities

00:49:15 – Calculating the rewards

00:50:42 – Problems with Policy Gradient Optimization: variance

00:56:00 – Rewards to go

00:59:19 – Baseline

01:02:49 – Value function estimation

01:04:30 – Advantage function

01:10:54 – Generalized Advantage Estimation

01:19:50 – Advantage function (Language Models)

01:21:59 – Problems with Policy Gradient Optimization: sampling

01:24:08 – Importance Sampling

01:27:56 – Off-Policy Learning

01:33:02 – Proximal Policy Optimization (loss)

01:40:59 – Reward hacking (KL divergence)

01:43:56 – Code walkthrough

02:13:26 – Conclusion