Navigating the RLHF Landscape: From Policy Gradients to PPO, GAE, and DPO for LLM Alignment
Welcome to this blog post! If you’re keen on exploring Reinforcement Learning from Human Feedback (RLHF) in large language models (LLMs) and want to understand the process from the ground up—from basic policy gradient methods and the classic REINFORCE algorithm, through the derivation of Proximal Policy Optimization (PPO) using clipping objectives and Generalized Advantage Estimation (GAE) to balance bias and variance, and finally to an in-depth discussion on offline methods like DPO—you’ve come to the right place.
In this post, we start with the fundamentals of policy gradient optimization, gradually build up through the well-established REINFORCE algorithm, and then explore how to implement PPO via clipping objectives. We will also examine how GAE helps strike the ideal balance between bias and variance. Later on, we’ll derive and discuss offline training approaches such as DPO, providing insight into the advantages and challenges of each training path.
1. On-Policy vs. Off-Policy Reinforcement Learning
Today, the mainstream RLHF approaches in LLMs can be broadly categorized into two camps:
- On-Policy Methods (exemplified by PPO)
- Off-Policy Methods (exemplified by DPO)
But what exactly distinguishes On-Policy from Off-Policy? Here’s a simple rule of thumb:
- On-Policy: During training, the model actively generates its own data samples.
- Off-Policy: The training relies on pre-collected data (or data produced by another policy), eliminating the need for real-time generation.
Generally speaking, On-Policy methods tend to be more computationally demanding and time-consuming—the main bottleneck being the generation step. For generative tasks, the model must output tokens one at a time, a process that is extremely resource-intensive. Despite this slower pace, On-Policy approaches offer a higher theoretical performance ceiling because they continuously explore and update based on the model’s current state. This advantage will become even clearer when we dive deeper into PPO later.
Let’s first explore the On-Policy approach.
The Essence of On-Policy Learning
At its core, On-Policy learning involves having the model generate its own responses and then scoring those responses to guide subsequent parameter updates. In short, the key idea is to have the model “get in the game” itself.
Imagine you are a model learning to play chess. There are two possible training scenarios:
- Method One: You actively play chess, with a coach providing real-time feedback on every move. When you capture an opponent’s piece, your coach cheers you on; if you make an impulsive mistake that leads to a counterattack, your coach immediately advises you on how to improve.
- Method Two: Instead of playing, you’re given a collection of recordings—some from professional matches and others from subpar games—annotated to show which moves are effective and which are not. You passively learn by mimicking the good moves and avoiding the bad ones.
The fundamental difference between these two methods is whether you’re actually “playing” the game. Method One represents On-Policy training, where the model generates its own actions and learns from them; Method Two embodies Off-Policy training, where learning is solely based on existing data.
Off-Policy methods are generally faster during training because they rely on readily available data without requiring the model to generate new samples on the fly. However, they are highly sensitive to how well the pre-collected data aligns with the model’s current capabilities. If there’s a significant mismatch—whether the data is too challenging or too trivial—the learning process can suffer. On the other hand, On-Policy training sidesteps this issue since the training samples always reflect the model’s current performance level.
Key Components in an On-Policy Framework
In the context of language models, a typical On-Policy algorithm comprises the following components:
- Actor: The model that generates sentences (analogous to you playing chess).
- Critic: Acting like a coach, it provides immediate feedback for each generated output and is updated alongside the Actor as the model’s abilities evolve.
- Reward Model: Serving as the referee, it assigns final scores or preference evaluations. This component usually remains fixed during training.
- Reference Model: A unique element in PPO for large models, it prevents the Actor from straying too far from the original pre-trained distribution, thereby mitigating issues such as reward hacking.
Given that each of these components can be extremely large (often necessitating the simultaneous loading of multiple 70B-parameter models), On-Policy training typically demands enormous computational resources. This is why PPO is often described as being “computationally expensive.”
Next, we will focus on PPO—the most representative On-Policy method—to see how it practically balances training cost with learning efficiency.
2. PPO (Proximal Policy Optimization)
2.1 Starting with Policy Gradient Optimization
Imagine you’re a novice chess player just beginning to learn the game. Your goal is to improve your chances of winning by continuously refining your chess strategy (denoted as $\pi_{\theta}$, where $\theta$ represents your strategy parameters). Think of each game as a trajectory $\tau$ that you want to optimize in order to secure a higher reward.
More generally, the objective in reinforcement learning is to optimize a policy so as to maximize the expected return:
\[\pi^* = \arg \max_\pi J(\pi)\]Formally, the return of a policy is defined over all possible trajectories:
\[J(\pi_{\theta}) = \int_\tau P(\tau \mid \pi) R(\tau) = \mathbb{E}_{\tau \sim \pi} [R(\tau)]\]A trajectory is simply a sequence of states and corresponding actions:
\[\tau = (s_0, a_0, s_1, a_1, \dots)\]In our chess analogy, the state $s_t$ represents the current board configuration, while the action $a_t$ indicates where you decide to move next. The next state is governed by a probability distribution—reflecting, for instance, your opponent’s response:
\[s_{t+1} \sim P(\cdot \mid s_t, a_t)\]Thus, the probability of a trajectory $\tau$ is given by:
\[P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t) \pi(a_t \mid s_t)\]In reinforcement learning, we frequently discount future rewards—future returns are never as valuable as immediate ones. Therefore, the total return for a trajectory is defined as:
\[R(\tau) = \sum_{t=0}^\infty \gamma^t r_t\]where $\gamma \in [0, 1]$ is the discount factor and $r_t$ is the reward received at time $t$.
In deep learning we typically update parameters by minimizing a loss function using stochastic gradient descent. However, since our goal here is to maximize the return, we use stochastic gradient ascent to update the policy:
\[\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta}) \big|_{\theta_k}\]Here, $\nabla_{\theta} J(\pi_{\theta})$ is known as the policy gradient. In other words, after each game you review your moves to assess how each one contributed to the final outcome, and then adjust your strategy accordingly. This overall approach is known as a policy gradient algorithm.
However, just as in chess where you must consider all possible moves and board states, computing the exact gradient requires summing or integrating over all possible trajectories. In practice (unless the game is extremely simple), this is computationally infeasible—even if $R(\tau)$ is differentiable—because the long trajectory lengths make auto-differentiation very memory intensive. Therefore, we need to carefully derive a method for computing the policy gradient.
Derivation of the Policy Gradient
To derive a practical expression for the policy gradient—much like reviewing a game move by move—we start with the gradient of the objective function. Viewing each game as a trajectory $\tau$, the gradient of the objective is:
\[\nabla_{\theta}J(\pi_{\theta}) = \nabla_{\theta} \operatorname{E}_{\tau \sim \pi_{\theta}} [R(\tau)]\]Step 1: Expand the Expectation
This step is equivalent to considering all possible games by expanding the expectation into an integral over all trajectories:
\[= \nabla_{\theta} \int_{\tau} P(\tau\mid\theta) R(\tau) \, d\tau\]Step 2: Interchange the Gradient and the Integral
Similar to decomposing the influence of each move, we bring the gradient operator inside the integral:
\[= \int_{\tau} \nabla_{\theta} P(\tau\mid\theta) R(\tau) \, d\tau\]Step 3: Apply the Log-Derivative Trick
By using a mathematical trick known as the log-derivative (or likelihood ratio) trick—much like breaking down the significance of each move—we have:
\[= \int_{\tau} P(\tau\mid\theta) \nabla_{\theta} \log P(\tau\mid\theta) \cdot R(\tau) \, d\tau\]Step 4: Return to Expectation Form
Finally, we can rewrite the integral back into the expectation form:
\[= \operatorname{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log P(\tau\mid\theta) \cdot R(\tau) \right]\]Step 5: Decomposing $\nabla_{\theta} \log P(\tau\mid\theta)$
During a game, every move is determined by your decision at that time. If we represent a game trajectory $\tau$ as:
\[P(\tau\mid\theta) = \rho_0(s_0) \prod_{i=0}^{T-1} P(s_{i+1}\mid s_i, a_i) \pi_{\theta}(a_i\mid s_i)\]then $\pi_{\theta}(a_i\mid s_i)$ is the probability of selecting a particular move $a_i$ given the board state $s_i$. Taking the logarithm and then the gradient, we obtain:
\[\nabla_{\theta} \log P(\tau\mid\theta) = \sum_{i=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_i\mid s_i)\](Note: The opponent’s moves $P(s_{i+1}\mid s_i, a_i)$ are determined by fixed rules and do not depend on $\theta$, so their gradients vanish.)
2.2 Final Policy Gradient Formula
Substituting the above result back into our expectation, we arrive at the final policy gradient formula:
\[\nabla_{\theta} J(\pi_{\theta}) = \operatorname{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{i=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_i\mid s_i) \cdot R(\tau) \right]\]In this formula, the decision made at each move (captured by $\log \pi_{\theta}$) influences the overall outcome of the game, independent of the fixed rules governing the opponent’s responses. In practice, we approximate this expectation using Monte Carlo sampling—akin to improving your chess skills through repeated play. The sample-based policy gradient can be approximated as:
\[\hat{g} = \frac{1}{\mid\mathcal{D}\mid} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_{\theta} \log \pi_\theta (a_t \mid s_t)\, R(\tau)\]If you look closely, you’ll notice that $R(\tau)$ appears directly within the gradient calculation of the policy parameters.
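To make this estimator concrete, here is a minimal sketch, assuming PyTorch and that per-step log-probabilities have already been collected during sampling; the function name is illustrative. It builds a surrogate loss whose gradient is $-\hat{g}$, so minimizing it performs gradient ascent on $J(\pi_\theta)$:

import torch

def policy_gradient_loss(log_probs_per_traj, returns_per_traj):
    # log_probs_per_traj: list of 1-D tensors, each holding log pi_theta(a_t | s_t) for one trajectory
    # returns_per_traj:   list of floats, each the total return R(tau) of that trajectory
    losses = []
    for log_probs, R in zip(log_probs_per_traj, returns_per_traj):
        losses.append(-(log_probs.sum() * R))   # -(sum_t log pi(a_t|s_t)) * R(tau)
    return torch.stack(losses).mean()           # average over the sampled dataset D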
2.3 The REINFORCE Algorithm: Process and Implementation Steps
Let’s now introduce the classic policy gradient method—the REINFORCE algorithm—which is similar to playing games, reviewing your performance, and continuously refining your strategy:
- Construct the Policy Network
Build a neural network to define your chess strategy $\pi_{\theta}$:
- Input: The current board state $s_t$
- Output: A probability distribution over the next moves $P(a_t \mid s_t)$
- Trajectory Sampling
Play games using the current policy to sample trajectories $\tau$ and record the reward received at each move (for example, a reward after winning a game).
- You can either set a fixed number of moves per game (say, 100 moves) or play until the game ends.
- Gradient Computation
Compute the gradient estimate from the collected dataset $\mathcal{D}$ of games, much like assessing the contribution of each move in your review:\[\hat{g} = \frac{1}{\mid\mathcal{D}\mid} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\]
- Parameter Update
Update your policy parameters using stochastic gradient ascent—just as you would adjust your playing style based on your game reviews:\[\theta_{k+1} = \theta_k + \alpha\,\hat{g}\]
or equivalently:
\[\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta) \big|_{\theta_k}\]- Iterative Optimization
Repeat the cycle of “playing – reviewing – adjusting” until your policy converges and you can consistently perform at a high level.
Explanation of Core Formulas
1. Gradient Estimate Formula
\[\hat{g} = \frac{1}{\mid\mathcal{D}\mid} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \underbrace{\nabla_\theta \log \pi_\theta(a_t\mid s_t)}_{\text{Gradient of decision at each move}} \cdot \underbrace{R(\tau)}_{\text{Total reward for the game}}\]- Here, we use extensive Monte Carlo sampling to approximate the full expectation.
- The total reward $R(\tau)$ represents the outcome of the entire game, serving as a measure of the cumulative impact of your decisions.
2. Parameter Update Rule
\[\theta_{k+1} = \theta_k + \alpha \hat{g}\]- The term $\alpha$ is the learning rate, analogous to how much you adjust your strategy after reviewing a game; the gradient points in the direction that increases your winning probability.
3. Algorithm Characteristics
- Key Advantage: This method relies solely on your actual gameplay experience and does not require any prior knowledge of the opponent’s strategy (model-free).
- Computational Requirements: It necessitates a large number of game samples to mitigate the high variance inherent in the gradient estimates.
- Possible Improvements: Later approaches (such as Actor-Critic methods) introduce a value function baseline to stabilize policy updates—similar to receiving professional coaching feedback during your game reviews to accelerate improvement.
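To make the “playing – reviewing – adjusting” loop concrete, here is a compact REINFORCE sketch in PyTorch. A Gymnasium-style environment interface is assumed for illustration (`reset()` returning `(obs, info)`, `step()` returning `(obs, reward, terminated, truncated, info)`), and the network sizes, learning rate, and episode count are arbitrary choices, not prescriptions:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 64), nn.Tanh(), nn.Linear(64, n_actions))
    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))  # P(a_t | s_t)

def train_reinforce(env, policy, episodes=500, gamma=0.99, lr=1e-3):
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = [], []
        state, _ = env.reset()
        done = False
        while not done:                                   # sample one trajectory tau
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))       # log pi_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        R = sum(gamma ** t * r for t, r in enumerate(rewards))   # discounted return R(tau)
        loss = -torch.stack(log_probs).sum() * R                 # single-sample estimate of -J(pi_theta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()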
2.4 Challenges in Policy Gradient Optimization
A central assumption in policy gradient optimization is that we can reliably estimate the policy gradient using our chosen method. However, when the problem scales up—for instance, when each trajectory $\tau$ becomes very long or the policy model is extremely large—you must sample many trajectories to obtain an accurate gradient estimate; otherwise, you face high variance. Although the gradient estimator in policy gradient algorithms is theoretically unbiased (its expected value converges to the true gradient), its variance can be extremely high. Recall the gradient estimate:
\[\hat{g} = \frac{1}{\mid\mathcal{D}\mid} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\]where:
- $\mid\mathcal{D}\mid$ denotes the size of the dataset $\mathcal{D}$,
- $\pi_\theta$ is the current policy (your chess strategy),
- $R(\tau)$ is the total return of the game (trajectory $\tau$),
- $a_t, s_t$ represent the action and state at time $t$, respectively.
Imagine playing chess and trying to attribute the overall outcome to each move. If you attempt to credit every move with the entire game’s result, the evaluation becomes extremely unstable—that is, it exhibits high variance. Next, we explore methods to reduce this variance in our estimates.
2.5 Reducing Variance: Focusing Only on the Future
Notice in the gradient estimate above that, regardless of the current step $t$, $R(\tau)$ always includes rewards from the entire trajectory. This isn’t entirely reasonable since a decision should only consider its impact on future outcomes—the past cannot be changed and should not influence the evaluation of the current action.
Returning to our chess example: if every move’s score also factors in earlier moves (good or bad), it obscures the true value of the current decision. In practice, when evaluating the current move, you only need to consider the “future rewards”—the rewards accrued from this move until the game’s end. This concept is known as rewards-to-go.
Mathematically, we adjust our gradient estimate as follows:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) \right)\]Here, $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ represents the total rewards from the current move to the end of the game, and it now sits inside the sum over $t$, so each decision is weighted only by what follows it. This is analogous to reviewing a game by focusing only on the impact of moves from a certain point forward, ignoring what has already transpired.
By eliminating these redundant past rewards, the variance in our gradient estimates naturally decreases.
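A minimal sketch of computing the rewards-to-go for one sampled trajectory, in plain Python; setting `gamma=1.0` reproduces the undiscounted sum used in the formula above:

def rewards_to_go(rewards, gamma=1.0):
    # rtg[t] = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed in one backward pass
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards_to_go([1.0, 0.0, 2.0]) == [3.0, 2.0, 2.0]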
2.6 Reducing Variance: Introducing a Baseline
To further mitigate fluctuations in our evaluation, we can subtract a baseline value from the future rewards at each step. Mathematically, this baseline is often denoted as $b$ (in practice, we commonly use the value function $V^\pi(s)$ as the baseline). The modified gradient estimate becomes:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b(s_{i,t}) \right)\]Here, $b$ serves as a baseline—typically a function of the state $s_t$—representing the expected return from the current state. The term $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b(s_{i,t})$ then reflects the advantage, or the amount by which the actual return exceeds this baseline. In practice, we use this advantage instead of the raw reward in our gradient estimates to reduce variance.
In the context of aligning large language models, an extra linear layer is typically added on top of the language model (i.e., the policy $\pi_\theta$) to estimate the expected return $V^\pi(s)$ for a given state. This functions as a benchmark for each state, helping us gauge the true advantage of a decision. For a more intuitive explanation of why a baseline is necessary, see my previous blog post.
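As a sketch of how such a baseline is wired up in practice (the `base_lm` interface here is a simplifying assumption: it is taken to return per-token hidden states of shape `[batch, seq_len, hidden_size]`):

import torch.nn as nn

class ActorWithValueHead(nn.Module):
    def __init__(self, base_lm, hidden_size):
        super().__init__()
        self.base_lm = base_lm                       # the policy pi_theta (a causal LM)
        self.value_head = nn.Linear(hidden_size, 1)  # extra linear layer estimating V^pi(s_t)
    def forward(self, input_ids):
        hidden_states = self.base_lm(input_ids)              # [batch, seq_len, hidden_size] (assumed interface)
        values = self.value_head(hidden_states).squeeze(-1)  # [batch, seq_len]: one baseline per state
        return values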
2.7 Reducing Variance: Introducing $Q$ and $V$
Earlier we discussed the concept of rewards-to-go, represented by $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$. In reinforcement learning, the expected value of this quantity is known as the Q-function $Q^\pi(s, a)$, which captures the total future return when taking action $a$ in state $s$. By subtracting the state value $V^\pi(s)$, we obtain the advantage function:
\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]Using our chess analogy, the Q-function quantifies the potential outcome of the game after making a move in state $s$, while the state value represents the inherent strength of the board position. Even if the board is favorable, a poor move can reduce your relative advantage. Therefore, it’s important not only to consider the absolute score of a move but also to compare it against the baseline. A positive $A^\pi(s, a)$ indicates that the move significantly improves your position, whereas a negative value suggests it is suboptimal.
Ultimately, we can express the policy gradient as:
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})\]This formula encapsulates how the relative performance of each move (compared to the average) guides the adjustment of your strategy.
2.8 Explaining the Advantage Function
Simply put, the advantage function $A^\pi(s, a)$ tells you how much more likely a specific move $a$ in a given state $s$ is to improve your chances of winning compared to an average move. If a move yields an expected return significantly above the baseline, its advantage is positive—signaling it as a promising move; if not, it indicates underperformance relative to the norm.
In summary, by focusing solely on future rewards, introducing a baseline, and utilizing the advantage function, we can effectively reduce the variance in our gradient estimates. This approach is akin to reviewing a game by concentrating on the pivotal moves that truly alter the outcome, thereby making the policy updates more stable and targeted.
2.9 Estimating the Advantage Term – Using the Iterative GAE Strategy
There are several ways to estimate the advantage term. For example:
\[\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \big] - V^\pi(s_t)\] \[\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma r(s_{t+1}, a_{t+1}) + \gamma^2 V^\pi(s_{t+2}) \big] - V^\pi(s_t)\] \[\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma r(s_{t+1}, a_{t+1}) + \gamma^2 r(s_{t+2}, a_{t+2}) + \gamma^3 V^\pi(s_{t+3}) \big] - V^\pi(s_t)\]These examples illustrate that by summing over multiple steps, we can balance bias and variance:
- Truncating the sum of actual rewards too early introduces high bias, because the estimate then leans mostly on the value function $V^\pi$, which may itself be inaccurate.
- Accumulating too many actual rewards reduces that bias but increases variance, because each additional sampled reward adds randomness to the estimate.
To strike a balance in this bias-variance trade-off, we employ a weighted sum of these terms, known as Generalized Advantage Estimation (GAE):
\[\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\] \[\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}\]- This is a recursive formula. The advantage estimate at the final time step can be viewed as the first expansion, with each preceding step adding another layer weighted by the decay factor $\lambda$.
- By iteratively accumulating these terms across time steps, we balance the high variance of actual rewards with the high bias introduced by relying solely on the value function.
In the context of aligning large language models, this approach guides the policy (i.e., the language model) to increase the probability of selecting the next token that, on average, yields a reward above the baseline given a specific prompt (state). In other words, the model will tend to choose tokens that are more likely to lead to sequences meeting our desired reward criteria—resulting in outputs that are better aligned with our training data distribution.
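A minimal sketch of the backward GAE recursion for one finite trajectory, in plain Python; `values` is assumed to hold one extra bootstrap entry $V^\pi(s_{T+1})$, set to 0 if the sequence terminates:

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [r_0, ..., r_T]; values: [V(s_0), ..., V(s_T), V(s_{T+1})]
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                         # A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = gae
    return advantages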
2.10 The PPO Loss
In Proximal Policy Optimization (PPO), to prevent the policy from changing too drastically during updates, we design a specialized loss function composed of several key components:
Policy Loss
\[L_{\text{POLICY}} = \min \Bigg( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \hat{A}_t, \; \text{clip}\Bigg(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\Bigg) \, \hat{A}_t \Bigg)\]This component is akin to not overhauling your strategy all at once when playing chess. Instead, you prefer to fine-tune your moves gradually so that while you improve your position, you avoid taking reckless steps that could disrupt the overall structure.
Value Function Loss
\[L_{\text{VF}} = \frac{1}{2} \left\lVert V_\theta(s) - \left( \sum_{t=0}^T \gamma^t r_t \;\middle|\; s_0 = s \right) \right\rVert_2^2\]This term ensures that for every state, your estimated expected return (comparable to predicting how a game might unfold) is as close as possible to the actual returns received.
Entropy Loss
\[L_{\text{ENTROPY}} = - \sum_x p(x) \log p(x)\]The entropy loss encourages the policy to maintain a level of exploration. Much like a skilled chess player who not only masters known patterns but is also willing to experiment with new ideas to remain adaptable, this term prevents the policy from becoming too deterministic.
Overall PPO Loss
\[L_{\text{PPO}} = -L_{\text{POLICY}} + c_1 L_{\text{VF}} - c_2 L_{\text{ENTROPY}}\]Combining these components gives the total PPO loss to be minimized: the negative sign on the clipped policy term turns maximization of the surrogate objective into minimization, the value term keeps the critic accurate, and subtracting the entropy term rewards exploration. This loss function is designed to improve the win rate (or reward) when updating the policy while ensuring that the policy does not deviate too drastically from its original behavior—thus promoting stable and efficient learning.
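Putting the three terms together, here is a minimal per-batch sketch in PyTorch that mirrors the combined loss above. The coefficient values are illustrative defaults, and all inputs are assumed to be flattened 1-D tensors over the sampled tokens, except `entropy`, which is a scalar:

import torch
import torch.nn.functional as F

def ppo_total_loss(logp_new, logp_old, advantages, values, returns, entropy,
                   clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(logp_new - logp_old)                           # pi_theta / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()              # -L_POLICY: maximize the clipped objective
    value_loss = F.mse_loss(values, returns)                         # L_VF: regress the critic toward observed returns
    return policy_loss + c1 * value_loss - c2 * entropy              # subtract the entropy bonus to keep exploring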
2.11 Advantages of Using PPO
- Stability: The clipping operation ensures that policy updates are modest—just as you wouldn’t drastically change your playing style mid-game, each move remains consistent and measured.
- Sample Efficiency: PPO makes effective use of the game data collected, although in large-scale models, a substantial number of samples is still required.
- Intrinsic Safety: With clipping updates and KL-penalties relative to the reference policy, PPO effectively prevents drastic deviations during training, ensuring that generated outputs remain consistent with the pre-trained style.
Overall, just as an experienced chess player continuously improves by playing, reviewing, and fine-tuning their strategy, PPO achieves robust and efficient reinforcement learning through precise gradient updates coupled with restrictions on policy changes.
3. Understanding and Deriving GAE (Generalized Advantage Estimation)
Imagine participating in a high-stakes chess tournament. Every move not only affects the current board state but can also have long-lasting consequences for the rest of the game. To determine how much advantage a particular move provides, you must consider both the immediate score and its potential impact on future positions. Generalized Advantage Estimation (GAE) is designed to address exactly this problem—it helps evaluate how much better a certain action in a given state is compared to the average, much like reviewing a game to see whether a move significantly increased your chances of winning.
GAE innovatively combines the ideas of multi-step estimation and temporal difference (TD) learning. By using a series of past errors to predict future returns, it strikes an optimal balance between bias and variance. When used in conjunction with PPO, it’s like a chess player who not only has a robust strategic framework (PPO) but also leverages precise position evaluations (GAE) to continually optimize every move, resulting in more stable and effective overall performance.
3.1 Introducing the Concept of Residuals
When playing chess, you can rarely gauge the full value of a move immediately; instead, you estimate its effect based on your predictions of subsequent moves. The difference between this prediction and the actual outcome is known as the temporal difference (TD) residual.
Why Do We Need Residuals?
Imagine you’re at a critical juncture in a game. Based on experience, you believe that moving left might be advantageous, but after making the move, the outcome isn’t as favorable as anticipated. The discrepancy between your initial prediction and the actual result is the residual—it helps you correct your future estimates to make better decisions later on.
Mathematical Expression of the Residual
For policy gradient methods, the gradient of the expected return can be written as:
\[\nabla R_{\theta} = \mathbb{E}_{(a_t, s_t)\sim \pi_\theta} \left [A^{\theta}(a_t, s_t) \nabla \log P_{\theta} (a_t \mid s_t) \right]\]Here, $A^{\theta}(a_t, s_t)$ represents the advantage (the future return after baseline correction) of taking action $a_t$ in state $s_t$, which we denote as $A(t)$. If we simply define:
\[A(t) = r_t - V(s_t)\]then, to more accurately reflect the full impact of the current move on future outcomes, we introduce the TD residual. Specifically, for taking action $a_t$ in state $s_t$, arriving at state $s_{t+1}$ with an immediate reward $r_t$, we define the TD residual as:
\[\delta_t = r_t - \big(V(s_t) - \gamma V(s_{t+1})\big)\]- $r_t$ is like the immediate score you receive after making the move.
- $V(s_t)$ and $V(s_{t+1})$ are your predicted values for the current and next board positions—similar to a coach’s evaluation.
- $\gamma$ is the discount factor, indicating that future rewards are gradually de-emphasized.
Intuitive Explanation of the Residual
Suppose you’re playing chess and your current board state is $s_t$. You choose a move $a_t$ that leads to a new board state $s_{t+1}$ and you receive an immediate reward $r_t$. Your goal is to evaluate the overall impact of that move:
- Your estimate of the current state’s value, $V(s_t)$, represents the expected return if you continue playing according to your current policy.
- Your estimate for the next state, $V(s_{t+1})$, reflects the expected return from that point onward.
Ideally, you would have:
\[V(s_{t}) = r_t + \gamma V(s_{t+1})\]That is, the value of the current state should equal the immediate reward plus the discounted future value. In reality, however, $V(s_{t})$ may deviate from this ideal, and the TD residual $\delta_t$ quantifies that error. It tells you how much the actual outcome deviates from your expectation, much like realizing that a particular move had a better or worse impact than you had predicted, thereby providing a basis for adjusting your strategy.
To obtain a more accurate evaluation—just as you would review not just the immediate move but several subsequent moves—we define $A^k(t)$ as the cumulative advantage from time $t$ over the next $k$ moves:
\[\begin{align} A^1(t) & = \delta_t \\ A^2(t) & = \delta_t + \gamma \delta_{t+1} \\ A^3(t) & = \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} \\ A^k(t) & = \sum_{i=0}^{k-1} \gamma^{i} \delta_{t+i} \end{align}\]When $k$ tends to infinity:
\[A^k(t) = \sum_{i=0}^{\infty} \gamma^{i} \delta_{t+i}\]Based on the definition of the residual, this expands to:
\[A^k(t) = r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)\]For finite $k$, substituting the definition of the residual and telescoping the sum gives the expression above. This is analogous to considering all the scores from the current move until the end of the game. Clearly, the more moves you include (larger $k$), the smaller your bias becomes, but the variance may increase due to more randomness; conversely, focusing only on the immediate moves introduces higher bias but reduces variance.
3.2 The Bias-Variance Tradeoff
In practice, rather than summing rewards over a fixed number of steps $k$, we employ an exponentially weighted approach—this is the core idea behind GAE.
First, we define the TD residual as:
\[\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\]Then, we accumulate the advantage using the recursive formula:
\[\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}\]Here, $\lambda \in [0,1)$ determines how much weight you give to future moves:
- When $\lambda=1$, you consider the full effect of the entire game (minimal bias but maximal variance).
- When $\lambda=0$, you rely solely on the immediate information (maximal bias but minimal variance).
Alternatively, we can represent the multi-step advantage as a weighted sum. Define:
\[A^{\text{GAE1}}(t) = A_t^1 + \lambda A_t^2 + \lambda^2 A_t^3 + \dots\]Expanding this, we have:
\[\begin{align} A^{\text{GAE1}}(t) & = \delta_t + \lambda(\delta_t + \gamma \delta_{t+1}) + \lambda^2 (\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2}) + \dots \\ & = \delta_t (1+\lambda+\lambda^2+\dots) + \gamma \delta_{t+1} (\lambda+\lambda^2+\dots) + \gamma^2 \delta_{t+2} (\lambda^2+\lambda^3+\dots) \end{align}\]Assuming we sum only the first $k$ terms, using the formula for a geometric series we get:
\[A^{\text{GAE1}}(t) = \delta_t \frac{1 - \lambda^k}{1 - \lambda} + \gamma \delta_{t+1} \frac{\lambda(1 - \lambda^k)}{1 - \lambda} + \gamma^2 \delta_{t+2} \frac{\lambda^2(1 - \lambda^k)}{1 - \lambda} + \dots\]As $k \rightarrow \infty$:
\[A^{\text{GAE1}}(t) = \delta_t \frac{1}{1 - \lambda} + \gamma \delta_{t+1} \frac{\lambda}{1 - \lambda} + \gamma^2 \delta_{t+2} \frac{\lambda^2}{1 - \lambda} + \dots\]Multiplying both sides by $1-\lambda$ yields:
\[(1-\lambda) A^{\text{GAE1}}(t) = \delta_t + \gamma\lambda \delta_{t+1} + \gamma^2\lambda^2 \delta_{t+2} + \dots\]If we define $ A^{\text{GAE}}(t) = (1-\lambda) A^{\text{GAE1}}(t) $, we finally obtain:
\[A^{\text{GAE}}(t) = \delta_t + \gamma\lambda \delta_{t+1} + \gamma^2\lambda^2 \delta_{t+2} + \dots = \sum_{k=0}^{\infty} (\lambda \gamma)^k \delta_{t+k}\]This formulation clearly shows how adjusting $\lambda$ balances the influence of short-term versus long-term rewards. Just as a chess player neither focuses solely on the next move nor attempts to predict every possible outcome, an appropriate choice of $\lambda$ allows you to weight near-term and far-term effects optimally:
- When $\lambda=1$, the advantage estimate includes information from the entire game (minimal bias but high variance): $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$, i.e., the full Monte Carlo return minus the baseline.
- When $\lambda=0$, only the immediate information is considered (maximal bias but minimal variance): $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, i.e., the one-step TD residual.
In summary, $\lambda$ is a crucial hyperparameter that balances bias and variance:
- A larger $\lambda$ means more future observations are taken into account, reducing bias but increasing variance.
- A smaller $\lambda$ relies more on the current estimate, leading to higher bias but lower variance.
This method is analogous to a chess player who not only considers the immediate effects of a move but also judiciously weighs the implications of the subsequent moves, thereby making the optimal decision.
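As a quick numeric sanity check of this derivation, the backward recursion and the $(\gamma\lambda)^k$-weighted sum of residuals give identical results on a toy sequence of made-up TD residuals:

def gae_recursive(deltas, gamma, lam):
    advantages, acc = [], 0.0
    for d in reversed(deltas):
        acc = d + gamma * lam * acc
        advantages.append(acc)
    return list(reversed(advantages))

def gae_weighted_sum(deltas, gamma, lam):
    return [sum((gamma * lam) ** k * deltas[t + k] for k in range(len(deltas) - t))
            for t in range(len(deltas))]

deltas = [0.5, -0.2, 0.1, 0.3]   # made-up residuals, purely for illustration
assert all(abs(a - b) < 1e-9
           for a, b in zip(gae_recursive(deltas, 0.99, 0.95),
                           gae_weighted_sum(deltas, 0.99, 0.95)))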
4. The Token-per-Token Process for Training LLMs with PPO
Below is a straightforward, token-by-token explanation of what happens when using PPO to train an LLM. This section clarifies the roles of the parameters ($\theta_{\text{old}}$ and $\theta$) and walks through the process of the first update.
4.1 Aligning LLM with PPO
- Old Policy Parameters ($\theta_{\text{old}}$): These are the model parameters used to generate data (for example, token sequences). Before each PPO update, you sample tokens using these parameters and record the probability of generating each token (typically stored as a log-probability).
- Current Policy Parameters ($\theta$): These are the parameters being updated during training. Through the PPO algorithm, you adjust these parameters based on the sampled data so that the tokens generated better align with the reward signals. After the update, $\theta$ will differ from $\theta_{\text{old}}$.
You can think of $\theta_{\text{old}}$ as the “old version” of the model, while $\theta$ is the “new version” that, after one training iteration, should perform better (or conform more closely to the reward signal). After each update, the new model becomes the “old model” for the next round of sampling.
2. The “Token-per-Token” Training Process
Assume you already have a pre-trained LLM with its parameters set as $\theta_{\text{old}}$. The following outlines the process step by step:
(1) Sampling Phase
- Providing a Prompt
You input a prompt into the LLM (using $\theta_{\text{old}}$) to generate text.
- Generating Tokens One by One
The model generates the next token (action) based on the current context (state). For instance, when generating the t-th token:
- The current state $s_t$ consists of the prompt along with the preceding $t-1$ tokens.
- The model selects token $a_t$ as its action and assigns it a probability $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ (typically recorded as a log-probability).
- Recording Data
For each token, you log:
- The state $s_t$ (the context)
- The action $a_t$ (the generated token)
- The generation probability (or log-probability) under the old policy
- Optionally, any reward information (such as a score from a reward model) and the estimated value $V(s_t)$.
- This collection forms a trajectory—a sequence of tokens along with their associated data.
(2) Computing the Advantage
In PPO, it is necessary to compute an advantage $A_t$ for each token (using GAE, multi-step, or single-step methods).
- For example, for the t-th token, you might compute the GAE-style advantage:\[A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the one-step temporal difference error.
- In practice, because the trajectory is finite, the summation is computed only until the end of the sequence.
(3) Update Phase: From $\theta_{\text{old}}$ to $\theta$
Once you have sampled a batch of token data, you update the model parameters using PPO. The token-per-token process unfolds as follows:
- Using the Old Policy as Reference
You have recorded the log-probability of each token as generated by $\theta_{\text{old}}$, i.e., $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$.
- Recomputing Probabilities with the Current Policy
With the current model parameters $\theta$ (initially, $\theta$ is identical to $\theta_{\text{old}}$, but it changes through successive gradient updates), you recalculate the log-probability for generating the same token $a_t$ given the same state $s_t$, denoted by $\log \pi_\theta(a_t \mid s_t)$.
- Calculating the Probability Ratio
For each token, compute the ratio:\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} = \exp\bigl(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\text{old}}}(a_t \mid s_t)\bigr)\]
This ratio indicates the degree to which the new model’s likelihood of generating that token has changed relative to the old model.
- Constructing the PPO Loss (Per-Token Loss)
The goal of PPO is to limit how much the new policy deviates from the old one. For each token, the loss is computed based on the advantage $A_t$ and the probability ratio $r_t(\theta)$:\[L_t(\theta) = -\min\Bigl(r_t(\theta)\,A_t,\; \text{clip}\bigl(r_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\,A_t\Bigr)\]
In plain terms:
- If the product of the probability ratio $r_t(\theta)$ and the advantage $A_t$ falls within the acceptable range (i.e., between $1-\epsilon$ and $1+\epsilon$), that value is used.
- Otherwise, if the ratio is too high or too low, the clipped value is used to prevent excessively large updates.
- Averaging Over All Tokens and Updating $\theta$
The losses $L_t(\theta)$ for all tokens are averaged to form the overall batch loss. You then update the model parameters $\theta$ using gradient descent (or another optimizer).
- At this point, $\theta$ starts to diverge from $\theta_{\text{old}}$, meaning the new model has been “improved” over the old one.
- Updating $\theta_{\text{old}}$
After completing a full PPO update (usually over several epochs on the same batch of data), you set $\theta_{\text{old}}$ to the current $\theta$ to prepare for the next round of sampling.
3. Pseudocode
Below is a pseudocode-style algorithm that outlines the token-per-token PPO update process:
# Initialization: Set the pre-trained LLM parameters as θ_old, and copy them to θ
θ_old = PretrainedLLM.parameters
θ = copy(θ_old)

# Sampling Phase: Generate a batch of data using θ_old
for each prompt in dataset:
    trajectory = []
    state = prompt
    while not end_of_sequence:
        token, logpi_old = θ_old.generate_token(state)
        reward = RewardModel.score(state, token)   # placeholder: often only the final token receives a nonzero reward
        value = ValueHead(state)                   # placeholder: critic estimate V(s_t)
        # Record the current state, token, log-probability under θ_old, reward, and value
        trajectory.append( (state, token, logpi_old, reward, value) )
        state = state + token                      # Update the state (append token)
    store trajectory

# Compute Advantages (e.g., using GAE), walking backwards through each trajectory
for each trajectory:
    A_next = 0                                     # no advantage beyond the last token
    for t from last token downto first:
        δ_t = reward[t] + γ * V(state[t+1]) - V(state[t])   # with V(state[T+1]) = 0 at the end of the sequence
        A[t] = δ_t + γ * λ * A_next                # Recursively compute the advantage
        A_next = A[t]

# PPO Update Phase: Multiple epochs on the same batch of data
for each PPO update epoch:
    token_losses = []
    for each token data (state s_t, token a_t, logpi_old, A_t) in batch:
        # 1. Compute the log-probability under the current policy
        logpi_current = θ.log_probability(s_t, a_t)
        # 2. Calculate the probability ratio
        r_t = exp( logpi_current - logpi_old )
        # 3. Compute the unclipped and clipped objectives
        loss_unclipped = r_t * A_t
        loss_clipped = clip(r_t, 1-ε, 1+ε) * A_t
        # 4. The token loss is the negative of the minimum of these two values
        token_losses.append( -min(loss_unclipped, loss_clipped) )
    # 5. Average the loss over all tokens and perform a gradient update (once per epoch, outside the per-token loop)
    θ = Update(θ, average(token_losses))

# After updating, copy θ to θ_old for the next round of sampling
θ_old = copy(θ)
This pseudocode summarizes the token-per-token PPO update process for training an LLM, illustrating how the old and new policy parameters are used and updated iteratively.
5. DPO: Learning Chess by Studying Chess Manual
Earlier, we discussed how PPO is similar to having a coach on a live chessboard—guiding you in real time as you play and adapt your strategy (i.e., online learning). In contrast, DPO is more like sitting at home and studying a chess manual (i.e., using offline data), where you deduce how to improve your moves based on pre-existing win–loss comparisons. In this section, we will derive the mathematical principles behind DPO (Direct Preference Optimization) and explain its strengths and limitations compared to PPO (or, more generally, other RLHF approaches).
Before we dive in, keep in mind the following notation; the three key objective functions below are built from these components:
- $r_{\boldsymbol\phi}$ represents the Reward Model.
- $\pi_{\boldsymbol\theta}$ is the aligned model (policy) we want to train.
- $\pi_{\text{ref}}$ is the reference model (which both PPO and DPO rely on to keep the policy from drifting too far).
These three objectives are defined as follows:
- Reward Model Loss:\[\mathcal{L}_{R}(r_{\phi}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Bigl[\log \sigma\bigl(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\bigr)\Bigr]\]
- PPO Loss:\[\max_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\bigl[r_{\phi}(x, y)\bigr] \;-\; \beta\,\mathbb{D}_{KL}\bigl(\pi_{\theta}(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\bigr)\]
- DPO Loss:\[\mathcal{L}_{\text{DPO}}(\pi_{\theta}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\Bigl[\log \sigma\Bigl(\beta\,\log\frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Bigr)\Bigr]\]
Here, the KL divergence is defined as:
\[\mathbb{D}_{KL}(P\|Q) =\;\sum_i P(i)\,\log\!\Bigl(\frac{P(i)}{Q(i)}\Bigr).\]In essence, while PPO adjusts the policy based on live feedback using a KL penalty to tether the new policy to the reference, DPO leverages pre-collected comparative data (the “chess game records”) to directly optimize the policy based on preference comparisons. This approach has its own advantages and challenges, which we will explore in the following sections.
5.1 Directly Solving for the Optimal Aligned Model from the Optimization Objective
Let’s begin with the PPO loss function and perform a series of mathematical transformations. This is analogous to a chess coach (the Reward Model) providing real-time feedback while applying a KL divergence penalty to prevent your policy from straying too far from a reference model.
- Substitute the KL Divergence Formula:\[\max_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[r_{\phi}(x,y) - \beta\,\log\frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\Bigr]\]
- Extract the Constant $-\tfrac{1}{\beta}$ and Apply an Identity Transformation:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta}\,r_{\phi}(x,y)\Bigr]\]
- Continue the Transformation:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y)/\beta}}\Bigr]\]
- Obtain:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y)/\beta}} - \log Z(x)\Bigr]\]
At this point, we have constructed a new distribution $\pi_{r}(y\mid x)$ defined as:
\[\pi_{r}(y\mid x) = \frac{\pi_{\text{ref}}(y\mid x) e^{r_{\phi}(x,y) / \beta}}{Z(x)}\]where $Z(x)$ is a normalization constant that ensures $\pi_r$ is a proper probability distribution (i.e., the probabilities sum to 1):
\[Z(x) =\;\sum_{y}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r_{\phi}(x,y)}{\beta}\Bigr).\]In this expression, the numerator of $\pi_r$ weights the reference model’s probability of a response $y$ by its exponentiated reward $e^{r_{\phi}(x,y)/\beta}$, while the denominator $Z(x)$ sums these weighted terms over all possible outputs $y$ for the same input $x$. This normalization confines the values to the interval $[0, 1]$ and makes them sum to 1, thereby satisfying the basic requirement for forming a probability distribution.
Although we do not know the exact form of $\pi_{r}(y\mid x)$, we do have the precise expression for the reference distribution $\pi_{\text{ref}}(y\mid x)$. Leveraging this, one can approximate the distribution $\pi_{r}(y\mid x)$ by feeding the input $x$ into the reference model and either iterating over all possible $y$ or sampling a sufficient number of $y$ values. Note, however, that this approach poses computational challenges in practice, which we will discuss further later on.
- Continue with an Equivalent Transformation of the PPO Loss:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\pi_{r}(y \mid x)} - \log Z(x)\Bigr]\]
- Simplify:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D}}\Bigl[\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\pi_{r}(y \mid x)}\Bigr] - \log Z(x)\Bigr]\]
- Ignore the $\log Z(x)$ Term (as It Is Independent of $\pi_{\theta}$) to Obtain:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\Bigl[\log\frac{\pi_{\theta}(y \mid x)}{\pi_{r}(y \mid x)}\Bigr]\]
- Express This in the Form of a KL Divergence:\[= \min_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D}}\Bigl[\mathbb{D}_{KL}\bigl(\pi_{\theta}(\cdot \mid x)\,\|\,\pi_{r}(\cdot \mid x)\bigr)\Bigr]\]
At this point, our objective is reduced to focusing solely on the KL divergence. Since the KL divergence is always non-negative and equals zero only when the two distributions are identical, the optimal condition is achieved when $\pi_{\theta}(y\mid x)$ exactly equals $\pi_{r}(y\mid x)$. This provides us with an explicit solution: the optimal probability distribution under PPO is precisely $\pi_{r}(y\mid x)$.
In other words, if the parameters of the reward model $r_{\phi}$ are fixed, then the optimal solution for PPO is
\[\pi_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r_{\phi}(x,y)}{\beta}\Bigr)}{Z(x)}.\]In practice, however, the reward function $r(x,y)$ used in alignment training is not arbitrarily chosen; it is obtained through data-driven training to yield an optimal reward model. That is, we first train an ideal reward model \(r_{\phi}^{\ast}\) (analogous to a coach scoring a game), and then, based on this optimal reward model, we further train an aligned model that “plays chess well.” Consequently, the optimal reward model \(r_{\phi}^{\ast}\) and the aligned model it produces, \(\pi^{\ast}_{r}\), continue to satisfy the relationship:
\[\pi^{\ast}_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r_{\phi}^{\ast}(x,y)}{\beta}\Bigr)}{Z(x)}.\]Summary
- First, we defined an overall optimization objective for aligning with human preferences. This objective function assumes that we have a reward function and seeks to find an aligned model that maximizes this objective. It can be understood as, in a game of chess, striving to adopt a strategy (a way of playing) that maximizes your win rate based on the coach’s (reward function’s) evaluation. Mathematically, this objective is written as:\[\max_{\pi_{\theta}}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\bigl[r_{\phi}(x, y)\bigr] \;-\; \beta\,\mathbb{D}_{KL}\bigl(\pi_{\theta}(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\bigr)\]
- Next, starting from this optimization objective, we derived the explicit solution for the aligned model $\pi$ when the reward function $r$ is fixed. Analogous to finding the optimal move in a chess game from historical records, this explicit solution is given by:\[\pi_{r}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y)/\beta}}{Z(x)}\]
where the normalization partition function $Z(x)$ is defined as
\[Z(x) = \sum\limits_{y} \pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y) / \beta}.\]- Finally, in practical training, we typically do not train the reward model in isolation. Instead, we train the aligned model directly under the guidance of the optimal reward model $r_{\phi}^{\ast}$. In other words, just as continuous practice and coach feedback eventually yield a strategy that truly wins games, we adjust the above formula slightly to obtain:\[\pi^{\ast}_{r}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}^{\ast}(x,y)/\beta}}{Z(x)}\]
These three steps illustrate the entire process—from defining the overall alignment objective, to deriving the optimal policy, and finally linking the reward model with the aligned model. The entire procedure is akin to first establishing the criteria for winning a game (the optimization objective), then deducing the best moves from historical records (the explicit policy solution), and ultimately, through continuous practice and feedback (reward model training), arriving at a strategy that both evaluates correctly and performs effectively.
5.2 Skipping the Training of the Reward Model
Although we have formally derived
\[\pi_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\frac{r_{\phi}(x,y)}{\beta}\Bigr)}{Z(x)},\]in practice this explicit solution is difficult to apply directly because:
- Estimating $Z(x)$ is challenging: It requires either exhaustively enumerating or adequately sampling all possible responses $y$ for a given prompt $x$ to compute and accumulate $\pi_{\text{ref}}(y\mid x)\,\exp\!\bigl(r_\phi(x,y)/\beta\bigr)$, which is extremely costly (see the sketch after this list).
- Our original goal was to bypass training the reward model: We intended to learn an aligned model in one step rather than first training a reward function. However, $\pi_{r}$ still depends on the reward function $r$, which is a step away from our desired outcome of directly “learning to play chess well” without an intermediate scoring phase.
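To see why the first point is costly, here is a sketch of a naive Monte Carlo estimate of $Z(x)$; `ref_model.sample` and `reward_model.score` are hypothetical helpers standing in for a full generation pass and a reward-model forward pass, both of which would have to run many times for every single prompt:

import math

def estimate_Z(x, ref_model, reward_model, beta, num_samples=1024):
    # Z(x) = E_{y ~ pi_ref(.|x)}[ exp(r_phi(x, y) / beta) ], approximated by sampling
    total = 0.0
    for _ in range(num_samples):
        y = ref_model.sample(x)                              # draw a full response from pi_ref (hypothetical helper)
        total += math.exp(reward_model.score(x, y) / beta)   # weight it by its exponentiated reward (hypothetical helper)
    return total / num_samples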
This leads us to consider the inverse: if we have the optimal aligned model $\pi^{\ast}$, can we derive its corresponding reward function $r_{\phi}^{\ast}$? The answer is affirmative. Starting from
\[\frac{\pi^{\ast}(y\mid x)\,Z(x)}{\pi_{\text{ref}}(y\mid x)} =\;\exp\!\Bigl(\frac{r_{\phi}^{\ast}(x,y)}{\beta}\Bigr),\]we can equivalently transform it into:
\[r_{\phi}^{\ast}(x,y) =\;\beta\,\log\!\Bigl(\frac{\pi^{\ast}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\Bigr)\;+\;\beta\,\log Z(x).\]This expression allows us to represent $r^{\ast}$ in terms of $\pi^{\ast}$, thereby bridging the gap between training the reward model and the aligned model.
Summary of this section:
Since we can express the optimal reward model $r_{\phi}^{\ast}$ in terms of the optimal aligned model $\pi^{\ast}$ (just as repeated gameplay and analysis can help you deduce the best evaluation criteria for chess moves), we can directly substitute $\pi^{\ast}$ into the reward model’s training objective. In other words, while it may appear that you are training a reward model, you are in fact directly obtaining the optimal aligned model in one fell swoop. This achieves our original goal of “having a model that can both evaluate and play chess well.”
The remaining challenge then shifts to training the reward model itself. Typically, we adopt a preference-ranking approach for data annotation, analogous to having annotated chess game records that indicate which moves are superior. Generally, there are two main methods:
- Generating Only Two Responses
For a given prompt $x$ (or chess position), generate two responses (moves), for example, `<prompt x, chosen y1, reject y2>`. Human annotations indicate which move is better, and our objective is to have the reward model assign a higher score to the chosen move while scoring the rejected move lower.
- Generating K Responses (K > 2)
For the same prompt $x$, generate multiple responses (moves), such as `<prompt x, y1, ..., yK>`. Suppose human annotators provide a preference ranking $\tau$ (e.g., `y2 > y3 > y1 > ... > yK`). We aim for the reward model to give the highest overall score to the true ranking $\tau$ while scoring any other ordering lower.
In some training frameworks (such as ChatGPT’s implementation), when more than two responses are generated, the system breaks them down into pairwise comparisons so that the objective remains consistent with the two-response case. In more general scenarios, however, the entire set of preference rankings is treated as a whole, with the expectation that the true ranking $\tau$ receives the highest score. The derivation of DPO is based on this holistic view of preference ranking, and in the following sections we will derive the final DPO objective function separately for the cases of $K=2$ and $K>2$.
BT Model: Generating Only Two Responses
Imagine that while playing chess your coach shows you only two candidate moves—one labeled “good” (chosen) and the other “bad” (reject). In this case, your goal is for your “scoring system” (the reward model) to evaluate the good move as significantly superior to the bad one. To model this, we can employ the classic Bradley-Terry (BT) model, originally proposed in 1952 to analyze pairwise comparisons, and widely used in sports, market research, and other fields. For a pair of items $y_1$ and $y_2$ (here representing the chosen and rejected responses), the BT model expresses the probability that $y_1$ beats $y_2$ as:
\[P(y_1 > y_2 \mid x) \;=\; \frac{e^{\lambda_{y_1}}}{e^{\lambda_{y_1}} + e^{\lambda_{y_2}}},\]where $\lambda_{y_1}$ and $\lambda_{y_2}$ denote their respective strength parameters—analogous to using historical win rates to gauge the strength of chess players. In our context, when $y_1$ and $y_2$ correspond to the chosen and rejected responses, these parameters can be interpreted as the scores given by the reward model.
Our objective is to maximize the probability that $y_1$ beats $y_2$ across the entire annotated dataset \(\mathcal{D} = \left\{x^i, y_w^i, y_l^i \right\}_{i=1}^N\). That is, we want the chosen responses to consistently outperform the rejected ones. Accordingly, the overall optimization objective for the reward function can be formulated as:
\[\begin{aligned} L(r_{\phi}, \mathcal{D}) &= -\,\mathbb{E}_{x,\,y_w,\,y_l \in \mathcal{D}}\Bigl[\log\bigl(P(y_w > y_l \mid x)\bigr)\Bigr] \\ &= -\,\mathbb{E}_{x,\,y_w,\,y_l \in \mathcal{D}}\Bigl[\log\!\Bigl(\frac{e^{r(x, y_w)}}{\,e^{r(x, y_w)} + e^{r(x, y_l)}\,}\Bigr)\Bigr] \\ &= -\,\mathbb{E}_{x,\,y_w,\,y_l \in \mathcal{D}}\Bigl[\log\!\Bigl(\frac{1}{1 + e^{- \bigl(r(x, y_w) - r(x, y_l)\bigr)}}\Bigr)\Bigr] \\ &= -\,\mathbb{E}_{x,\,y_w,\,y_l \in \mathcal{D}}\Bigl[\log\!\Bigl(\sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)\Bigr)\Bigr]. \end{aligned}\]The final expression above is precisely the reward model optimization objective employed in systems like ChatGPT. Intuitively, this objective encourages the reward model to output scores such that the chosen response clearly outperforms the rejected one.
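A minimal sketch of this pairwise objective in PyTorch; the inputs are assumed to be batches of scalar scores produced by the reward model for the chosen and rejected responses:

import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # -log sigma( r(x, y_w) - r(x, y_l) ), averaged over the batch of preference pairs
    return -F.logsigmoid(r_chosen - r_rejected).mean()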
Assuming we have found the optimal reward model of the form
\[r_{\phi}^{\ast}(x,y) \;=\; \beta\,\log\!\Bigl(\frac{\pi^{\ast}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\Bigr) \;+\; \beta\,\log Z(x),\]substituting this optimal reward function into our earlier objective yields:
\[\begin{aligned} & \text{Reward Model Loss} \\& =\underset{\pi^{\ast}}{\max}\,\Biggl\{\mathbb{E}_{(x,y_{\text{w}},y_{\text{l}}) \sim \mathcal{D}} \Biggl[\log \sigma\Bigl( \beta\,\log\frac{\pi^{\ast}(y_{\text{win}} \mid x)}{\pi_{\text{ref}}(y_{\text{win}} \mid x)}+\beta\,\log Z(x) \;-\; \beta\,\log\frac{\pi^{\ast}(y_{\text{loss}} \mid x)}{\pi_{\text{ref}}(y_{\text{loss}} \mid x)}-\beta\,\log Z(x)\Bigr)\Biggr] \Biggr\} \\[6pt]& =\underset{\pi_{\theta}}{\max}\,\Biggl\{\mathbb{E}_{(x,y_{\text{w}},y_{\text{l}})\sim \mathcal{D}} \Biggl[\log \sigma\Bigl( \beta\,\log\frac{\pi^{\ast}(y_{\text{win}} \mid x)}{\pi_{\text{ref}}(y_{\text{win}} \mid x)}-\beta\,\log\frac{\pi^{\ast}(y_{\text{loss}} \mid x)}{\pi_{\text{ref}}(y_{\text{loss}} \mid x)}\Bigr)\Biggr] \Biggr\}. \end{aligned}\]This result shows that the reward model’s training objective has been transformed into one that depends solely on the aligned model $\pi$. In effect, by this method, we can bypass separately training the reward model and directly use the annotated pairwise preference data to train the aligned model $\pi_\theta$ in one fell swoop—just as you would learn the best moves directly from chess records. Thus, after a minor adjustment and setting our trainable aligned model as $\pi_\theta$, the final objective becomes:
\[\begin{aligned} & \text{Reward Model Loss} \\& =\underset{\pi_{\theta}}{\max}\,\Biggl\{\mathbb{E}_{(x,y_{\text{w}},y_{\text{l}})\sim \mathcal{D}} \Biggl[\log \sigma\Bigl( \beta\,\log\frac{\pi_{\theta}(y_{\text{win}} \mid x)}{\pi_{\text{ref}}(y_{\text{win}} \mid x)}-\beta\,\log\frac{\pi_{\theta}(y_{\text{loss}} \mid x)}{\pi_{\text{ref}}(y_{\text{loss}} \mid x)}\Bigr)\Biggr] \Biggr\} \\[6pt]& = \text{DPO Loss}. \end{aligned}\]PL Model: Generating K (K > 2) Responses
Now, imagine a chess match where, instead of considering just two candidate moves, you evaluate multiple possible moves simultaneously. For example, at a critical moment your coach might present you with K different moves and ask you to rank them based on their potential outcomes. This scenario corresponds to the RLHF setting in which K responses are generated and ranked by preference.
Unlike the BT model, which focuses on pairwise comparisons when only two responses are generated, here we use a statistical approach based on the Plackett-Luce model, which can rank multiple candidates. Let $\tau$ denote the ground-truth ranking provided by human annotators—i.e., the ideal ranking of moves determined by the coach. We want this true ranking $\tau$ to outperform any other possible ordering. To this end, we define the probability that the true ranking $\tau$ beats all other orderings as:
\[P(\tau \mid y_1, y_2, \dots, y_K, x) =\;\prod_{k=1}^{K}\frac{e^{r(x, y_{\tau_k})}}{\sum_{j=k}^{K} e^{r(x, y_{\tau_j})}}.\]Here, $\tau_k$ denotes the k-th move in the true ranking $\tau$ (for instance, $\tau_1$ is the most preferred move, $\tau_2$ is the second best, etc.), and $x$ represents the current game state (or prompt). Intuitively, we expect the top-ranked move ($\tau_1$) to score highest among all candidates; the second-ranked move ($\tau_2$) should score highly among the remaining moves; and so on.
Next, we substitute the optimal reward function $r_{\phi}^{\ast}(x,y)$ into the above equation. First, express $r_{\phi}^{\ast}(x,y)$ as
\[r_{\phi}^{\ast}(x,y) =\;\beta \,\log\!\Bigl(\frac{\pi^{\ast}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\Bigr) \;+\;\beta\,\log Z(x),\]then the probability becomes
\[P(\tau \mid y_1, y_2, \dots, y_K, x) = \prod_{k=1}^{K}\frac{e^{r^*_\phi(x, y_{\tau_k})}}{\sum_{j=k}^{K} e^{r^*_\phi(x, y_{\tau_j})}}.\]We then express $r_{\phi}^{\ast}$ in terms of $\pi^{\ast}$ (note: here $Z(x)$ can be regarded as a normalization constant independent of $\pi$ and omitted in subsequent analysis), so that the expression can be rewritten as
\[P(\tau \mid y_1, y_2, \dots, y_K, x) = \prod_{k=1}^{K}\frac{\exp\!\Bigl(\beta\,\log\frac{\pi^{\ast}(y_{\tau_k}\mid x)}{\pi_{\text{ref}}(y_{\tau_k}\mid x)}\Bigr)}{\sum_{j=k}^{K}\exp\!\Bigl(\beta\,\log\frac{\pi^{\ast}(y_{\tau_j}\mid x)}{\pi_{\text{ref}}(y_{\tau_j}\mid x)}\Bigr)}.\]Finally, for the entire dataset, we desire that the average probability of the true ranking $\tau$ is as high as possible—in other words, our goal is to maximize the probability of the true ranking over the dataset. Thus, for the multi-response case, the DPO objective function can be written as:
\[\mathcal{L}_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) =\;-\;\mathbb{E}_{\tau,\,y_1,\,y_2,\,\dots,\,y_K,\,x \sim \mathcal{D}}\!\left[\log \prod_{k=1}^{K}\frac{\exp\!\Bigl(\beta\,\log\frac{\pi_\theta(y_{\tau_k}\mid x)}{\pi_{\text{ref}}(y_{\tau_k}\mid x)}\Bigr)}{\sum_{j=k}^{K}\exp\!\Bigl(\beta\,\log\frac{\pi_\theta(y_{\tau_j}\mid x)}{\pi_{\text{ref}}(y_{\tau_j}\mid x)}\Bigr)}\right].\]
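Before turning to DPO's limitations, here is a minimal sketch of the pairwise ($K=2$) DPO loss derived earlier; the inputs are assumed to be per-sequence log-probabilities (summed over tokens) under the trainable policy and the frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # log [pi_theta / pi_ref] for the chosen and rejected responses
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigma( beta * (chosen_logratio - rejected_logratio) ), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()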
5.3 Limitations of DPO
Having derived the mathematical foundations of DPO, we now turn to its limitations. Much like in chess, where merely studying game records without playing does not guarantee that you will play well, DPO aims to train the model to evaluate responses using the reward model rather than directly replicating PPO. This means that during training, the data and loss function are completely aligned with the reward model.
This raises a critical question: Does the model have the capacity to improve both its “evaluation” and “generation” abilities simultaneously? In other words, DPO’s training process focuses solely on teaching the model how to “score” responses, akin to learning how to assess moves from game records. However, this does not necessarily ensure that the model will make optimal decisions during actual generation—it does not prove that evaluation ability will seamlessly translate into effective generative performance. If this assumption fails, then the DPO training process loses its purpose. Just as one does not expect to become a chess master solely by reading game records, the validity of this assumption also directly impacts the rationale behind other methods like SPIN and self-reward.
Moreover, because the DPO optimization objective depends exclusively on the reward model's scoring, it only cares about whether the scores change relative to the reference model in the expected direction, not about whether the generated sentences are fluent or appealing. In other words, DPO focuses on enlarging the loss margin rather than on ensuring that the model generates high-quality outputs. This can lead to an awkward phenomenon during DPO training: the losses on both the preferred and the rejected responses may rise at the same time (equivalently, the model's log-likelihood of both responses can drop, as long as the margin between them keeps growing), forcing us to adjust hyperparameters or add extra constraints to stabilize training.
From another perspective, the limitations of DPO can be summarized as follows:
- Disconnection Between Evaluation and Generation: DPO's training process teaches the model only to "evaluate", acting like a static chess-move scoring system, without the online generation process necessary for actual play. In contrast, PPO learns through online generation and trial-and-error, transforming evaluation ability into generative ability. Without this online exploration, a model trained with DPO might score well on offline data yet perform poorly during real generation.
- Limitations of Offline Training: RLHF is fundamentally an online learning method, as it requires continuous correction of the model's existing knowledge, much like a chess player must practice regularly to hone their skills. DPO, however, is entirely offline; it forces the model to rely solely on what annotators deem "correct" (e.g., the best moves from game records) and to follow a predetermined optimal path, leaving little room for exploration. In practice, techniques such as an initial supervised fine-tuning (SFT) pass on the preferred responses, or augmenting the preference data with the model's own diverse outputs, are often used to reintroduce some element of online learning and exploration.
- High Data Quality Requirements: Since DPO training is entirely dependent on offline preference data, its effectiveness is highly sensitive to the quality and coverage of this data. If the training data is not comprehensive or does not match the actual generative distribution, the model may keep the correct relative ordering of positive over negative examples while the absolute probabilities of both are diluted, and it may even produce bizarre outputs that appear nowhere in the training data. For example, in a Q&A scenario, if the positive sample is "Pasta should be mixed with tomato meat sauce" and the negative sample is "Pasta should be mixed with chili oil," the model after DPO optimization might output "Pasta should be mixed with grade-42 concrete." Such deviations underscore the critical importance of data quality.
In summary, the limitations of DPO stem from its exclusive focus on transferring the reward model’s “scoring” capability to the aligned model, without incorporating actual online generation and exploration. This is analogous to learning chess moves solely by studying game records without playing; consequently, even if you can theoretically evaluate every move accurately, it does not guarantee that you will make the best decisions during live play. Thus, although DPO can directly train an aligned model in one step, its performance often falls short compared to a complete RLHF system (i.e., Reward Model + PPO) if it lacks the complement of online generation and exploration.