Re-understanding KL Approximation from an RL-for-LLM Lens: Notes on “Approximating KL Divergence”
What’s the difference between the KL-divergence estimators used in PPO and GRPO, and why?
John Schulman’s blog post “Approximating KL Divergence” talks about how to approximate KL divergence via sampling (Monte Carlo), and it introduces three estimators ($k_1$, $k_2$, $k_3$) along with their bias–variance behaviors. But the original post is framed in the context of general probability distributions; it doesn’t touch the reinforcement-learning-for-LLM training setting. This write-up records the questions I had while reading, the thoughts I formed after mapping things to RL for LLMs, and a few places where I felt the original explanations could be pushed on a bit.
What “Approximating KL Divergence” Says (in my own words)
This section is written for readers who haven’t read the original post yet, so let’s quickly run through its most important points. Put simply, the post is about how to build reasonable Monte Carlo estimators for KL divergence when we can’t compute it directly.
\[\mathrm{KL}(q, p) = \sum_x q(x)\,\log\frac{q(x)}{p(x)} = \mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right].\]As the formula shows: when estimating the KL between two (complicated) distributions, a trick you often see in code is to approximate KL by a sample average, with samples drawn from $q$, rather than trying to evaluate the full expectation exactly. The post then focuses on a particular move: use the sample average of $\tfrac{1}{2}(\log r)^2$ in place of the more “standard” $\log\frac{q(x)}{p(x)} = -\log r$, where $r=\frac{p(x)}{q(x)}$. The write-up explains why this expression is a good (albeit biased) estimator of KL, and how to make it unbiased while keeping its low variance.
How we can compute KL depends on how we can access $p$ and $q$. Here we assume we can evaluate $p(x)$ and $q(x)$ (probabilities or densities) for any $x$, but we can’t analytically sum or integrate over $x$. Why might the analytic sum/integral be out of reach? Maybe the exact computation costs too much compute or memory, maybe there’s no closed form, or maybe we only store log-probs rather than full distributions to keep the code simple, which is a perfectly reasonable choice when KL is only used as a diagnostic (as is often the case in RL). The most common strategy for approximating sums or integrals is Monte Carlo. Given samples $x_1, x_2, \dots, x_n \sim q$, how do we build a good estimator?
A good estimator should be unbiased (right mean) and low-variance. We know one unbiased estimator (the one PPO uses):
\[k_1 = \log\frac{q(x)}{p(x)}.\]But it has high variance: by definition KL is a nonnegative quantity, yet for the estimator above, roughly “half” the sample values can be negative (if we assume no prior structure on $p$ and $q$), and this swings the average around a lot, hence high variance. For notational convenience, set $r = \frac{p(x)}{q(x)}$, so that $k_1 = -\log r$. Then the original KL can be written as
\[\mathrm{KL}[q, p] \;=\; \mathbb{E}_{x\sim q}\,[-\log r].\]To reduce variance, we can design an alternative estimator:
\[k_2 = \frac{1}{2}(\log r)^2.\]It has lower variance, but it’s biased. Intuitively, $k_2$ feels nicer because each sample gives a nonnegative “distance” between $p$ and $q$, so it never goes negative. Empirically, $k_2$ really does have much lower variance than $k_1$, and its bias can be quite small. As for why $k_2$ enjoys such a variance drop compared with $k_1$, the original post gives an analytic explanation via an $f$-divergence view; I won’t repeat that here.
Now, can we get an estimator that is both unbiased and low-variance? A general trick is to use a control variate: start from the unbiased $k_1$ and add something whose expectation is zero and is negatively correlated with it to reduce variance. A very convenient zero-mean quantity here is $r-1$. Thus, for any $\lambda$,
\[k \;=\; -\log r + \lambda\,(r-1)\]is still an unbiased KL estimator. In theory we could minimize the variance over $\lambda$, but the closed-form depends on $p$ and $q$ and isn’t easy to get. Notice, though, that since $\log(x)$ is concave,
\[\log(x) \;\le\; x-1,\]so if we pick $\lambda=1$, the expression is guaranteed nonnegative. Here, $r-1$ is the tangent line to $\log r$ at $r=1$, so with $\lambda=1$ we’re really measuring the vertical gap between $\log(x)$ and its tangent. This leads to the estimator \(k_3 \;=\; (r - 1) \;-\; \log r,\) which is always nonnegative. And $k_3$ is exactly where, in practice, GRPO differs from PPO in how KL is estimated (PPO uses $k_1$).
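To make the bias–variance trade-off concrete before moving to the RL setting, here is a minimal sketch in the spirit of the numerical experiment in the original post. The two Gaussians, the 0.1 mean offset, and the sample count are arbitrary choices of mine, not values fixed by the post.

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)
q = dist.Normal(loc=0.0, scale=1.0)          # sampling distribution
p = dist.Normal(loc=0.1, scale=1.0)          # the "other" distribution
true_kl = dist.kl_divergence(q, p).item()    # closed form for Gaussians

x = q.sample((500_000,))                     # x ~ q
log_r = p.log_prob(x) - q.log_prob(x)        # log r, with r = p(x)/q(x)
r = log_r.exp()

k1 = -log_r                                  # unbiased, high variance
k2 = 0.5 * log_r ** 2                        # biased, low variance
k3 = (r - 1) - log_r                         # unbiased, low variance

print(f"true KL = {true_kl:.5f}")
for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(f"{name}: mean={k.mean().item():.5f}  std={k.std().item():.5f}")
```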
Discussing KL Estimation from an RL-for-LLM Perspective
In RL (think PPO, GRPO, etc.), we often tack a KL divergence term onto the loss to keep the new policy from drifting too far from the old one. Here, $q$ is the old policy distribution ($\pi_{\text{old}}$), $p$ is the new policy distribution ($\pi_{\text{new}}$), and $x$ is a complete action sample (in an LLM this means a token or a token sequence). We usually use $s$ to denote the state (in an LLM, that’s the prompt or context), and $x$ is a specific token generated in that context. When we compute KL, what we’re really doing is taking the KL over the action distribution given a state, and then averaging over states:
\[\mathrm{KL}[p, q] = \mathbb{E}_{s} \left[ \sum_x p(x|s) \log \frac{p(x|s)}{q(x|s)} \right].\]At sampling time, we typically fix a prompt (state) and then estimate this KL for that prompt.
So why can’t we just compute KL exactly instead of estimating it? The reasons are exactly those listed in the original blog post; in RL for LLMs, the main culprit is Reason #1: the action space (token space) is too large to sum/integrate over all possible $x$. For example, if a tokenizer has 50,000 vocabulary entries, even computing the KL for a single token means summing over 50,000 actions; and in RL we’re usually doing multi-step (sequence) generation, so the space blows up exponentially, which is completely impractical. There’s also a pragmatic reason: during training we generally don’t store the full distribution (all token probabilities); we only keep the log-probs of the tokens actually generated along the trajectory, to save GPU memory and I/O. So we have to use Monte Carlo sampling: draw $x$ from some distribution (usually $q$, the old policy), and use those samples to approximate KL. And that drops us squarely into the territory the blog post is about.
In that post, the estimator we keep talking about is really just a function of a sample: it takes $p(x)$ and $q(x)$ for some sampled $x$ (or their ratio $r = \frac{p(x)}{q(x)}$) and spits out a number. We then take the average of those numbers over our samples to approximate KL. For example:
- $k_1(x) = -\log r$
- $k_2(x) = \frac12 (\log r)^2$
- $k_3(x) = (r - 1) - \log r$
These $k_i$ are just different KL-estimator formulas. They all approximate KL by averaging over samples, but differ in bias and variance. Once we pick an estimator, we’re really just committing to a specific formula for approximating KL. The process looks like this:
- Sampling: sample a batch of tokens (or sequences) $x_1, x_2, \dots, x_N$ from the old policy $q$.
- Compute log-probs: for each sample, compute the log-probabilities under both the new and old policies, $\log p(x_i)$ and $\log q(x_i)$, and form $\log r_i = \log p(x_i) - \log q(x_i)$ (or $r_i = \frac{p(x_i)}{q(x_i)}$).
- Plug into the estimator formula: for example, if we choose $k_3$, compute $k_3(x_i) = (r_i - 1) - \log r_i$ for each sample.
- Average: $\widehat{\mathrm{KL}} = \frac{1}{N}\sum_{i=1}^{N} k_3(x_i)$. That’s the approximate KL value, standing in for the true KL (see the code sketch right after this list).
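Here is what that recipe can look like in code, as a minimal sketch: it assumes we already have the per-token log-probs that the new and old policies assign to the sampled tokens (the tensor names `logp_new` / `logp_old` are placeholders of mine, not any particular library’s API), and it follows this post’s convention of samples from the old policy $q$ with $r = p/q$.

```python
import torch

def kl_k3_from_logprobs(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """k3-based Monte Carlo estimate of KL(old || new) from sampled-token log-probs.

    Both tensors have shape (batch, seq_len) and hold the log-probabilities
    assigned by the new / old policy to the tokens actually generated (x ~ old).
    """
    log_r = logp_new - logp_old              # log r, r = p_new(x) / p_old(x)
    k3 = torch.exp(log_r) - log_r - 1.0      # (r - 1) - log r, nonnegative per token
    return k3.mean()                         # average over tokens = the KL estimate

# Hypothetical usage with log-probs saved during rollout:
# kl_hat = kl_k3_from_logprobs(logp_new, logp_old)
```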
If we compare this with computing the true KL exactly (no estimation) for a discrete distribution (an LLM single-token step), we’d need to iterate over every possible token $x$:
\[\mathrm{KL}(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]You can see immediately that with an estimator, the computational load is much smaller than doing the full sum, especially in high-dimensional action spaces.
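For contrast, here is a toy sketch of both routes for a single context: the exact sum over a full vocabulary (possible only if you still have all the logits) versus the Monte Carlo estimate from a handful of sampled tokens. I stick with the $\mathrm{KL}(q\|p)$ direction and samples from $q$, matching the estimator convention above; the vocabulary size, random logits, and sample count are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 50_000

# Toy logits for one prompt; in practice these would come from the policy heads.
logits_old = torch.randn(vocab_size)
logits_new = logits_old + 0.01 * torch.randn(vocab_size)

logq = F.log_softmax(logits_old, dim=-1)     # old policy q(x|s)
logp = F.log_softmax(logits_new, dim=-1)     # new policy p(x|s)

# Exact KL(q || p): sum over every token in the vocabulary.
exact_kl = torch.sum(logq.exp() * (logq - logp))

# Monte Carlo: draw a few tokens from q and average the k3 estimator over them.
x = torch.multinomial(logq.exp(), num_samples=128, replacement=True)
log_r = logp[x] - logq[x]                    # log r for the sampled tokens only
k3_est = (torch.exp(log_r) - log_r - 1.0).mean()

print(f"exact KL = {exact_kl.item():.6f}, k3 estimate = {k3_est.item():.6f}")
```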
Talking About Variance in Different KL Estimators
Important to note: the “variance” we’re talking about here is the variance of the values the estimator outputs over samples:
\[\mathrm{Var}_{x \sim q}[k(x)]\]That is, how much $k(x)$ fluctuates across the sample space. An unbiased estimator means that with infinitely many samples, its mean equals the true KL. But with a high-variance estimator, even if the mean is right (unbiased), the average over a small number of samples can be way off. In RL for LLMs, the KL term usually enters the loss as a regularizer weighted by a coefficient (e.g., $\beta \cdot \mathrm{KL}$). If the KL estimator’s variance is large, the loss gets noisy, which in turn makes gradients noisy and training unstable.
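As a rough (hypothetical) illustration of where that noise ends up, this is approximately how a per-token KL estimate gets folded into the loss; the names and the bare `pg_loss` placeholder are mine, not any specific library’s API.

```python
import torch

beta = 0.04  # KL coefficient (an illustrative value, not a recommendation)

def loss_with_kl_penalty(pg_loss: torch.Tensor,
                         logp_new: torch.Tensor,
                         logp_old: torch.Tensor) -> torch.Tensor:
    """Policy loss plus beta * (k3 estimate of KL); a sketch, not a full PPO/GRPO loss."""
    log_r = logp_new - logp_old
    kl_hat = (torch.exp(log_r) - log_r - 1.0).mean()   # noisy Monte Carlo estimate
    # Whatever variance kl_hat has flows directly into the loss and its gradients.
    return pg_loss + beta * kl_hat
```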
In the original post, to give readers an intuition for why $k_1$ is not low-variance, the author writes:
However, it ($k_1$) has high-variance, as it’s negative for half of the samples, whereas KL is always positive.
The author points out that although $k_1$ is unbiased, without prior constraints on $p$ and $q$, roughly half the samples will have $q(x) > p(x)$ and the other half the reverse, so roughly half of the $k_1$ values are positive and half negative. Up to here I’m fine. But then the author says: because KL is always greater than 0 (a basic inequality), $k_1$ must therefore have high variance. And here I think the causal link doesn’t actually hold: you can’t use the sign of the expectation to dictate the sign of individual samples. A quick counterexample: in the exact computation of the expectation, the summand $p(x) \log \frac{p(x)}{q(x)}$ is also sometimes positive and sometimes negative, and that by itself tells you nothing about variance. In reality, a single-sample log ratio (whether $\log \frac{q(x)}{p(x)}$ or $\log \frac{p(x)}{q(x)}$) can be positive or negative, just like $k_1$, so sign-flipping alone is not the sole reason for high variance.
From the KL definition:
\[\mathrm{KL}(q \| p) = \mathbb{E}_{x\sim q}\left[ \log \frac{q(x)}{p(x)} \right]\]The expectation is guaranteed nonnegative, but the integrand $\log\frac{q(x)}{p(x)}$ can be positive or negative for individual samples. And $k_1$ is exactly that integrand:
\[k_1(x) = \log \frac{q(x)}{p(x)}\]So each sample value can indeed be positive or negative, same as the integrand in the KL definition.
So why does $k_1$ have high variance?
It’s not the mere “sign flipping.” The real reason is that $k_1$’s value distribution is often wide (heavy-tailed). For example, if $p(x)$ is tiny for some sample, then $\log\frac{q}{p}$ can be huge (positive or negative). These extreme values dominate the finite-sample average, pushing variance up. In other words, it’s the combination of extreme values + positive/negative cancellation: cancellation means you need more samples to converge to the true mean, and extreme values make the sample variance itself larger. So the “half negative” comment in the blog is more of an intuition hook than a complete explanation.
From this perspective, if we look at the other estimators $k_2$ and $k_3$, we see: $k_2 = \frac12 (\log r)^2$ is always positive, so there’s no cancellation, but this introduces bias; squaring also smooths the magnitude, reducing variance. $k_3$ uses a control variate to knock out part of the fluctuation source, lowering variance while keeping unbiasedness (details next).
In PPO/GRPO, if you use $k_1$ and the batch is small or the distributions are far apart, the KL estimate will jump around, because a few extreme samples can swing the mean hard. That makes the effective strength of the KL penalty unstable: it can suddenly be far too strong or far too weak. Switching to a lower-variance estimator ($k_2$ or $k_3$) makes each sample’s contribution to the KL steadier, so the estimate is less likely to be dominated by a handful of extreme samples.
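To see this “jumpy small-batch estimate” effect numerically, here is a toy sketch (the Gaussians, batch size, and number of batches are arbitrary choices of mine): it repeatedly draws small batches and checks how much the batch-mean estimate moves, and how often $k_1$’s batch mean even comes out negative.

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)
q = dist.Normal(0.0, 1.0)                    # "old" policy (sampling distribution)
p = dist.Normal(0.1, 1.0)                    # "new" policy, close to q
true_kl = dist.kl_divergence(q, p).item()    # ~0.005, small as in typical RL steps

k1_means, k3_means = [], []
for _ in range(2_000):                       # many small batches
    x = q.sample((64,))
    log_r = p.log_prob(x) - q.log_prob(x)    # log r, with r = p/q
    k1_means.append((-log_r).mean())
    k3_means.append((log_r.exp() - log_r - 1).mean())

k1_means, k3_means = torch.stack(k1_means), torch.stack(k3_means)
print(f"true KL = {true_kl:.4f}")
print(f"k1 batch means: std={k1_means.std().item():.4f}, "
      f"fraction negative={(k1_means < 0).float().mean().item():.2f}")
print(f"k3 batch means: std={k3_means.std().item():.4f}, "
      f"fraction negative={(k3_means < 0).float().mean().item():.2f}")
```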
Why can $k_3$ be unbiased and low-variance?
At first glance, $k_3$ is always positive, so you might think its mean must be larger than $k_1$’s.
But remember: $k_3$ is derived from $k_1$ via a control variate. The blog’s reasoning goes like this: start from the unbiased $k_1$ and add a zero-mean term scaled by $\lambda$,
\[\tilde{k}(x) = -\log r + \lambda\, h(x),\]
where $h(x) = r - 1$, and under $x\sim q$ its expectation is:
\[\mathbb{E}_{x\sim q}[h(x)] = \mathbb{E}_q\left[\frac{p(x)}{q(x)} - 1\right] = \sum_x p(x) - 1 = 1 - 1 = 0.\]So adding any multiple of $h(x)$ doesn’t change the expectation. When $\lambda = 1$:
\[\tilde{k}(x) = -\log r + (r - 1) = (r - 1) - \log r = k_3(x).\]This explains why $k_3$’s expectation equals $k_1$’s expectation, and equals the KL, making it an unbiased estimator.
The reason $k_3$ has lower variance than $k_1$ is this: $k_1$ consists only of $-\log r$, which can swing wildly (both positive and negative, with occasional huge values). But $r - 1$ and $-\log r$ are strongly negatively correlated: when $r$ grows, $r - 1$ grows while $-\log r$ shrinks. Adding $(r - 1)$ therefore injects a negatively correlated term that cancels much of the fluctuation. After the cancellation, what’s left in $k_3$ is tighter in range and always nonnegative, and hence has lower sample variance.
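A quick numerical check of this correlation argument, again with toy Gaussians of my own choosing:

```python
import torch
import torch.distributions as dist

torch.manual_seed(0)
q = dist.Normal(0.0, 1.0)
p = dist.Normal(0.5, 1.0)

x = q.sample((200_000,))
log_r = p.log_prob(x) - q.log_prob(x)
r = log_r.exp()

k1 = -log_r
cv = r - 1                                   # control variate, zero-mean under x ~ q
k3 = k1 + cv                                 # = (r - 1) - log r

corr = torch.corrcoef(torch.stack([k1, cv]))[0, 1]
print("corr(k1, r - 1):", round(corr.item(), 3))       # strongly negative
print("mean of r - 1  :", round(cv.mean().item(), 4))  # ~0, expectation unchanged
print("Var(k1)        :", round(k1.var().item(), 4))
print("Var(k3)        :", round(k3.var().item(), 4))   # noticeably smaller
```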