Gumbel Softmax Reparameterization and the SF Estimator (Score Function Estimator; Reparameterization in VAE/GAN/Policy Gradient)

article 2025/2/23 17:49:39

Gumbel Softmax

Consider the Gumbel-Softmax distribution with class probabilities $\pi_1, \ldots, \pi_k$ and temperature $\tau$. We first define the logits $x_i = \log \pi_i$ and the Gumbel samples $g_1, \ldots, g_k$, where $g_i \sim \text{Gumbel}(0, 1)$ (to draw from $\text{Gumbel}(0, 1)$, first sample $u_i$ from the uniform distribution $\text{U}(0,1)$, then set $g_i = -\log(-\log(u_i))$). A sample from the Gumbel-Softmax can then be computed as:

$\begin{equation} y_i = \frac{\exp((x_i + g_i)/\tau)}{\sum_{j=1}^{k} \exp((x_j + g_j)/\tau)} \quad \text{for } i = 1, \ldots, k \end{equation}$

Gumbel Code

import numpy as np

def sample_gumbel(shape, eps=1e-20):
    """
    Sample from the Gumbel(0, 1) distribution.
    :param shape: shape of the samples to draw
    :param eps: small constant that guards against log(0)
    :return: samples drawn from Gumbel(0, 1)
    """
    U = np.random.uniform(0, 1, shape)
    return -np.log(-np.log(U + eps) + eps)

# Example: draw a single scalar
sample = sample_gumbel(())
print("Sample from Gumbel(0,1):", sample)

# Example: draw an array of shape (3, 3)
samples = sample_gumbel((3, 3))
print("Samples from Gumbel(0,1):\n", samples)
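
The sampling formula for $y_i$ can be turned into code in the same way. The following is a minimal NumPy sketch, not a reference implementation: the function name `gumbel_softmax_sample`, the `rng` argument, and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None, eps=1e-20):
    """Draw one Gumbel-Softmax sample y from logits x_i = log(pi_i)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    u = rng.uniform(0.0, 1.0, size=logits.shape)
    g = -np.log(-np.log(u + eps) + eps)        # g_i ~ Gumbel(0, 1)
    z = (logits + g) / tau                     # perturbed logits scaled by temperature
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # softmax over the last axis

pi = np.array([0.1, 0.6, 0.3])                 # illustrative class probabilities
rng = np.random.default_rng(0)
y_soft = gumbel_softmax_sample(np.log(pi), tau=1.0, rng=rng)
y_sharp = gumbel_softmax_sample(np.log(pi), tau=0.05, rng=rng)
print("tau=1.0 :", y_soft, "sum =", y_soft.sum())
print("tau=0.05:", y_sharp)
```

Each sample lies on the probability simplex. A smaller $\tau$ tends to push the sample toward a one-hot vector, while a larger $\tau$ spreads the mass more evenly.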

Why Gumbel Softmax Matters

Put simply, Gumbel Max is the observation that:
$\begin{equation} P[\arg\max(x + \epsilon) = i] = \text{softmax}(x)_i, \quad \epsilon \sim \text{Gumbel Noise} \end{equation}$

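How should we read this result? First, $\epsilon \sim \text{Gumbel Noise}$ means that every component of $\epsilon$ is sampled independently from the same $\textcolor{green}{\text{Gumbel distribution}}$. Next, for a given vector $x$, $\arg\max(x)$ is a deterministic result, but once random noise $\epsilon$ is added, $\arg\max(x + \epsilon)$ becomes random as well, so each index $i$ occurs with some probability. Finally, Gumbel Max tells us that if the added noise is Gumbel noise, then the probability of index $i$ is exactly $\text{softmax}(x)_i$.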
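
The Gumbel Max identity can be checked empirically. The sketch below is an illustration under assumed values: the logits `x` and the sample count `n` are arbitrary, and NumPy's built-in `Generator.gumbel` is used to draw the noise directly rather than via the uniform transform.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 0.5, -1.0])       # arbitrary logits (illustrative values)
n = 200_000                               # number of Monte Carlo draws

# One row of i.i.d. Gumbel(0, 1) noise per draw
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n, x.size))

# Index of the largest perturbed logit in each draw
idx = np.argmax(x + eps, axis=1)
freq = np.bincount(idx, minlength=x.size) / n

# Reference probabilities softmax(x)
softmax_x = np.exp(x - x.max())
softmax_x /= softmax_x.sum()

print("empirical frequencies:", freq)
print("softmax(x):           ", softmax_x)
```

With this many draws, the empirical frequencies should agree with $\text{softmax}(x)$ up to Monte Carlo error.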

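The most direct use of Gumbel Max is as a way to sample from $\text{softmax}(x)$, although if sampling were all we needed there are simpler methods, and this would be using a sledgehammer to crack a nut. The real value of Gumbel Max is reparameterization: it moves the randomness of the problem from a discrete distribution that depends on the parameters $x$ to a noise variable $\epsilon$ that has no parameters. Since $\text{softmax}$ is a smooth approximation of (one-hot) $\arg\max$, $\text{softmax}(x + \epsilon)$ is a smooth approximation of Gumbel Max. This is Gumbel Softmax, a common technique for training models whose discrete sampling modules contain learnable parameters.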
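
Written out, the reparameterization looks roughly as follows. This is a sketch under assumptions: $f$ denotes a downstream loss that is defined and differentiable on the whole probability simplex (not only on one-hot vectors), $z$ is the one-hot encoding of the sampled category, and $\tau$ is the temperature from the sampling formula. None of this notation beyond $x$, $\epsilon$ and $\tau$ comes from the text above.

$\begin{equation} \mathbb{E}_{z \sim \text{softmax}(x)} \left[ f(z) \right] = \mathbb{E}_{\epsilon} \left[ f\big(\text{onehot}(\arg\max(x + \epsilon))\big) \right] \approx \mathbb{E}_{\epsilon} \left[ f\big(\text{softmax}((x + \epsilon)/\tau)\big) \right] \end{equation}$

The equality is Gumbel Max itself. In the approximate form on the right, the distribution of $\epsilon$ does not depend on $x$, so $\frac{\partial}{\partial x}$ can be moved inside the expectation and estimated from samples of $\epsilon$.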

Score Function Estimator (SF Estimator)

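The formula below gives an estimator of the gradient known as the "SF estimator", short for Score Function Estimator. It is the most naive estimator of the gradient of the original loss $\int p_\theta(z) f(z) dz$. In reinforcement learning, $z$ is the action sampled from the policy $p_\theta$, and this is exactly the formula used in Policy Gradient, so the estimator is sometimes simply called REINFORCE. Note that repeating the derivation for a discrete $z$ (with the integral replaced by a sum) gives the same result, so the formula is general and does not depend on whether $z$ is continuous or discrete. We can therefore draw a number of samples directly from $p_\theta(z)$ to estimate its value, without worrying about losing the gradient, because the formula is itself the gradient: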

$\begin{equation} \begin{aligned} \frac{\partial}{\partial \theta} \int p_\theta(z) f(z) dz &= \int f(z) \frac{\partial}{\partial \theta} p_\theta(z) dz \\ &= \int p_\theta(z) \frac{f(z)}{p_\theta(z)} \frac{\partial}{\partial \theta} p_\theta(z) dz \\ &= \int p_\theta(z) f(z) \frac{\partial}{\partial \theta} \log p_\theta(z) dz \\ &= \mathbb{E}_{p_\theta(z)} \left[ f(z) \frac{\partial}{\partial \theta} \log p_\theta(z) \right] \end{aligned} \end{equation}$
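
As a sanity check of the SF estimator in the discrete case, the sketch below compares a Monte Carlo estimate with the exact gradient. It assumes a small categorical distribution $p_\theta = \text{softmax}(\theta)$; the values of `theta` and `f` and the helper name `softmax` are made up for illustration.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])    # logits of p_theta (illustrative values)
f = np.array([1.0, 3.0, -2.0])        # f(z) for z = 0, 1, 2 (illustrative values)
p = softmax(theta)

# Exact gradient of sum_z p_z f_z with respect to theta:
# d p_z / d theta_j = p_z (1[z = j] - p_j), hence grad_j = p_j (f_j - sum_z p_z f_z)
exact = p * (f - p @ f)

# SF estimate: average f(z) * d/dtheta log p_theta(z) over samples z ~ p_theta
n = 200_000
z = rng.choice(len(p), size=n, p=p)
score = np.eye(len(p))[z] - p         # d/dtheta log p_theta(z) = onehot(z) - p
sf_estimate = (f[z][:, None] * score).mean(axis=0)

print("exact gradient:", exact)
print("SF estimate:   ", sf_estimate)
```

The two vectors should agree up to sampling noise. The estimator needs only samples of $z$ and the score $\frac{\partial}{\partial \theta} \log p_\theta(z)$, never a gradient of $f$, which is why it applies to discrete $z$ as well.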