
Rotary Position Embedding (RoPE): A Detailed Derivation of the Formulas

article 2025/2/10 6:45:33

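RoPE is a refinement of Sinusoidal position encoding, so this article is easier to follow if you first read the derivation of the Sinusoidal position encoding formulas.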


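Reference

Transformer升级之路：1、Sinusoidal位置编码追根溯源 (The Road to Upgrading the Transformer: 1. Tracing Sinusoidal Position Encoding Back to Its Roots)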

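1. Mathematical Prerequisites

1.1 Exponential form of a complex number

By Euler's formula, a complex number $z$ can be written as:
$z = r e^{i\theta} = r(\cos(\theta) + i \sin(\theta))$
where $r$ is the modulus of $z$, $\theta$ is its argument, and $i$ is the imaginary unit.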

1.2 Complex multiplication

For two complex numbers $z_1 = r_1 e^{i\theta_1}$ and $z_2 = r_2 e^{i\theta_2}$, the product is:
$z_1 z_2 = r_1 r_2 e^{i(\theta_1 + \theta_2)}$
That is, the moduli multiply and the arguments add.

1.3 Complex conjugate

The conjugate of a complex number $z = r e^{i\theta}$ is:
$z^* = r e^{-i\theta}$
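The multiplication and conjugation rules above can be checked numerically. The following is a minimal sketch using Python's standard `cmath` module; the specific moduli and angles are arbitrary values chosen for illustration.

```python
import cmath
import math

# Two complex numbers given in polar form r * e^{i*theta}.
z1 = cmath.rect(2.0, 0.3)   # r1 = 2.0, theta1 = 0.3
z2 = cmath.rect(1.5, 0.9)   # r2 = 1.5, theta2 = 0.9

# Multiplication: the moduli multiply and the arguments add.
r, theta = cmath.polar(z1 * z2)
assert math.isclose(r, 2.0 * 1.5)
assert math.isclose(theta, 0.3 + 0.9)

# Conjugation: the modulus is kept and the argument is negated.
r_conj, theta_conj = cmath.polar(z1.conjugate())
assert math.isclose(r_conj, 2.0)
assert math.isclose(theta_conj, -0.3)
```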

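1.4 Rotation matrix

In two dimensions, the rotation matrix $R(\theta)$ rotates a vector counterclockwise about the origin by the angle $\theta$:
$R(\theta) = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$

$R(\theta)$ can be derived by viewing the two-dimensional vector $\begin{pmatrix} x \\ y \end{pmatrix}$ as the complex number $x + yi$ and rotating it by the angle $\theta$ in the complex plane, that is, by multiplying it by $e^{i\theta}$:
$\begin{aligned} z &= (\cos(\theta) + i \sin(\theta))(x + y i) \\ &= \cos(\theta)x + \cos(\theta)y i + \sin(\theta)x i - \sin(\theta)y \\ &= (\cos(\theta) + \sin(\theta)i)x + (-\sin(\theta) + \cos(\theta)i)y \\ &= (1, i) \begin{pmatrix} \cos(\theta)x - \sin(\theta)y \\ \sin(\theta)x + \cos(\theta)y \end{pmatrix} \\ &= (1, i) \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \end{aligned}$
The real and imaginary parts of $z$ are the coordinates of the rotated vector, which is exactly $R(\theta)$ applied to $\begin{pmatrix} x \\ y \end{pmatrix}$.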
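As a quick sanity check of this derivation, the sketch below rotates a point once by complex multiplication and once by the rotation matrix, and compares the results. The function names `rotate_complex` and `rotate_matrix` are illustrative, not from any library.

```python
import cmath
import math

def rotate_complex(x, y, theta):
    """Rotate (x, y) by theta via multiplication with e^{i*theta}."""
    z = cmath.exp(1j * theta) * complex(x, y)
    return z.real, z.imag

def rotate_matrix(x, y, theta):
    """Rotate (x, y) by theta via the 2x2 rotation matrix R(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

a = rotate_complex(1.0, 2.0, 0.7)
b = rotate_matrix(1.0, 2.0, 0.7)
# Both routes give the same rotated coordinates (up to rounding).
assert all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b))
```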

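1.5 The rotation matrix satisfies $R_{\theta}^{\top} = R_{-\theta}$

The rotation matrix $\mathbf{R}_\theta$ is orthogonal and satisfies $\mathbf{R}_\theta^\top = \mathbf{R}_{-\theta}$. The derivation is as follows.

Definition of an orthogonal matrix:
A matrix $\mathbf{Q}$ is orthogonal if its transpose equals its inverse:
$\mathbf{Q}^\top = \mathbf{Q}^{-1}$
Equivalently, $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix.
Form of the rotation matrix:
In two dimensions, the rotation matrix $\mathbf{R}_\theta$ rotates a vector counterclockwise by the angle $\theta$ and has the form:
$\mathbf{R}_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$
Transpose of the rotation matrix:
$\mathbf{R}_\theta^\top = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$
Rotation matrix for the angle $-\theta$:
Substituting $-\theta$ into the definition of the rotation matrix gives:
$\mathbf{R}_{-\theta} = \begin{bmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{bmatrix}$
Since $\cos(-\theta) = \cos\theta$ and $\sin(-\theta) = -\sin\theta$, this becomes:
$\mathbf{R}_{-\theta} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$
Comparing the two:
$\mathbf{R}_\theta^\top = \mathbf{R}_{-\theta}$
Rotating by $\theta$ and then by $-\theta$ returns every vector to where it started, so $\mathbf{R}_{-\theta} \mathbf{R}_\theta = \mathbf{I}$ and $\mathbf{R}_{-\theta} = \mathbf{R}_\theta^{-1}$. The transpose of a rotation matrix therefore equals its inverse, which means the rotation matrix is orthogonal.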
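The two facts used here, $\mathbf{R}_\theta^\top = \mathbf{R}_{-\theta}$ and $\mathbf{R}_\theta^\top \mathbf{R}_\theta = \mathbf{I}$, can be confirmed numerically. This is a small sketch with plain Python lists; the helper names (`rotation`, `transpose`, `matmul`, `close`) are made up for this example.

```python
import math

def rotation(theta):
    """2x2 counterclockwise rotation matrix as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def close(a, b, tol=1e-12):
    """Entry-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

theta = 0.4
R = rotation(theta)
# The transpose equals the rotation by -theta ...
assert close(transpose(R), rotation(-theta))
# ... and multiplying it with R gives the identity, so R is orthogonal.
assert close(matmul(transpose(R), R), [[1.0, 0.0], [0.0, 1.0]])
```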

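2. Derivation

2.1 Goal

We want rotary position embedding (RoPE) to express relative position information through an absolute position encoding. The core idea is to rotate each vector by an angle that depends on its own position, so that the inner product of two encoded vectors depends only on their relative position.

2.2 Assumptions

To make the position encoding relative, suppose we have the following operations:

$f (q, m)$ and $f (k, n)$ apply the position encoding to the query vector $q$ and the key vector $k$, so that they carry the information of positions $m$ and $n$ respectively.
Suppose they satisfy the inner-product identity:
$\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$
where $g (q, k, m - n)$ is a function that depends on the positions only through the difference $m - n$.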

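2.3 Introducing complex numbers

Representing two-dimensional vectors as complex numbers simplifies the problem. For complex numbers, the inner product can be written as:
$\langle q, k \rangle = \mathrm{Re}(q k^*)$
where $k^*$ is the complex conjugate of $k$ and $\mathrm{Re}$ takes the real part.

The identity therefore becomes:
$\mathrm{Re}(f(q, m) f^*(k, n)) = g(q, k, m - n)$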
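The identity $\langle q, k \rangle = \mathrm{Re}(q k^*)$ used above is easy to verify on a concrete pair of vectors. In this sketch the two vectors are arbitrary illustrative values; each complex number stands for the 2D vector formed by its real and imaginary parts.

```python
q = complex(1.0, 2.0)    # stands for the vector (1, 2)
k = complex(3.0, -1.0)   # stands for the vector (3, -1)

# Ordinary dot product of the two 2D vectors.
dot = q.real * k.real + q.imag * k.imag

# The same quantity as the real part of q times the conjugate of k.
via_complex = (q * k.conjugate()).real

assert abs(dot - via_complex) < 1e-12
```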

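2.4 Using the exponential form

Next, we write $f (q, m)$, $f (k, n)$, and $g$ in exponential form. For simplicity, we assume the stronger condition that there exists a complex number $g(q, k, m - n)$ such that $f(q, m) f^*(k, n) = g(q, k, m - n)$, and set:
$f(q, m) = R_f(q, m) e^{i \Theta_f(q, m)}, \quad f(k, n) = R_f(k, n) e^{i \Theta_f(k, n)}, \quad g(q, k, m - n) = R_g(q, k, m - n) e^{i \Theta_g(q, k, m - n)}$

Substituting into this equation and matching moduli and arguments gives:
$R_f(q, m) R_f(k, n) = R_g(q, k, m - n), \quad \Theta_f(q, m) - \Theta_f(k, n) = \Theta_g(q, k, m - n)$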

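2.5 Solving for the modulus

We first solve the modulus equation, using the initial conditions $f(q, 0) = q$ and $f(k, 0) = k$ (position 0 leaves the vector unchanged). Setting $m = n$ gives:
$R_f(q, m) R_f(k, m) = R_g(q, k, 0) = R_f(q, 0) R_f(k, 0) = \| q \| \| k \|$
so we can simply set:
$R_f(q, m) = \| q \|, \quad R_f(k, n) = \| k \|$
that is, the moduli do not depend on $m$ or $n$.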

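2.6 Solving for the angle

Next, we solve the argument equation. Setting $m = n$ gives:
$\Theta_f(q, m) - \Theta_f(k, m) = \Theta_g(q, k, 0) = \Theta_f(q, 0) - \Theta_f(k, 0) = \Theta(q) - \Theta(k)$
where $\Theta(q)$ and $\Theta(k)$ are the arguments of $q$ and $k$ themselves. Rearranging gives:
$\Theta_f(q, m) - \Theta(q) = \Theta_f(k, m) - \Theta(k)$
The left side does not involve $k$ and the right side does not involve $q$, so $\Theta_f(q, m) - \Theta(q)$ is a function of $m$ only, independent of $q$. Denote it by $\varphi(m)$. Then $\Theta_f(q, m)$ can be written as:
$\Theta_f(q, m) = \Theta(q) + \varphi(m)$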

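2.7 Obtaining the solution

Next, setting $n = m - 1$ in the argument equation and substituting $\Theta_f(q, m) = \Theta(q) + \varphi(m)$ gives:
$\varphi(m) - \varphi(m - 1) = \Theta_g(q, k, 1) + \Theta(k) - \Theta(q)$
The right side does not depend on $m$, so $\{\varphi(m)\}$ is an arithmetic sequence. Denoting the right side by $\theta$ and using $\varphi(0) = 0$ (since $\Theta_f(q, 0) = \Theta(q)$), we get:
$\varphi(m) = m \theta$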

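2.8 Two-dimensional form

In two dimensions, RoPE can therefore be written as:
$f(q, m) = R_f(q, m) e^{i \Theta_f(q, m)} = \| q \| e^{i (\Theta(q) + m \theta)} = q e^{im \theta}$

By the geometric meaning of complex multiplication, this transformation is a rotation of the vector, which is why it is called "rotary position embedding" (RoPE). It can also be written in matrix form:

$f(q, m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \end{pmatrix}$

For the detailed derivation of this matrix form, see the derivation of the rotation matrix in Section 1.4 of the prerequisites.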
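To see that the solution $f(q, m) = q e^{im\theta}$ really yields a score that depends only on $m - n$, here is a minimal sketch in the complex form. The names `rope_2d` and `score`, and the numeric values, are illustrative assumptions rather than part of any library.

```python
import cmath
import math

def rope_2d(x, pos, theta):
    """Apply 2D RoPE to x (a complex number) at position pos."""
    return x * cmath.exp(1j * pos * theta)

def score(q, k, m, n, theta):
    """Inner product of the encoded q (position m) and k (position n)."""
    return (rope_2d(q, m, theta) * rope_2d(k, n, theta).conjugate()).real

q, k, theta = complex(0.5, -1.2), complex(2.0, 0.3), 0.1

# The same offset m - n = 3 at different absolute positions gives the same score.
assert math.isclose(score(q, k, 5, 2, theta), score(q, k, 105, 102, theta))
# The encoding leaves the modulus unchanged.
assert math.isclose(abs(rope_2d(q, 7, theta)), abs(q))
```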

2.8.1 Deriving the identities

Let the vectors $\mathbf{q}$ and $\mathbf{k}$ be at positions $m$ and $n$ respectively, and write $\mathbf{R}_m$ as shorthand for the rotation matrix with angle $m\theta$.

Derivation of $(\mathbf{R}_m \mathbf{q})^\top (\mathbf{R}_n \mathbf{k}) = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}$

Expand the inner product:
$(\mathbf{R}_m \mathbf{q})^\top (\mathbf{R}_n \mathbf{k}) = \mathbf{q}^\top \mathbf{R}_m^\top \mathbf{R}_n \mathbf{k}$
Use the transpose property of rotation matrices:
The rotation matrix $\mathbf{R}_\theta$ is orthogonal and satisfies $\mathbf{R}_\theta^\top = \mathbf{R}_{-\theta}$ (see Section 1.5 for the proof). Therefore:
$\mathbf{R}_m^\top = \mathbf{R}_{-m}$
Substitute this property:
$\mathbf{q}^\top \mathbf{R}_m^\top \mathbf{R}_n \mathbf{k} = \mathbf{q}^\top \mathbf{R}_{-m} \mathbf{R}_n \mathbf{k}$
Use the additivity of rotation matrices:
The product of two rotation matrices corresponds to adding their angles, so $\mathbf{R}_{-m} \mathbf{R}_n = \mathbf{R}_{n-m}$. Therefore:
$\mathbf{q}^\top \mathbf{R}_{-m} \mathbf{R}_n \mathbf{k} = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}$

The final result is:
$(\mathbf{R}_m \mathbf{q})^\top (\mathbf{R}_n \mathbf{k}) = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}$

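Explanation

Orthogonality of rotation matrices: the rotation matrix $\mathbf{R}_\theta$ is orthogonal, satisfying $\mathbf{R}_\theta^\top \mathbf{R}_\theta = \mathbf{I}$, and its inverse is the rotation by $-\theta$, so $\mathbf{R}_\theta^\top = \mathbf{R}_{-\theta}$.
Additivity of rotation matrices: the product of two rotation matrices corresponds to adding their angles, that is, $\mathbf{R}_\theta \mathbf{R}_\phi = \mathbf{R}_{\theta+\phi}$.

These two properties are what allow the product of rotation matrices to collapse into a single rotation matrix.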
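The identity $(\mathbf{R}_m \mathbf{q})^\top (\mathbf{R}_n \mathbf{k}) = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}$ can also be checked numerically. The sketch below assumes, as in Section 2.8, that $\mathbf{R}_m$ is the rotation by the angle $m\theta$; the helper names `rotate` and `dot` and the sample values are illustrative.

```python
import math

def rotate(v, angle):
    """Apply the 2x2 rotation matrix with the given angle to the pair v."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.25
q, k = (1.0, 2.0), (-0.5, 3.0)
m, n = 4, 9

lhs = dot(rotate(q, m * theta), rotate(k, n * theta))   # (R_m q)^T (R_n k)
rhs = dot(q, rotate(k, (n - m) * theta))                # q^T R_{n-m} k
assert math.isclose(lhs, rhs)
```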

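The inner product $\mathbf{R}_m \mathbf{q} \cdot \mathbf{R}_n \mathbf{k}$ in terms of angles

Use the geometric meaning of the inner product:

Rotation matrices preserve vector length, so $\|\mathbf{R}_m \mathbf{q}\| = \|\mathbf{q}\|$ and $\|\mathbf{R}_n \mathbf{k}\| = \|\mathbf{k}\|$.
Let $\alpha = \Theta(\mathbf{q}) - \Theta(\mathbf{k})$ be the angle from $\mathbf{k}$ to $\mathbf{q}$, so that $\mathbf{q} \cdot \mathbf{k} = \|\mathbf{q}\| \|\mathbf{k}\| \cos \alpha$. After rotation the arguments become $\Theta(\mathbf{q}) + m\theta$ and $\Theta(\mathbf{k}) + n\theta$, so the angle between the rotated vectors is $\alpha + (m - n)\theta$ and their inner product is:
$\mathbf{R}_m \mathbf{q} \cdot \mathbf{R}_n \mathbf{k} = \|\mathbf{R}_m \mathbf{q}\| \|\mathbf{R}_n \mathbf{k}\| \cos(\alpha + (m - n)\theta) = \|\mathbf{q}\| \|\mathbf{k}\| \cos(\alpha + (m - n)\theta)$
This equals $(\mathbf{q} \cdot \mathbf{k}) \cos((m - n)\theta)$ only in the special case where $\mathbf{q}$ and $\mathbf{k}$ point in the same direction ($\alpha = 0$, so $\mathbf{q} \cdot \mathbf{k} = \|\mathbf{q}\| \|\mathbf{k}\|$):
$\mathbf{R}_m \mathbf{q} \cdot \mathbf{R}_n \mathbf{k} = (\mathbf{q} \cdot \mathbf{k}) \cdot \cos((m - n)\theta)$
If, in addition, the two positions are adjacent ($|m - n| = 1$), this simplifies further to:
$\mathbf{R}_m \mathbf{q} \cdot \mathbf{R}_n \mathbf{k} = (\mathbf{q} \cdot \mathbf{k}) \cdot \cos(\theta)$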

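2.9 Comparison with the Sinusoidal formula

In complex form, the Sinusoidal position encoding is:
$p_m = x + y i$
or, written as a vector:
$p_m = \begin{pmatrix} x \\ y \end{pmatrix}$
Comparing this with the RoPE formula shows that RoPE adds a rotation: the vector is multiplied on the left by $\cos(m \theta) + i\sin(m \theta)$, which in matrix form is an extra rotation matrix on the left.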

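2.10 The multi-dimensional case

Because the inner product is additive over components, RoPE in any even dimension can be built by concatenating two-dimensional blocks, that is: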

$R_m = \begin{pmatrix} \cos(m\theta_0) & -\sin(m\theta_0) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m\theta_0) & \cos(m\theta_0) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_1) & -\sin(m\theta_1) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m\theta_1) & \cos(m\theta_1) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{\frac{d}{2} - 1}) & -\sin(m\theta_{\frac{d}{2} - 1}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m\theta_{\frac{d}{2} - 1}) & \cos(m\theta_{\frac{d}{2} - 1}) \\ \end{pmatrix}$

$R_m \begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}$

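In other words, multiply the vector $q$ at position $m$ by the matrix $R_m$ and the vector $k$ at position $n$ by the matrix $R_n$, then compute attention with the transformed $Q$ and $K$ sequences. The attention then contains relative position information automatically, because the following identity holds:

$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$

It is worth noting that $R_m$ is an orthogonal matrix and does not change the norm of a vector, so it generally does not affect the stability of the original model.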
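The block-diagonal construction can be illustrated with a small dense version of $R_m$. This is a sketch only: it assumes the frequencies $\theta_i = 10000^{-2i/d}$ given in this article, and the names `rope_matrix`, `matvec`, `dot`, and `score` are hypothetical helpers written for this example. A dense matrix is used purely for clarity, not for efficiency.

```python
import math

def rope_matrix(m, d, base=10000.0):
    """Dense d x d block-diagonal matrix R_m (d must be even)."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)        # m * theta_i
        c, s = math.cos(angle), math.sin(angle)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def matvec(R, v):
    return [sum(r * x for r, x in zip(row, v)) for row in R]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.1, -0.4, 1.3, 0.7, -2.0, 0.5]
k = [0.9, 0.2, -0.6, 1.1, 0.3, -0.8]
d = len(q)

def score(m, n):
    """Inner product of q encoded at position m and k encoded at position n."""
    return dot(matvec(rope_matrix(m, d), q), matvec(rope_matrix(n, d), k))

# Shifting both positions by the same amount leaves the score unchanged.
assert math.isclose(score(3, 7), score(53, 57), rel_tol=1e-9, abs_tol=1e-9)
# R_m is orthogonal, so it preserves the squared norm of q.
rq = matvec(rope_matrix(11, d), q)
assert math.isclose(dot(rq, rq), dot(q, q))
```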

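Because $R_m$ is sparse, implementing it as a direct matrix multiplication wastes computation. The recommended way to implement RoPE is the following element-wise form: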

$\begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{\frac{d}{2} - 1}) \\ \cos(m\theta_{\frac{d}{2} - 1}) \end{pmatrix} + \begin{pmatrix} -q_1 \\ q_0 \\ -q_3 \\ q_2 \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{\frac{d}{2} - 1}) \\ \sin(m\theta_{\frac{d}{2} - 1}) \end{pmatrix}$

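Notation:

$\mathbf{q} = \begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}$ :
- This is the vector being encoded (the query vector here; the key vector is handled in the same way), and $q_0, q_1, \ldots, q_{d-1}$ are its components.
- $d$ is the dimension of the vector. It must be even, because the components are rotated in pairs, and each pair uses its own frequency.
- Each $q_i$ is an ordinary component of the query (or key) vector; unlike in Sinusoidal encoding, it is not generated from sine or cosine waves.
$\theta_i$ :
- This is the rotation angle per position step, which controls how fast each pair of dimensions rotates with position. It is usually computed as:
  $\theta_i = 10000^{\frac{-2i}{d}}$
- With this choice, RoPE exhibits a long-range decay: as the relative distance between two positions grows, the inner product of the encoded vectors tends to shrink.
- $i$ indexes the rotation angle. For the $j$-th dimension of the even-dimensional vector, $i = \left\lfloor \frac{j}{2} \right\rfloor$, so $i = 0, 1, \ldots, \frac{d}{2} - 1$.
- $d$ is the total dimension of the vector.
$m$ :
- $m$ is the position index of the vector in the input sequence; it is fixed for a given token.
$\odot$ :
- This denotes the element-wise product, also called the Hadamard product: corresponding elements of the two vectors are multiplied.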
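The element-wise formula can be turned into code fairly directly. The following is a minimal pure-Python sketch, not a production implementation: `rope` follows the element-wise form with $\theta_i = 10000^{-2i/d}$ and $i = \lfloor j/2 \rfloor$, while `rope_complex` is a reference that rotates each pair $(q_{2i}, q_{2i+1})$ as a complex number. Both function names and the `base` parameter are assumptions made for this example.

```python
import cmath
import math

def rope(x, m, base=10000.0):
    """Apply RoPE to an even-length list x at position m (element-wise form)."""
    d = len(x)
    out = [0.0] * d
    for j in range(d):
        angle = m * base ** (-2.0 * (j // 2) / d)   # m * theta_i, i = floor(j / 2)
        # Matching entry of the swapped vector (-q1, q0, -q3, q2, ...).
        partner = -x[j + 1] if j % 2 == 0 else x[j - 1]
        out[j] = x[j] * math.cos(angle) + partner * math.sin(angle)
    return out

def rope_complex(x, m, base=10000.0):
    """Reference: rotate each pair (x[2i], x[2i+1]) as a complex number."""
    d = len(x)
    out = []
    for i in range(d // 2):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * m * base ** (-2.0 * i / d))
        out += [z.real, z.imag]
    return out

q = [0.1, -0.4, 1.3, 0.7]
a, b = rope(q, 5), rope_complex(q, 5)
# The element-wise form agrees with the pairwise complex rotation.
assert all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b))
```

The element-wise version touches each component once, so its cost grows linearly with $d$, whereas multiplying by the dense $R_m$ would grow quadratically.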