
Notes on "Long Context Compression with Activation Beacon"

article 2025/1/22 11:15:19

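Activation Beacon comes from the paper "Long Context Compression with Activation Beacon", posted to arXiv in January 2024 by BAAI (Beijing Academy of Artificial Intelligence) and Renmin University of China (the v1 title was "Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon"). It introduces beacon tokens that distill context information into their activations. During compression the text is split into fixed-size chunks, and each chunk is further divided into smaller units according to the compression ratio $\alpha$, with one beacon token inserted after each unit. The LLM encodes one chunk at a time; during self-attention the chunk's information is distilled into the activations of the beacon tokens, so the whole long text is compressed step by step. The paper's experiments indicate that this speeds up inference and reduces KV cache memory usage.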


Approach

As illustrated in Figure 1 of the paper, the input text $[x_1, \ldots, x_n]$ is partitioned into chunks of the same size w (e.g. 1024):
$[x_1, \ldots, x_n] \xrightarrow{\text{Partition}} [X_1, \ldots X_{\lceil n/w \rceil}], X_i=[x_{(i-1)w+1}, \ldots,x_{iw}] = [x^i_1, \ldots, x^i_w]$
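Each chunk $X_i$ is assigned a compression ratio $\alpha_i$ (w must be divisible by $\alpha_i$), which divides the chunk into finer-grained units of size $\alpha_i$. A group of $k_i=w/\alpha_i$ beacon tokens, $B_i=[\langle \mathbf{b} \rangle^i_1, \ldots, \langle \mathbf{b} \rangle^i_{k_i}]$, is interleaved with these units, one beacon after each unit: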
$X_i \xrightarrow{\text{Interleave} \ B_i} X^{\prime}_i = [x^i_1, \ldots, x^i_{\alpha_i}, \langle \mathbf{b} \rangle^i_1, \ldots, x^i_{w-\alpha_i +1}, \ldots, x^i_w, \langle \mathbf{b} \rangle^i_{k_i}]$
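The LLM encodes these chunks one by one, and during self-attention the information of each chunk is compressed into the activations of its beacon tokens. After $X^{\prime}_i$ has been encoded, the activations of all raw tokens of $X_i$ are discarded, while the activations of the beacon tokens $B_i$ are kept and accumulated. When the next chunk $X^{\prime}_{i+1}$ is encoded, the LLM uses the accumulated beacon activations as a proxy for the original context $X_{\le i}$.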
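The partition and interleave steps can be sketched in a few lines of plain Python. This is only an illustration of the two formulas above, not the authors' code: the `BEACON` placeholder string and the function names `partition` and `interleave` are made up for this note, and a single compression ratio is used for every chunk.

```python
BEACON = "<b>"  # hypothetical placeholder standing in for the beacon token


def partition(tokens, w):
    """Split tokens into consecutive chunks of size w (the last one may be shorter)."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def interleave(chunk, alpha):
    """Insert one beacon after every alpha raw tokens of the chunk."""
    assert len(chunk) % alpha == 0, "chunk size must be divisible by alpha"
    out = []
    for start in range(0, len(chunk), alpha):
        out.extend(chunk[start:start + alpha])  # one unit of alpha raw tokens
        out.append(BEACON)                      # the beacon that summarizes it
    return out


tokens = [f"x{i}" for i in range(1, 17)]              # n = 16
chunks = partition(tokens, w=8)                       # two chunks of 8 tokens
encoded = [interleave(c, alpha=4) for c in chunks]    # k_i = 8 / 4 = 2 beacons each
kept = sum(row.count(BEACON) for row in encoded)      # beacon activations retained
```

With n = 16, w = 8 and a ratio of 4, only 4 beacon activations survive instead of 16 raw-token activations.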


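As shown in Figure 2 of the paper, Activation Beacon requires only minor changes to a standard LLM. For the i-th chunk $X^{\prime}_i$, the encoding process can be written as:
$\operatorname{LLM}(\underbrace{\langle\mathbf{b}\rangle_1^1, \ldots,\langle\mathbf{b}\rangle_{k_{i-1}}^{i-1}}_{\text {beacon activations accumulated from } X_{<i}^{\prime}}, \underbrace{x_1^i, \ldots, x_{\alpha_i}^i,\langle\mathbf{b}\rangle_1^i, \ldots, x_{w-\alpha_i+1}^i, \ldots, x_w^i,\langle\mathbf{b}\rangle_{k_i}^i}_{\text {the current chunk } X_i^{\prime}}),$
That is, the LLM's input is a mixture of the beacon activations accumulated from previous chunks and the tokens of the current chunk that still need to be encoded. Let D be the LLM's hidden size, and let $\boldsymbol{H} \in \mathbb{R}^{(w+k_i) \times D}$ be the input hidden states of the self-attention in an arbitrary layer. Raw tokens and beacon tokens are distinguished as follows: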
$\mathbb{I}^r=\left\{j \mid x_j^i \neq\langle\mathbf{b}\rangle\right\}, \quad \mathbb{I}^b=\left\{j \mid x_j^i=\langle\mathbf{b}\rangle\right\} ; \quad \boldsymbol{H}^r=\boldsymbol{H}\left[\mathbb{I}^r\right], \quad \boldsymbol{H}^b=\boldsymbol{H}\left[\mathbb{I}^b\right] .$
The hidden states are projected into queries, keys, and values:
$\begin{array}{lll} \boldsymbol{Q}^r=\boldsymbol{W}_Q^r \boldsymbol{H}^r, & \boldsymbol{K}^r=\boldsymbol{W}_K^r \boldsymbol{H}^r, & \boldsymbol{V}^r=\boldsymbol{W}_V^r \boldsymbol{H}^r, \\ \boldsymbol{Q}^b=\boldsymbol{W}_Q^b \boldsymbol{H}^b, & \boldsymbol{K}^b=\boldsymbol{W}_K^b \boldsymbol{H}^b, & \boldsymbol{V}^b=\boldsymbol{W}_V^b \boldsymbol{H}^b, \end{array}$
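Here $\boldsymbol{W}^r_*$ are the LLM's original projection matrices and $\boldsymbol{W}^b_*$ are newly introduced projection matrices that handle only beacon tokens. The query/key/value states of the raw tokens and the beacon tokens are then scattered back to their original positions to obtain $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} \in \mathbb{R}^{(w+k_i) \times D}$: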
$\boldsymbol{Q}\left[\mathbb{I}^r\right]= \boldsymbol{Q}^r,\boldsymbol{Q}\left[\mathbb{I}^b\right]= \boldsymbol{Q}^b, \quad \boldsymbol{K}\left[\mathbb{I}^r\right]= \boldsymbol{K}^r,\boldsymbol{K}\left[\mathbb{I}^b\right]= \boldsymbol{K}^b, \quad \boldsymbol{V}\left[\mathbb{I}^r\right]= \boldsymbol{V}^r,\boldsymbol{V}\left[\mathbb{I}^b\right]= \boldsymbol{V}^b$
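The split, project, and merge steps amount to routing each position through the projection that matches its token type while keeping the original order. The toy sketch below shows this for a single projection (say, the key projection) with list-based matrices. The names `matvec` and `project` and the tiny 2x2 weights are assumptions chosen for readability, not anything from the released code.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]


def project(hidden, is_beacon, W_raw, W_beacon):
    """Project every position with the matrix for its token type, preserving order."""
    return [matvec(W_beacon if b else W_raw, h) for h, b in zip(hidden, is_beacon)]


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, D = 2
is_beacon = [False, False, True]               # the last position is a beacon token
W_raw = [[1.0, 0.0], [0.0, 1.0]]               # stands in for the frozen LLM weights
W_beacon = [[2.0, 0.0], [0.0, 2.0]]            # stands in for the new beacon weights

K = project(hidden, is_beacon, W_raw, W_beacon)
```

Only the third row of `K` is affected by `W_beacon`, which mirrors the fact that the raw-token path of the LLM is left unchanged.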
Finally, self-attention is computed in the standard way:
$\boldsymbol{A} = \text{softmax}\left(\text{mask} \left( \frac{\boldsymbol{Q}\{\boldsymbol{K}^{ac}; \boldsymbol{K} \}^T }{\sqrt{D}} \right)\right), \quad \boldsymbol{V} = \boldsymbol{A}\{\boldsymbol{V}^{ac};\boldsymbol{V} \}$
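Here $\{ \cdot ; \cdot\}$ denotes matrix concatenation, $\boldsymbol{K}^{ac}, \boldsymbol{V}^{ac} \in \mathbb{R}^{m_{i-1} \times D}$ are the beacon-token activations accumulated from the previous chunks, with $m_{i-1} = \sum^{i-1}_{j=1} k_j$, and mask is the causal attention mask. The $\boldsymbol{V}$ on the left-hand side of the second equation denotes the attention output. During self-attention every token interacts with the tokens visible to it under the mask, so the keys and values of the beacon tokens ( $\boldsymbol{K}^{b}, \boldsymbol{V}^{b}$ ) distill the context of $X_i$. They are accumulated incrementally: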
$\boldsymbol{K}^{ac} = \{\boldsymbol{K}^{ac}; \boldsymbol{K}^{b}\}, \boldsymbol{V}^{ac} = \{\boldsymbol{V}^{ac};\boldsymbol{V}^{b} \}$
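To get a feel for the sizes involved, here is a small bookkeeping sketch. It assumes the same compression ratio for every chunk and a context length that is a multiple of w; the function names are hypothetical. `attention_mask` builds the mask shape implied by the formula above: every query of the current chunk sees all accumulated beacons, plus the current-chunk positions up to and including itself.

```python
def beacon_cache_len(n_tokens, w, alpha):
    """Beacon activations kept per layer after compressing all full chunks."""
    return (n_tokens // w) * (w // alpha)


def attention_mask(m_prev, chunk_len):
    """1 = visible. Columns: m_prev accumulated beacons, then the current chunk."""
    return [
        [1] * m_prev + [1 if col <= row else 0 for col in range(chunk_len)]
        for row in range(chunk_len)
    ]


n, w, alpha = 131072, 1024, 8
m = beacon_cache_len(n, w, alpha)  # cache length once everything is compressed
# Longest key/value sequence seen while encoding the final chunk:
# beacons from the earlier chunks + the current chunk with its own beacons.
peak = beacon_cache_len(n - w, w, alpha) + w + w // alpha

mask = attention_mask(m_prev=2, chunk_len=3)
```

In this setting the retained cache is n / 8 entries per layer, and the peak length during encoding exceeds it by roughly one chunk.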


### The code below is how the Activation Beacon implementation interleaves beacon tokens; it is located in the `_step` function of the `Memory` class in model_beacon.py
            input_len = input_ids.shape[1]
            if beacon_size > 0:
                # insert beacon tokens in between raw tokens; corresponds to Eq. (2) in the paper
                input_ids_with_beacons = input_ids.new_full((input_ids.shape[0], input_len + beacon_size), self.beacon_token.item())
                raw_token_indices = torch.arange(input_ids_with_beacons.shape[1], device=input_ids.device)
                interleave_start_idx = compression_ratio - self._interleave_remainder
                raw_token_indices = raw_token_indices[raw_token_indices % (compression_ratio + 1) != interleave_start_idx].unsqueeze(0).expand_as(input_ids)
                input_ids_with_beacons = input_ids_with_beacons.scatter(dim=1, index=raw_token_indices, src=input_ids)
                input_ids = input_ids_with_beacons
                # attention mask
                ## beacon tokens take part in attention, so the default value is 1
                attention_mask_with_beacons = attention_mask.new_full((attention_mask.shape[0], attention_mask.shape[1] + beacon_size), 1)
                attention_mask_with_beacons = attention_mask_with_beacons.scatter(dim=1, index=raw_token_indices, src=attention_mask)
                attention_mask = attention_mask_with_beacons
                # labels
                if labels is not None:
                    ## beacon tokens are excluded from the loss, so their label is -100
                    labels_with_beacons = labels.new_full((labels.shape[0], labels.shape[1] + beacon_size), -100)
                    labels_with_beacons = labels_with_beacons.scatter(dim=1, index=raw_token_indices, src=labels)
                    labels = labels_with_beacons
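
The index arithmetic in the excerpt is easier to follow on a tiny example. The sketch below mirrors it with plain Python lists for a single sequence. It is a simplification under two assumptions of mine: `_interleave_remainder` is 0, and the input length is a multiple of the compression ratio so that `beacon_size` is simply `input_len // compression_ratio`. The function name `interleave_ids` is hypothetical.

```python
def interleave_ids(input_ids, compression_ratio, fill_value):
    """Pure-Python mirror of the new_full + scatter pattern for one sequence."""
    input_len = len(input_ids)
    beacon_size = input_len // compression_ratio  # assumption: no remainder
    total = input_len + beacon_size
    out = [fill_value] * total                    # like new_full(..., fill_value)
    interleave_start_idx = compression_ratio      # assumption: remainder is 0
    # Within every block of (ratio + 1) slots, the last slot is left for the beacon.
    raw_positions = [
        p for p in range(total)
        if p % (compression_ratio + 1) != interleave_start_idx
    ]
    for pos, tok in zip(raw_positions, input_ids):  # like scatter(dim=1, ...)
        out[pos] = tok
    return out, raw_positions


# Token ids: beacon slots are filled with a made-up beacon id 0.
ids, raw_pos = interleave_ids([11, 12, 13, 14], compression_ratio=2, fill_value=0)
# Labels: beacon slots are filled with -100 so they are ignored by the loss.
labels, _ = interleave_ids([11, 12, 13, 14], compression_ratio=2, fill_value=-100)
```

The same `raw_positions` are reused for ids, attention mask, and labels in the original code, which is why only the fill value differs between the three tensors.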

Training

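The learning objective of Activation Beacon is to improve generation quality conditioned on the current chunk's context and the previously compressed information. Written as the standard next-token negative log-likelihood, the loss is:
$\min _{\boldsymbol{\Theta}^b} \; -\sum_{i=2}^{\lceil n / w\rceil} \sum_{j=1}^w \log \operatorname{Pr}\left(x_j^i \mid\langle\mathbf{b}\rangle_1^1, \ldots,\langle\mathbf{b}\rangle_{k_{i-1}}^{i-1}, x_1^i, \ldots x_{j-1}^i ; \mathbf{\Theta}, \boldsymbol{\Theta}^b\right) .$
Here $\mathbf{\Theta}$ denotes the LLM's parameters, which are frozen during training, and $\mathbf{\Theta^b}$ consists of the beacon-token projection matrices $\boldsymbol{W}^b_*$ in every layer plus the beacon token embedding $\mathbf{e}_{\langle b \rangle}$ (all beacon tokens share a single embedding). Beacon tokens are excluded from the loss during training (their labels are set to -100) because they serve only for compression.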
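The effect of the -100 labels can be shown with a toy loss function. This is a minimal sketch, not the training code: it ignores the one-position shift between inputs and targets that next-token prediction uses, and `masked_nll` and the probabilities are invented for the example.

```python
import math

IGNORE = -100  # label value assigned to beacon positions


def masked_nll(probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE.

    probs[t] is the predicted distribution at position t (a list of probabilities).
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != IGNORE]
    return sum(losses) / len(losses)


probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
labels = [0, IGNORE, 0]  # the middle position is a beacon token
loss = masked_nll(probs, labels)

# Changing the prediction at the beacon position leaves the loss untouched.
same = masked_nll([[0.5, 0.5], [0.99, 0.01], [0.9, 0.1]], labels)
```

Beacon positions still influence the loss indirectly, because later raw tokens attend to the beacon activations.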

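During training, the compression ratio $\alpha_i$ of the i-th chunk is sampled at random from {2, 4, 8, 16, 32}, so that the model can flexibly support different compression granularities. At inference time, one compression ratio can be chosen according to the downstream task and applied to all chunks.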
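A sketch of the difference between the two regimes, assuming uniform sampling (the text above only says "random") and a chunk size of 1024. The helper names are hypothetical.

```python
import random

RATIOS = [2, 4, 8, 16, 32]  # candidate compression ratios listed above


def sample_ratios(num_chunks, rng):
    """Training: draw an independent ratio for every chunk (uniform is an assumption)."""
    return [rng.choice(RATIOS) for _ in range(num_chunks)]


def beacons_per_chunk(w, ratios):
    """k_i = w / alpha_i for each chunk."""
    return [w // a for a in ratios]


rng = random.Random(0)
train_ratios = sample_ratios(4, rng)
train_beacons = beacons_per_chunk(1024, train_ratios)

# Inference: one ratio picked for the task and reused for every chunk.
infer_beacons = beacons_per_chunk(1024, [8] * 4)
```

Because every candidate ratio divides 1024, each sampled chunk ends with a complete unit followed by a beacon.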

Training consists of pre-training and fine-tuning; the ablation study shows that both stages improve model quality.

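Note that by default Activation Beacon interleaves the beacon tokens with the original context (`interleave` in the code). In the paper's ablation, placing all beacon tokens at the end of the chunk instead (`append` in the code) lowers performance.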
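The two placements differ only in where the beacon slots land inside a chunk. The toy layout function below makes that concrete; the mode names follow the code, while `layout` and `beacon_positions` are hypothetical helpers written for this note.

```python
def layout(w, alpha, mode):
    """Return 'x' (raw token) / 'b' (beacon) markers for one chunk."""
    k = w // alpha
    if mode == "append":
        return ["x"] * w + ["b"] * k  # all beacons after the whole chunk
    out = []
    for _ in range(k):
        out += ["x"] * alpha + ["b"]  # one beacon right after each unit
    return out


def beacon_positions(seq):
    return [i for i, t in enumerate(seq) if t == "b"]


inter = beacon_positions(layout(8, 4, "interleave"))
app = beacon_positions(layout(8, 4, "append"))
```

Under a causal mask, an interleaved beacon sits directly after the unit it is paired with, whereas every appended beacon sees the entire chunk from the same vantage point at the end.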


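Note: Activation Beacon and MemoRAG come from the same team, so understanding this paper makes MemoRAG's memory model easier to understand. (Comparing the memory-model code of the two papers, it is almost identical; it is a bit odd that MemoRAG neither cites this paper nor documents the code. I could not make sense of MemoRAG's memory-model code, and found this paper by searching for the keyword "beacon".)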
