当前位置：首页 > article >正文

LLM - 理解多模态大模型 Qwen2-VL 的 NDR 与 M-RoPE 教程

article 2025/2/11 12:09:40

欢迎关注我的CSDN：https://spike.blog.csdn.net/
本文地址：https://spike.blog.csdn.net/article/details/145551027

Qwen2-VL 是多模态语言模型，在自然语言处理和视觉理解领域展现出卓越的性能，通过深度融合语言和视觉信息，高效地处理图文混合输入，精准理解图像内容，以及生成与之相关的高质量文本描述。

Paper: Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

目前排名(2025.2.7)：
Qwen2-VL

模型框架：
Qwen2-VL

1. 概述

Qwen2-VL 系列，是 Qwen-VL 模型的升级，主要包括 4 个改进，即：

引入 原始动态分辨率(Naive Dynamic Resolution) 机制，动态处理不同分辨率的图像，转换为不同数量的视觉标记。
集成 多模态旋转位置嵌入(Multimodal Rotary Position Embedding, M-RoPE)，在文本、图像和视频之间融合位置信息。
使用 统一范式(Unified Paradigm) 处理图像和视频，增强模型的视觉感知能力。
探索大型视觉语言模型(Large Vision-Language Models, LVLMs) 的 Scaling Laws。

多模态大模型的常见结构，预测下一个 Token，网络如下：
$\ encoder \to cross\text{-}modal \ connector \to LLM$
其中，LLM 的框架包括 Dense 和 MoE 两种类型。

之前 VLM：

之前 VLM 的图像输入，只支持 $224 \times 224$ ，划分成 $14 \times 14$ 的小 Patch，即包括 $16 \times 16=196$ 个 Patch，即 196 个 Tokens。
之前 VLM 的视觉编码器，使用冻结(frozen) 的 CLIP 模型的 ViT。

2. 创新

Qwen2-VL 在 ViT 中使用 3D-RoPE，扩展至不同的空间尺寸，即 原始动态分辨率(Naive Dynamic Resolution)，其中，视觉编码器都是 675M，LLM 规模不同(1.5B/7.6B/72B)。

视觉编码器 ViT 使用 DFN(Data Filtering Networks, 数据过滤网络) ViT 权重初始化，位置编码替换为 3D-RoPE。
Patch 尺寸同样是 $14 \times 14$ ，每 4 个Token，使用 MLP 压缩至 1 个 Token，例如图像 $28 \times 224$ ，包括 $\times 16 = 32$ 个 Patch，压缩之后是 8 个 Token。
图像保持原有的长宽比，不再使用最大边缩放。
视觉使用 2 个特殊 Tokens，表示视觉序列的开始和结束，即 <|vision_start|> 和 <|vision_end|>

DFN(Data Filtering Networks, 数据过滤网络) 是用于筛选大规模未整理数据集的小型神经网络，从海量的网络数据中，筛选出高质量的训练数据，以用于预训练模型，DFN ViT(Vision Transformer) 使用 DFN 筛选的数据集训练 ViT 模型。

2.1 MLP 压缩

其中，MLP 压缩模块，即 PatchMerger，图像 Token 维度是 1280，4 个 Token 合并 1 组，即 $\times 1280=5120$ 维，先做一次线性映射，再对齐 1 个文本 Token 维度，即 3584 维 (Qwen2-7B)，即：

(merger): PatchMerger(
  (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
  (mlp): Sequential(
    (0): Linear(in_features=5120, out_features=5120, bias=True)
    (1): GELU(approximate='none')
    (2): Linear(in_features=5120, out_features=3584, bias=True)
  )
)

2.2 M-RoPE

其中，3D-RoPE(Multimodal RoPE) 的逻辑如下：

以经典的时间、高、宽(thw) 为例，构建 3 维的 position_ids，增加 bs 维度，即 $[t h w, b s, s] = [3, 2, 10]$
position_ids ( $[3, 2, 1, 10]$ ) 与 inv_freq(half dim $[3, 1, 10, 64]$ )，矩阵相乘( $[3, 2, 10, 64]$ )，再拼接2次，计算 cos 和 sin，组成与 qk 相同的维度 $[3, 2, 10, 128]$
以 mrope_section 将 cos 和 sin，划分 2 个 3 份(thw)，合计 6 份，即 $[[3, 2, 10, 16], [3, 2, 10, 24], [3, 2, 10, 24], [3, 2, 10, 16], [3, 2, 10, 24], [3, 2, 10, 24]]$
再查询各自的维度(0/1/2)，即 $[[2, 10, 16], [2, 10, 24], [2, 10, 24], [2, 10, 16], [2, 10, 24], [2, 10, 24]]$
再做合并(concat)，扩展 head 维，第2维，即 $[2, 1, 10, 128]$
最后计算，引入 $rotate_{half} \to(-x_{2}, x_{1})$ ，参考公式：

$q_{embed} = (q * cos) + (rotate_{half}(q) * sin) \\ k_{embed} = (k * cos) + (rotate_{half}(k) * sin)$

源码：

import torch
def multimodal_rotary_pos_emb(position_ids, dim, theta=10000.0):
    # dim=128, inv_freq=[64]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))  # 逆频率，即频率倒数
    print(f"inv_freq: {inv_freq.shape}")
    # inv_freq_expanded=[3, 10, 64, 1]
    bs = position_ids.size(1)
    inv_freq_expanded = inv_freq[None, None, :, None].float().expand(3, bs, -1, 1)
    print(f"inv_freq_expanded: {inv_freq_expanded.shape}")

    # position_ids_expanded=[3, 2, 1, 10]
    position_ids_expanded = position_ids[:, :, None, :].float()
    print(f"position_ids_expanded: {position_ids_expanded.shape}")

    # freqs=[3, 2, 10, 64]
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
    print(f"freqs: {freqs.shape}")
    emb = torch.cat((freqs, freqs), dim=-1)  # [3, 2, 10, 128]
    cos = emb.cos()  # [3, 2, 10, 128]
    sin = emb.sin()  # [3, 2, 10, 128]
    print(f"emb: {emb.shape}, cos: {cos.shape}, sin: {sin.shape}")
    return sin, cos
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)
def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
    """Applies Rotary Position Embedding with Multimodal Sections to the query and key tensors."""
    # mrope_section 用于确定图像维度(例如128维)中，时间、高度、宽度，不同占比 thw
    # mrope_section 默认是[16,24,24]，和是64，2倍是128, 正好是1280/10=128，10个head的维度
    # mrope_section=[16, 24, 24] sum(mrope_section) == dim//2
    # sim = cos = [thw, bs, seq, dim]
    mrope_section = mrope_section * 2  # sum([16, 24, 24, 16, 24, 24]) = 128
    # [[3, 2, 10, 16], [3, 2, 10, 24], [3, 2, 10, 24], [3, 2, 10, 16], [3, 2, 10, 24], [3, 2, 10, 24]]
    cos_split = cos.split(mrope_section, dim=-1)
    sin_split = sin.split(mrope_section, dim=-1)
    print("cos_split: ", [list(x.size()) for x in cos_split])  # cos_split=[16, 24, 24, 16, 24, 24]

    # 获取 thw 各自的维度: [[2, 10, 16], [2, 10, 24], [2, 10, 24], [2, 10, 16], [2, 10, 24], [2, 10, 24]]
    cos_group = [m[i % 3] for i, m in enumerate(cos_split)]
    sin_group = [m[i % 3] for i, m in enumerate(sin_split)]
    print("cos_group: ", [list(x.size()) for x in cos_group])
    # cos=[2, 1, 10, 128]
    cos = torch.cat(cos_group, dim=-1).unsqueeze(unsqueeze_dim)  # unsqueeze 是 bs 维度
    sin = torch.cat(sin_group, dim=-1).unsqueeze(unsqueeze_dim)
    # [2, 8, 10, 128]*[2, 1, 10, 128]=[2, 8, 10, 128]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
def main():
    bs, h, s, d = 2, 8, 10, 128
    thw = 3  # Time, Height, Width
    mrope_section = [16, 24, 24]  # [16, 24, 24] sum(mrope_section) == dim//2
    q = torch.randn(bs, h, s, d)  # [2, 8, 10, 128]
    k = torch.randn(bs, h, s, d)
    print("q:", q.shape)
    print("k:", k.shape)
    position_ids = torch.arange(s).view(1, 1, -1).expand(thw, bs, -1)  # [3, 2, 10]
    print(f"position_ids: {position_ids.shape}")
    sin, cos = multimodal_rotary_pos_emb(position_ids, d, theta=10000.0)
    # rh: tensor([[-5, -6, -7, -8, -9,  0,  1,  2,  3,  4]])
    rh = rotate_half(torch.arange(10).unsqueeze(0))
    print(f"rh: {rh}")
    q_, k_ = apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1)
    print(f"q_: {q_.shape}")  # ([2, 8, 10, 128])
    print(f"k_: {k_.shape}")  # ([2, 8, 10, 128])
if __name__ == '__main__':
    main()

3. 模型训练

模型训练： 与 Qwen-VL 一致，Qwen2-VL 使用 3 阶段训练方法。

模型在包含 (1)图像-文本对、(2)OCR数据、(3)交错的图像-文本文章、(4)VQA数据集、(5)视频对话以及(6)图像知识数据集的多样化数据集上，预训练。我数据来源主要包括清理后的网页、开源数据集以及合成数据。数据知识的截止日期是2023年6月。多样化的数据组合，对于发展强大的多模态理解能力至关重要。

初始预训练(Initial Pre-Training) 阶段：Qwen2-VL 使用 600B Tokens 的语料库(Corpus)，LLM 组件使用 Qwen2 参数初始化，视觉编码器(Vision Encoder) 使用 DFN((Data Filtering Networks, 数据过滤网络) ViT 初始化，固定的绝对位置嵌入(Fixed Position Embedding) 替换为 RoPE-3D。主要专注于学习图像与文本关系(Image-Text Relationships)、OCR、图像分类(Image Classification)，基础训练使模型发展出对于核心视觉与文本相关性，以及对齐稳健理解。
预训练(Pre-Training)阶段： 使用额外 800B Tokens，使用更大数量的混合图像文本内容，有助于模型更细致地理解视觉与文本之间的相互作用。视觉问答(VQA, Visual Question Answering) 数据集提升模型，对于图像相关问题的响应。多任务数据集对于发展模型同时处理多样化任务的能力至关重要。同时，纯文本数据在维持和提升模型的语言能力也很重要。
指令微调(Instruction Fine-Tuning) 阶段： 在指令微调阶段，采用 ChatML（OpenAI）格式构建遵循指令的数据。该数据集不仅包括纯文本对话数据，还包括多模态对话数据。多模态部分包括图像问答、文档解析、多图像比较、视频理解、视频流对话以及基于代理的交互。我们全面的数据构建方法旨在增强模型在各种模态中理解和执行广泛指令的能力。通过整合多样化的数据类型，我们致力于开发一种更灵活且强大的语言模型，使其能够处理复杂的多模态任务，以及传统的基于文本的交互。

4. 其他

例如 Qwen2-VL-7B-Instruct 的网络参数：

Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionSdpaAttention(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=5120, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=3584, bias=True)
      )
    )
  )
  (model): Qwen2VLModel(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2VLDecoderLayer(
        (self_attn): Qwen2VLSdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2VLRotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2VLRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)

Qwen2-VL-7B-Instruct 的总参数量：[Info] parameters: 8291375616

视觉模型的参数量：[Info] parameters model.visual: 675759104
语言模型的参数量：[Info] parameters model.model: 7070619136 + [Info] parameters model.lm_head: 544997376

即：675759104(8.15%) + 7070619136(85.28%) + 544997376(6.57%) = 8291375616 = 8B

其他，参考 1D-RoPE，即：

def precompute_freqs_cis(seq_len, dim, theta=10000.0):
    """
    计算 freqs_cis, 即 频率(frequencies) + cis(cos isin)
    """
    half_dim = dim // 2  # RoPE的维度是极坐标，是dim的1/2
    freqs = 1.0 / (theta ** (torch.arange(0, half_dim) / half_dim))
    t = torch.arange(seq_len)  # type: ignore
    freqs = torch.outer(t, freqs)  # type: ignore
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis
def apply_rotary_emb(q, k, freqs_cis):
    # [2, 8, 10, 64] -> [2, 8, 10, 32] (complex)
    xq = torch.view_as_complex(q.reshape(*q.shape[:-1], -1, 2))    # 转换成 complex 形式
    xk = torch.view_as_complex(k.reshape(*k.shape[:-1], -1, 2))    # 转换成 complex 形式
    # [2, 8, 10, 32, 2] -> [2, 8, 10, 64]
    xq_out = torch.view_as_real(xq * freqs_cis).flatten(3)  # flatten 第3维度
    xk_out = torch.view_as_real(xk * freqs_cis).flatten(3)
    return xq_out, xk_out