
MLM / Qwen: A Detailed Guide to Qwen2-VL (Introduction, Installation and Usage, Example Applications)

article 2025/2/1 15:59:12

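Introduction to Qwen2-VL

1. Key Enhancements:

2. Model Architecture Updates:

3. Performance

Image Benchmarks

Video Benchmarks

Agent Benchmarks

Multilingual Benchmarks

4. News

5. Limitations

Installing and Using Qwen2-VL

1. Installation

2. Usage

(1) Chatting with Transformers

(2) ModelScope

More Usage Tips

Image Resolution for Better Performance

Adding IDs for Multiple Image Inputs

Add vision IDs

(4) Try the Qwen2-VL-72B API!

3. Quantization

(1) AWQ

Using AWQ-Quantized Models with Transformers

(2) GPTQ

Using GPTQ Models with Transformers

4. Benchmarks

(1) Performance of Quantized Models

Speed Benchmark

5. Deployment

6. Training

LLaMA-Factory

Installation

Data Preparation

Training

7. Function Calling

(1) A Simple Use Case:

8. Demo

Web UI Example

Installation

Running the Demo with FlashAttention-2

Selecting a Different Model (Qwen2-VL Series Only)

Customization

9. Docker

Example Applications of Qwen2-VL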

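Introduction to Qwen2-VL

On August 30, 2024, Alibaba Cloud released Qwen2-VL, the latest version of the vision-language models in the Qwen model family. Qwen2-VL is a series of multimodal large language models developed by the Qwen team at Alibaba Cloud.

GitHub repository: https://github.com/QwenLM/Qwen2-VL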

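1. Key Enhancements:

>> SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and others.

>> Understanding videos longer than 20 minutes: with online streaming capability, Qwen2-VL can understand videos over 20 minutes long and supports high-quality video-based question answering, dialogue, content creation, and more.

>> Agents that can operate mobile phones, robots, and other devices: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated into devices such as mobile phones and robots and operate them automatically based on the visual environment and text instructions.

>> Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now also understands text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others.

2. Model Architecture Updates:

>> Dynamic resolution handling: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which gives a more human-like visual processing experience.

>> Multimodal Rotary Position Embedding (M-ROPE): decomposes the positional embedding into several parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.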
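To make the idea of a decomposed position concrete, the following is a minimal sketch of how (temporal, height, width) position IDs could be assigned to a mixed sequence. It is an illustration only, not the model's implementation: the segment format and the function name `mrope_position_ids` are hypothetical, and the numbering rule (text tokens share one ID across all three parts; each new segment starts one past the largest ID used so far) is an assumption based on the description in the Qwen2-VL technical report.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def mrope_position_ids(segments: List[tuple]) -> List[Position]:
    """Assign 3-part position IDs to a sequence of text/image/video segments.

    Hypothetical segment formats (assumptions for this sketch):
      ("text", n_tokens)
      ("image", rows, cols)
      ("video", frames, rows, cols)
    """
    positions: List[Position] = []
    start = 0  # first free position ID
    for segment in segments:
        kind = segment[0]
        if kind == "text":
            # 1D text: all three parts carry the same ID.
            new = [(start + i, start + i, start + i) for i in range(segment[1])]
        elif kind in ("image", "video"):
            if kind == "image":
                frames, rows, cols = 1, segment[1], segment[2]
            else:
                frames, rows, cols = segment[1], segment[2], segment[3]
            # 2D/3D visual tokens: one ID per frame, row, and column.
            new = [
                (start + t, start + r, start + c)
                for t in range(frames)
                for r in range(rows)
                for c in range(cols)
            ]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        positions.extend(new)
        if new:
            # The next segment starts one past the largest ID used so far.
            start = max(max(p) for p in new) + 1
    return positions


ids = mrope_position_ids([("text", 2), ("image", 2, 2), ("text", 1)])
print(ids)
```

In this sketch a single image keeps a constant temporal ID, while a video advances the temporal ID once per frame.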

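We have open-sourced Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license and released the API of Qwen2-VL-72B! The open-source models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. Hope you enjoy!

3. Performance

Image Benchmarks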

Benchmark	Previous SoTA (Open-source LVLM)	Claude-3.5 Sonnet	GPT-4o	Qwen2-VL-72B (Coming soon)	Qwen2-VL-7B	Qwen2-VL-2B
MMMU (val)	58.3	68.3	69.1	64.5	54.1	41.1
DocVQA (test)	94.1	95.2	92.8	96.5	94.5	90.1
InfoVQA (test)	82.0	-	-	84.5	76.5	65.5
ChartQA (test)	88.4	90.8	85.7	88.3	83.0	73.5
TextVQA (val)	84.4	-	-	85.5	84.3	79.7
OCRBench	852	788	736	855	845	794
MTVQA	17.3	25.7	27.8	32.6	26.3	20.0
RealWorldQA	72.2	60.1	75.4	77.8	70.1	62.9
MME (sum)	2414.7	1920.0	2328.7	2482.7	2326.8	1872.0
MMBench-EN (test)	86.5	79.7	83.4	86.5	83.0	74.9
MMBench-CN (test)	86.3	80.7	82.1	86.6	80.5	73.5
MMBench-V1.1 (test)	85.5	78.5	82.2	85.9	80.7	72.2
MMT-Bench (test)	63.4	-	65.5	71.7	63.7	54.5
MMStar	67.1	62.2	63.9	68.3	60.7	48.0
MMVet (GPT-4-Turbo)	65.7	66.0	69.1	74.0	62.0	49.5
HallBench (avg)	55.2	49.9	55.0	58.1	50.6	41.7
MathVista (testmini)	67.5	67.7	63.8	70.5	58.2	43.0
MathVision	16.97	-	30.4	25.9	16.3	12.4

Video Benchmarks

Benchmark	Previous SoTA (Open-source LVLM)	Gemini 1.5-Pro	GPT-4o	Qwen2-VL-72B (Coming soon)	Qwen2-VL-7B	Qwen2-VL-2B
MVBench	69.6	-	-	73.6	67.0	63.2
PerceptionTest (test)	66.9	-	-	68.0	62.3	53.9
EgoSchema (test)	62.0	63.2	72.2	77.9	66.7	54.9
Video-MME (wo/w subs)	66.3/69.6	75.0/81.3	71.9/77.2	71.2/77.8	63.3/69.0	55.6/60.4

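Agent Benchmarks

	Benchmark	Metric	Previous SoTA	GPT-4o	Qwen2-VL-72B
General	FnCall[1]	TM	-	90.2	93.1
		EM	-	50.0	53.2
Game	Number Line	SR	89.4[2]	91.5	100.0
	BlackJack	SR	40.2[2]	34.5	42.6
	EZPoint	SR	50.0[2]	85.5	100.0
	Point24	SR	2.6[2]	3.0	4.5
Android	AITZ	TM	83.0[3]	70.0	89.6
		EM	47.7[3]	35.3	72.1
AI2THOR	ALFRED (valid-unseen)	SR	67.7[4]	-	67.8
		GC	75.3[4]	-	75.8
VLN	R2R (valid-unseen)	SR	79.0	43.7[5]	51.7
	REVERIE (valid-unseen)	SR	61.0	31.6[5]	31.0

SR, GC, TM, and EM stand for success rate, goal-condition success, type match, and exact match, respectively.
[1] Self-Curated Function Call Benchmark by Qwen Team
[2] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[3] Android in the Zoo: Chain-of-Action-Thought for GUI Agents
[4] ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
[5] MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation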

Multilingual Benchmarks

These results were evaluated on the MTVQA benchmark.

Models	AR	DE	FR	IT	JA	KO	RU	TH	VI	AVG
Qwen2-VL-72B	20.7	36.5	44.1	42.8	21.6	37.4	15.6	17.7	41.6	32.6
GPT-4o	20.2	34.2	41.2	32.7	20.0	33.9	11.5	22.5	34.2	27.8
Claude3 Opus	15.1	33.4	40.6	34.4	19.4	27.2	13.0	19.5	29.1	25.7
Gemini Ultra	14.7	32.3	40.0	31.8	12.3	17.2	11.8	20.3	28.6	23.2

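4. News

2024.08.30: We have released the Qwen2-VL series. The 2B and 7B models are now available, and the open-source 72B model is coming soon.

5. Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
>> Lack of audio support: the current model cannot understand audio information within videos.
>> Data timeliness: our image dataset is updated until June 2023, and information after this date may not be covered.
>> Constraints on individuals and intellectual property (IP): the model's ability to recognize specific individuals or IPs is limited, and it may not cover all well-known personalities or brands.
>> Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution capabilities need improvement.
>> Insufficient counting accuracy: particularly in complex scenes, the accuracy of object counting is low and needs further improvement.
>> Weak spatial reasoning: especially in 3D spaces, the model's inference of positional relationships between objects is inadequate, making it hard to judge the relative positions of objects precisely.
These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.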

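Installing and Using Qwen2-VL

1. Installation

Below, we provide simple examples that show how to use Qwen2-VL with ModelScope and Transformers.
The code of Qwen2-VL is in the latest Hugging Face Transformers, and we advise you to build from source with the following command: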

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you might encounter the following error:

KeyError: 'qwen2_vl'

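We offer a toolkit that helps you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it with the following command: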

pip install qwen-vl-utils
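As a rough illustration of the input forms mentioned above, the sketch below builds one user message that mixes a local file path, a URL, and base64-encoded bytes. It uses only the standard library. The helper names are hypothetical, and the `data:image;base64,` prefix is an assumption about the format the toolkit accepts, so check it against the qwen-vl-utils documentation before relying on it.

```python
import base64
from typing import List, Union


def image_entry_from_bytes(data: bytes) -> dict:
    """Wrap raw image bytes as a base64 data URI entry (prefix is an assumption)."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


def build_user_message(text: str, images: List[Union[str, bytes]]) -> dict:
    """Build a chat message with interleaved images followed by a text prompt."""
    content = []
    for image in images:
        if isinstance(image, bytes):
            content.append(image_entry_from_bytes(image))
        else:
            # Strings are passed through as-is: file:// paths or http(s) URLs.
            content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "Describe these images.",
        [
            "file:///path/to/your/image.jpg",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            b"\xff\xd8\xff",  # placeholder bytes, not a complete image
        ],
    )
]
print(messages[0]["content"][2]["image"])
```

The resulting `messages` list has the same shape as the one used in the chat example in the next section.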

2. Usage

(1) Chatting with Transformers

Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils.

import torch  # only needed for the torch.bfloat16 option shown below
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
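The trimming step in the snippet above is easy to misread, so here is a small standalone sketch of what it does, using plain lists instead of tensors. For a decoder-only model, `generate` returns each prompt followed by the newly generated tokens, and slicing by the prompt length keeps only the new part. The token IDs below are made up for illustration.

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence."""
    return [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Hypothetical token IDs: two prompts of length 3, each with 2 new tokens.
prompts = [[101, 7, 8], [101, 9, 0]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 0, 55, 56]]
new_tokens = trim_prompt_tokens(prompts, outputs)
print(new_tokens)  # [[42, 43], [55, 56]]
```

Without this step, `batch_decode` would return the prompt text together with the answer.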

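(2) ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you solve issues with downloading checkpoints.

Image Resolution for Better Performance

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels to reach the best configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.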

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

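In addition, we provide two methods for fine-grained control over the image size fed to the model:
>> Specify exact dimensions: set resized_height and resized_width directly. These values will be rounded to the nearest multiple of 28.
>> Define min_pixels and max_pixels: images will be resized to fall within the range of min_pixels and max_pixels while maintaining their aspect ratio.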
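The two controls above can be sketched in a few lines of arithmetic. This is a simplified illustration under stated assumptions, not the code used by the processor or by qwen_vl_utils: the function names are hypothetical, and the sketch assumes that one visual token corresponds to a 28x28 pixel area, as the `256 * 28 * 28` and `1280 * 28 * 28` settings suggest.

```python
import math
from typing import Tuple

PATCH = 28  # assumed pixel side length covered by one visual token


def round_to_multiple(value: int, factor: int = PATCH) -> int:
    """Round a side length to the nearest multiple of `factor` (at least one)."""
    return max(factor, round(value / factor) * factor)


def fit_to_pixel_budget(
    height: int,
    width: int,
    min_pixels: int = 256 * 28 * 28,
    max_pixels: int = 1280 * 28 * 28,
    factor: int = PATCH,
) -> Tuple[int, int]:
    """Resize (height, width) into [min_pixels, max_pixels], roughly keeping aspect ratio."""
    h = round_to_multiple(height, factor)
    w = round_to_multiple(width, factor)
    if h * w > max_pixels:
        # Shrink both sides by the same ratio, rounding down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge both sides by the same ratio, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def visual_token_count(height: int, width: int, factor: int = PATCH) -> int:
    """Number of visual tokens for an image whose sides are multiples of `factor`."""
    return (height // factor) * (width // factor)


h, w = fit_to_pixel_budget(1080, 1920)
print(h, w, visual_token_count(h, w))  # 728 1316 1222
```

With the default budget, a 1080x1920 frame is reduced to 728x1316, which corresponds to 1222 visual tokens and therefore stays under the 1280-token cap.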

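Adding IDs for Multiple Image Inputs

By default, images and video content are directly included in the conversation. When handling multiple images, it is helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:

Add vision IDs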
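The original snippet for this setting is not included here, so the following is only a conceptual sketch of what labeling visual inputs means. It does not call the processor: the function `label_visuals`, the `<image>` and `<video>` placeholders, and the "Picture N:" / "Video N:" label wording are all assumptions made for illustration. In the upstream project this behavior is, as far as we know, switched on through an `add_vision_id=True` argument to `processor.apply_chat_template`; treat that name as something to verify against the current documentation.

```python
from typing import List


def label_visuals(messages: List[dict]) -> List[str]:
    """Flatten a conversation into prompt pieces, numbering each image and video."""
    image_count = 0
    video_count = 0
    pieces: List[str] = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            pieces.append(content)
            continue
        for item in content:
            if item["type"] == "image":
                image_count += 1
                pieces.append(f"Picture {image_count}: <image>")
            elif item["type"] == "video":
                video_count += 1
                pieces.append(f"Video {video_count}: <video>")
            else:
                pieces.append(item["text"])
    return pieces


conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}]},
    {"role": "assistant", "content": "I'm doing well, thank you."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you compare these?"},
            {"type": "image"},
            {"type": "video"},
        ],
    },
]
labeled = label_visuals(conversation)
print(labeled)
```

Numbering runs across the whole conversation, so a later question can refer to "Picture 2" unambiguously.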

Flash-Attention 2 to Speed Up Generation

First, make sure to install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

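In addition, you need hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.

To load and run a model with Flash Attention-2, simply add attn_implementation="flash_attention_2" when loading the model, as follows: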

import torch  # required for torch.bfloat16 below
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2",
)

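(4) Try the Qwen2-VL-72B API!

To explore Qwen2-VL-72B, a more fascinating multimodal model, we encourage you to test our cutting-edge API service. Let's start the exciting journey right now!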

pip install dashscope


import dashscope
dashscope.api_key = "your_api_key"

messages = [{
    'role': 'user',
    'content': [
        {
            'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
        },
        {
            'text': 'What are in the image?'
        },
    ]
}]
# The model name 'qwen-vl-max-0809' is the identity of 'Qwen2-VL-72B'.
response = dashscope.MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
print(response)

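For more usage, please refer to the tutorial at Alibaba Cloud.

3. Quantization

For quantized models, we offer two types of quantization: AWQ and GPTQ.

(1) AWQ

We recommend using AWQ with AutoAWQ. AWQ refers to Activation-aware Weight Quantization, a hardware-friendly approach for LLM low-bit weight-only quantization. AutoAWQ is an easy-to-use package for 4-bit quantized models.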

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-AWQ",
#     torch_dtype="auto",
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # move the input tensors to the model's device

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Using AWQ-Quantized Models with Transformers

If you want to quantize your own model into an AWQ-quantized model, we advise you to use AutoAWQ. It is recommended to install the forked version of the package from source:

git clone https://github.com/kq-chen/AutoAWQ.git
cd AutoAWQ
pip install numpy gekko pandas
pip install -e .

Suppose you have fine-tuned a model based on Qwen2-VL-7B. To build your own AWQ-quantized model, you need to use the training data for calibration. Below is a simple example for you to run:

from transformers import Qwen2VLProcessor
from awq.models.qwen2vl import Qwen2VLAWQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load your processor and model with AutoAWQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
#     model_path, model_type="qwen2_vl", use_cache=False, attn_implementation="flash_attention_2"
# )
model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
    model_path, model_type="qwen2_vl", use_cache=False
)

Next, you need to prepare your data for calibration. All you need to do is put the samples into a list, each of which is a typical chat message as shown below. You can specify text and image in the content field, for example:

dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]

Here, we use an image caption dataset only for demonstration. You should replace it with your own SFT dataset.

def prepare_dataset(n_sample: int = 8) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()

Then process the dataset into tensors:

from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(
    dataset, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(dataset)
inputs = processor(
    text=text,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

Then run the calibration process with a single line of code:

model.quantize(calib_data=inputs, quant_config=quant_config)

Finally, save the quantized model:

model.model.config.use_cache = model.model.generation_config.use_cache = True
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
processor.save_pretrained(quant_path)

You now have your own AWQ-quantized model ready for deployment. Enjoy!
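For readers wondering what the `w_bit`, `q_group_size`, and `zero_point` entries of `quant_config` control, here is a minimal pure-Python sketch of group-wise low-bit weight quantization with a zero point. It is deliberately not AWQ itself: it uses plain round-to-nearest and omits the activation-aware scaling search that gives AWQ its name. The function names are hypothetical, and the zero point is left unclamped for simplicity.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, int]  # (integer codes, scale, zero point)


def quantize_group(weights: List[float], w_bit: int = 4) -> Group:
    """Map one group of weights to integers in [0, 2**w_bit - 1]."""
    qmax = 2 ** w_bit - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0:
        scale = 1.0  # all weights equal; any scale reproduces them exactly
    zero = round(-lo / scale)  # integer offset so that `lo` maps close to code 0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: int) -> List[float]:
    """Recover approximate weights from integer codes."""
    return [(c - zero) * scale for c in codes]


def quantize_groupwise(
    weights: List[float], q_group_size: int = 128, w_bit: int = 4
) -> List[Group]:
    """Quantize consecutive groups independently, each with its own scale and zero."""
    return [
        quantize_group(weights[i:i + q_group_size], w_bit)
        for i in range(0, len(weights), q_group_size)
    ]


# Deterministic toy weights in [-1, 1]; 256 values form two groups of 128.
weights = [((i * 37) % 101) / 50 - 1 for i in range(256)]
groups = quantize_groupwise(weights)
restored = [w for codes, scale, zero in groups for w in dequantize_group(codes, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_error)
```

A smaller group size means more scales and zero points to store but a tighter fit per group, which is the trade-off behind the `q_group_size` setting.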

(2) GPTQ

Using GPTQ Models with Transformers

Transformers now officially supports AutoGPTQ, which means you can use the quantized model with Transformers directly. Below is a very simple code snippet showing how to run Qwen2-VL-7B-Instruct-GPTQ-Int4:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
#     torch_dtype=torch.bfloat16,  # requires `import torch`
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # move the input tensors to the model's device

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

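Quantize Your Own Model with AutoGPTQ

If you want to quantize your own model into a GPTQ-quantized model, we advise you to use AutoGPTQ. It is recommended to install the forked version of the package from source: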

git clone https://github.com/kq-chen/AutoGPTQ.git
cd AutoGPTQ
pip install numpy gekko pandas
pip install -vvv --no-build-isolation -e .

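Suppose you have fine-tuned a model based on Qwen2-VL-7B. To build your own GPTQ-quantized model, you need to use the training data for calibration. Below is a simple demonstration for you to run: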

from transformers import Qwen2VLProcessor
from auto_gptq import BaseQuantizeConfig
from auto_gptq.modeling import Qwen2VLGPTQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8,  # 4 or 8
    group_size=128,
    damp_percent=0.1,
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
)
# Load your processor and model with AutoGPTQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config, attn_implementation="flash_attention_2")
model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config)

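Then you need to prepare your data for calibration. All you need to do is put the samples into a list, each of which is a typical chat message as shown below. You can specify text and image in the content field, for example: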

dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]

Here, we use a caption dataset only for demonstration. You should replace it with your own SFT dataset.

def prepare_dataset(n_sample: int = 20) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()

Then process the dataset into tensors:

from qwen_vl_utils import process_vision_info


def batched(iterable, n: int):
    # batched('ABCDEFG', 3) → ABC DEF G
    assert n >= 1, "batch size must be at least one"
    from itertools import islice

    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


batch_size = 1
calib_data = []
for batch in batched(dataset, batch_size):
    text = processor.apply_chat_template(
        batch, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(batch)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    calib_data.append(inputs)

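Then run the calibration process with a single line of code:

# Pass the processed tensors built above rather than the raw chat messages
model.quantize(calib_data, cache_examples_on_gpu=False)

Finally, save the quantized model: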

model.save_quantized(quant_path, use_safetensors=True)
processor.save_pretrained(quant_path)

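You now have your own GPTQ-quantized model ready for deployment. Enjoy!

4. Benchmarks

(1) Performance of Quantized Models

This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2-VL series. Specifically, we report the following metrics:

MMMU_VAL (accuracy)

DocVQA_VAL (accuracy)

MMBench_DEV_EN (accuracy)

MathVista_MINI (accuracy)

We use VLMEvalkit to evaluate all models.

Speed Benchmark

This section reports the speed performance of bf16 models and quantized models (including GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) and the memory footprint (GB) under different context lengths.

The environment for the evaluation with Hugging Face Transformers is: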

NVIDIA A100 80GB

CUDA 11.8

Pytorch 2.2.1+cu118

Flash Attention 2.6.1

Transformers 4.38.2

AutoGPTQ 0.6.0+cu118

AutoAWQ 0.2.5+cu118（autoawq_kernels 0.0.6+cu118）

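Notes:

We use a batch size of 1 and the smallest possible number of GPUs for the evaluation.

We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.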
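The input lengths above look irregular until the 2048 generated tokens are added to each of them. The short calculation below shows the resulting total context lengths; the helper `tokens_per_second` is a hypothetical illustration of how a tokens/s figure is derived, not the benchmark script itself.

```python
input_lengths = [1, 6144, 14336, 30720, 63488, 129024]
generated_tokens = 2048

# Total context at the end of generation: prompt plus generated tokens.
total_lengths = [n + generated_tokens for n in input_lengths]
print(total_lengths)  # [2049, 8192, 16384, 32768, 65536, 131072]


def tokens_per_second(new_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as generated tokens divided by wall-clock generation time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return new_tokens / elapsed_seconds


# Example with a made-up timing: 2048 tokens in 64 seconds.
print(tokens_per_second(generated_tokens, 64.0))  # 32.0
```

Apart from the single-token prompt, every configuration ends at a power-of-two context length, from 8K up to 128K tokens.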

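5. Deployment

We recommend using vLLM for fast Qwen2-VL deployment and inference. You can use this fork (we are working on merging this PR into the vLLM main repository).

Run the command below to start an OpenAI-compatible API service:

Then you can use the chat API as below (via curl or the Python API):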
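The original request example is not reproduced here, so the following is a hedged sketch of what a chat request body for an OpenAI-compatible endpoint typically looks like. It only builds and serializes the JSON payload with the standard library and sends nothing. The function name is hypothetical, and the server address `http://localhost:8000/v1` and model name are borrowed from the function-calling example later in this article; adjust both to your deployment.

```python
import json


def build_chat_request(
    model: str, image_url: str, question: str, max_tokens: int = 128
) -> dict:
    """Build an OpenAI-style chat completion payload with one image and one question."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "max_tokens": max_tokens,
    }


payload = build_chat_request(
    model="Qwen/Qwen2-VL-7B-Instruct",
    image_url="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    question="Describe this image.",
)
body = json.dumps(payload)
# The serialized body would be sent as a POST request to
# http://localhost:8000/v1/chat/completions with Content-Type: application/json.
print(body[:80])
```

Note that the OpenAI-style schema uses `image_url` entries, which differs from the `{"type": "image", "image": ...}` entries used with the Transformers processor.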

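Note: vllm.entrypoints.openai.api_server does not currently support setting min_pixels and max_pixels in messages (we are working on supporting this feature). If you want to limit the resolution, you can set them in the model's preprocessor_config.json: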
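One way to apply that setting is to edit the file programmatically. The sketch below is a minimal example under stated assumptions: the helper name `set_pixel_limits` is hypothetical, and it assumes that `preprocessor_config.json` sits directly in your local model directory and accepts top-level `min_pixels` and `max_pixels` keys, as the note above implies.

```python
import json
from pathlib import Path


def set_pixel_limits(model_dir: str, min_pixels: int, max_pixels: int) -> dict:
    """Write min_pixels/max_pixels into <model_dir>/preprocessor_config.json."""
    if min_pixels > max_pixels:
        raise ValueError("min_pixels must not exceed max_pixels")
    config_path = Path(model_dir) / "preprocessor_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["min_pixels"] = min_pixels
    config["max_pixels"] = max_pixels
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Usage (the path is a placeholder for your local checkpoint directory):
# set_pixel_limits("/path/to/Qwen2-VL-7B-Instruct", 256 * 28 * 28, 1280 * 28 * 28)
```

Other keys already present in the file are left untouched, so the change is limited to the two resolution bounds.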

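You can also use vLLM to run Qwen2-VL inference locally:

6. Training

LLaMA-Factory

Here we provide a script for supervised fine-tuning of Qwen2-VL with LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory. This supervised fine-tuning (SFT) script has the following features:
>> Support for multi-image input;
>> Support for single-GPU and multi-GPU training;
>> Support for full-parameter tuning and LoRA.

The details of using the script are as follows.

Installation

Before you start, make sure you have installed the following packages:

Follow the instructions of LLaMA-Factory https://github.com/hiyouga/LLaMA-Factory to build the environment.

Install these packages (optional):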

pip install deepspeed

pip install flash-attn --no-build-isolation

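If you want to use FlashAttention-2 https://github.com/Dao-AILab/flash-attention, make sure your CUDA version is 11.6 or above.

Data Preparation

LLaMA-Factory provides several training datasets in its data folder, which you can use directly. If you are using a custom dataset, prepare it as follows.

Organize your data in a JSON file and put it in the data folder. LLaMA-Factory supports multimodal datasets in the ShareGPT format. A dataset in the ShareGPT format should follow the format below:

Provide your dataset definition in data/dataset_info.json in the following format. For a ShareGPT-format dataset, the columns in dataset_info.json should be: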
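Since the format listings are missing here, the sketch below shows one plausible shape for a ShareGPT-style multimodal record and its dataset_info.json entry. Every field name in it (`messages`, `images`, `formatting`, `columns`, `tags`, the `<image>` placeholder) and the dataset name `my_qwen2vl_dataset` are assumptions recalled from LLaMA-Factory's data documentation, not taken from this article, so verify them against the LLaMA-Factory repository. The check function only illustrates the idea that the number of image placeholders should match the number of image paths.

```python
import json

# Hypothetical training record: one <image> placeholder per entry in "images".
sample = {
    "messages": [
        {"role": "user", "content": "<image>What is in this picture?"},
        {"role": "assistant", "content": "A placeholder answer."},
    ],
    "images": ["path/to/your/image.jpg"],
}

# Hypothetical dataset_info.json entry describing the file that holds such records.
dataset_info_entry = {
    "my_qwen2vl_dataset": {
        "file_name": "my_qwen2vl_dataset.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}


def count_image_placeholders(record: dict) -> int:
    """Count <image> placeholders across all message contents of one record."""
    return sum(message["content"].count("<image>") for message in record["messages"])


def is_consistent(record: dict) -> bool:
    """A record is consistent when placeholders and image paths match in number."""
    return count_image_placeholders(record) == len(record.get("images", []))


print(is_consistent(sample))
print(json.dumps(dataset_info_entry, indent=2))
```

Running such a check over a custom dataset before training can catch records whose text and image lists have drifted apart.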

Training

LoRA SFT example:

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml

Full SFT example:

llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml

Inference example:

llamafactory-cli webchat examples/inference/qwen2_vl.yaml
llamafactory-cli api examples/inference/qwen2_vl.yaml

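Run the following training command:

Enjoy the training process. To change your training, you can adjust the hyperparameters by modifying the arguments in the training command. One argument to note is cutoff_len, the maximum length of the training data. Control this parameter to avoid OOM errors.

7. Function Calling

Qwen2-VL supports function calling (also known as tool calling or tool use). For details on how to use this capability, please refer to the function calling example and the agent example in the Qwen-Agent project.

(1) A Simple Use Case: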

# pip install qwen_agent
from typing import List, Union
from datetime import datetime
from qwen_agent.agents import FnCallAgent
from qwen_agent.gui import WebUI
from qwen_agent.tools.base import BaseToolWithFileAccess, register_tool

@register_tool("get_date")
class GetDate(BaseToolWithFileAccess):
    description = "call this tool to get the current date"
    parameters = [
        {
            "name": "lang",
            "type": "string",
            "description": "one of ['en', 'zh'], default is en",
            "required": False
        },
    ]

    def call(self, params: Union[str, dict], files: List[str] = None, **kwargs) -> str:
        super().call(params=params, files=files)
        params = self._verify_json_format_args(params)
        # "lang" is optional (required=False), so fall back to English when it is absent
        lang = "zh" if "zh" in params.get("lang", "en") else "en"
        now = datetime.now()
        result = now.strftime("%Y-%m-%d %H:%M:%S") + "\n"
        weekday = now.weekday()
        if lang == "zh":
            days_chinese = ["一", "二", "三", "四", "五", "六", "日"]
            result += "今天是星期" + days_chinese[weekday]
        else:
            days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
            result += "Today is " + days[weekday]
        return result


def init_agent_service():
    llm_cfg_vl = {
        # Using Qwen2-VL deployed at any openai-compatible service such as vLLM:
        "model_type": "qwenvl_oai",
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "model_server": "http://localhost:8000/v1",  # api_base
        "api_key": "EMPTY",
    }
    tools = [
        "get_date",
        "code_interpreter",
    ]  # code_interpreter is a built-in tool in Qwen-Agent
    bot = FnCallAgent(
        llm=llm_cfg_vl,
        name="Qwen2-VL",
        description="function calling",
        function_list=tools,
    )
    return bot

def app_gui():
    # Define the agent
    bot = init_agent_service()
    WebUI(bot).run()

# Launch gradio app
app_gui()

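8. Demo

Web UI Example

This section provides instructions for building a web-based user interface (UI) demo. The UI demo lets users interact with a predefined model or application through a web browser. Follow the steps below to get started.

Installation

Before you begin, make sure the required dependencies are installed on your system. You can install them by running the following command: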

pip install -r requirements_web_demo.txt

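Running the Demo with FlashAttention-2

Once the required packages are installed, you can launch the web demo with the following command. It starts a web server and gives you a link for accessing the UI in your web browser.

Recommended: for better performance and efficiency, especially in multi-image and video processing scenarios, we strongly recommend using FlashAttention-2. FlashAttention-2 provides significant improvements in memory usage and speed, making it well suited to large-scale models and data processing.

To enable FlashAttention-2, use the following command:

python web_demo_mm.py --flash-attn2

This loads the model with FlashAttention-2 enabled.

Default usage: if you prefer to run the demo without FlashAttention-2, or if you do not specify the --flash-attn2 option, the demo loads the model with the standard attention implementation:

python web_demo_mm.py

After running the command, you will see a link in the terminal similar to this: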

Running on local: http://127.0.0.1:7860/

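Copy this link and paste it into your browser to access the web UI, where you can interact with the model by entering text, uploading images, or using any other provided functionality.

Selecting a Different Model (Qwen2-VL Series Only)

The demo is configured by default to use the Qwen/Qwen2-VL-7B-Instruct model, which is part of the Qwen2-VL series and well suited to a variety of vision-language tasks. If you want to use another model in the Qwen2-VL series, simply update the DEFAULT_CKPT_PATH variable in the script:

Locate the DEFAULT_CKPT_PATH variable: in web_demo_mm.py, find the DEFAULT_CKPT_PATH variable that defines the model checkpoint path. It should look like this: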

DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-7B-Instruct'

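Replace it with a different Qwen2-VL model path: change DEFAULT_CKPT_PATH to point to another checkpoint path in the Qwen2-VL series. For example:

DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-2B-Instruct'  # Example: a different model in the series

Save and re-run: after changing the path, save the script and re-run the demo following the instructions in the "Running the Demo" section above.

Note: DEFAULT_CKPT_PATH only supports models from the Qwen2-VL series. If you use a model outside the Qwen2-VL series, additional changes to the codebase may be necessary.

Customization

The web demo can be further customized by modifying the web_demo_mm.py script, including the UI layout, interactions, and additional functionality such as handling specialized input. This flexibility lets you adapt the web interface to specific tasks or workflows.

9. Docker

To simplify the deployment process, we provide Docker images with pre-built environments: qwenllm/qwenvl. You only need to install the driver and download the model files to launch the demo.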

docker run --gpus all --ipc=host --network=host --rm --name qwen2 -it qwenllm/qwenvl:2-cu121 bash

Example Applications of Qwen2-VL

More examples will be added in future updates.
