[Notes] vLLM Multi-GPU Inference: Pitfalls and Fixes
Table of Contents
- Preface
- Setup
- Problem 1
- Solution
- Problem 2
- Solution
- Problem 3
- Solution
- Problem 4
- Solution
Preface
For personal study and note-keeping only. This post records the problems I ran into while running multi-GPU inference with vLLM.
Setup
vllm version: 0.5.1
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
Problem 1
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
Solution
See the official documentation:
dtype – The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32, we will use float16 instead.
So add the following to the code:
llm = LLM(
    ...
    dtype='float16',
    ...
)
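For context, a minimal end-to-end sketch (the model name and prompt are placeholders, not from the original setup); V100 GPUs have compute capability 7.0, so float16 stands in for bfloat16:

from vllm import LLM, SamplingParams

# Hypothetical model and prompt, for illustration only.
# V100 (compute capability 7.0) does not support bfloat16, so force float16.
llm = LLM(model="facebook/opt-125m", dtype="float16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)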
Problem 2
After fixing Problem 1, setting the tensor_parallel_size argument of vllm.LLM to a value greater than 1 (i.e., enabling multi-GPU inference) makes the program hang and raise a RuntimeError:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
Solution
Following the replies in https://github.com/vllm-project/vllm/issues/6152, set:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
or add the following in your code:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
Problem 3
After fixing Problem 2, the program may still hang and raise a RuntimeError related to the bootstrapping phase of the newly started processes:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
ERROR 11-28 20:12:42 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3449547 died, exit code: 1
INFO 11-28 20:12:42 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Solution
Following https://github.com/vllm-project/vllm/issues/5637 and changing spawn back to fork may not work, because:
It seems some tests will initialize cuda before launching vllm worker, which makes fork not possible.
In that case, try setting the following before any command that might touch CUDA:
VLLM_WORKER_MULTIPROC_METHOD=spawn
or add, at the very top of your script:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
Problem 4
After fixing Problem 3 and rerunning, the log hangs at:
(VllmWorkerProcess pid=166876) INFO 11-25 20:57:27 pynccl.py:63] vLLM is using nccl==2.20.5
Solution
Following https://docs.vllm.ai/en/stable/getting_started/debugging.html, turn on more verbose vLLM logging:
- export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.
- export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.
- export NCCL_DEBUG=TRACE to turn on more logging for NCCL.
- export VLLM_TRACE_FUNCTION=1 to record all function calls for inspection in the log files to tell which function crashes or hangs.
Alternatively, set them in your code:
os.environ["VLLM_LOGGING_LEVEL"]="DEBUG"
os.environ["NCCL_DEBUG"]="TRACE"
os.environ["VLLM_TRACE_FUNCTION"]="1"
This should surface more detailed error information.
If no further errors show up and your symptom matches the one above exactly, the cause may be that P2P (peer-to-peer) communication between the GPUs is not working properly.
As described in https://github.com/NVIDIA/nccl/issues/631, you can try adding:
export NCCL_P2P_DISABLE=1
or, at the very top of your script:
os.environ["NCCL_P2P_DISABLE"]="1"