vLLM Installation Notes (Including an xformers Pitfall)
DeepSeek has been very popular lately, but the official site is often too overloaded to respond, so I installed the 1.5B DeepSeek model on a V100 GPU to run it locally.
Runtime environment:
CUDA 12.2
Python 3.12
Set up a virtual environment:
conda create -n deepseek python=3.12 -y
conda activate deepseek
Install the dependencies with pip:
pip install modelscope==1.22.3
pip install openai==1.61.0
pip install tqdm==4.67.1
pip install transformers==4.48.2
pip install vllm==0.7.1
While installing vllm, the following error appeared:
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no g++ version bounds defined for CUDA version 12.2
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'xformers._C' extension
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/autograd
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/cpu
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/cuda
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/sequence_parallel_fused
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/sparse24
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/swiglu/cuda
g++ -pthread -B /root/.local/conda/envs/deepseek/compiler_compat -fno-strict-overflow -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/.local/conda/envs/deepseek/include -fPIC -O2 -isystem /root/.local/conda/envs/deepseek/include -fPIC -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/xformers/csrc -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/sputnik -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/include -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/tools/util/include -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/examples -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/TH -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/.local/conda/envs/deepseek/include/python3.12 -c xformers/csrc/attention/attention.cpp -o build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/attention.o -O3 -std=c++17 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
In file included from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/c10/util/CallOnce.h:8:0,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/ATen/Context.h:22,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/ATen/ATen.h:7,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from xformers/csrc/attention/attention.cpp:8:
/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/c10/util/C++17.h:13:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
#error \
^~~~~
error: command '/usr/bin/g++' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for xformers
Running setup.py clean for xformers
Failed to build xformers
ERROR: Failed to build installable wheels for some pyproject.toml based projects (xformers)
These errors mean the gcc version is too old; a newer gcc/g++ has to be installed inside the conda environment:
conda install gxx_linux-64
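gxx_linux-64 ships a conda-packaged compiler and, when the environment is (re)activated, is expected to export it through the CXX environment variable; the following is a minimal sketch, under that assumption, to confirm the build will pick up a new-enough compiler:

import os
import subprocess

# gxx_linux-64 is assumed to export the conda compiler via $CXX on env activation;
# re-run `conda activate deepseek` after installing it, and fall back to g++ otherwise.
cxx = os.environ.get("CXX", "g++")
version_line = subprocess.run([cxx, "--version"], capture_output=True, text=True).stdout.splitlines()[0]
print(cxx, "->", version_line)  # the xformers build needs GCC 9 or later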
After installing it, running pip install xformers downloads xformers.tar.gz and then builds it with setup.py, but after several attempts it always got stuck at
Building wheels for collected packages: xformers
Building wheel for xformers (setup.py) /
So a prebuilt whl can be installed instead:
Links for xformers
Find a whl of a suitable version from the link above and install it.
But a new problem appeared here: the downloaded whl did not match the system. For example, I downloaded xformers-0.0.28.post3-cp312-cp312-manylinux_2_28_x86_64.whl, but my system does not accept this whl, so the file has to be renamed to xformers-0.0.28.post3-cp312-cp312-manylinux2014_x86_64.whl (the rename only changes the platform tag that pip checks; the wheel contents stay the same).
Then install it with pip install xformers-0.0.28.post3-cp312-cp312-manylinux2014_x86_64.whl.
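The manylinux_2_28 tag is only accepted on systems with glibc 2.28 or newer, which is why pip rejected the original filename. Before renaming a wheel, you can check which platform tags your interpreter actually accepts; a small sketch using the packaging library (install it with pip if it is missing; pip debug --verbose prints a similar list):

# Show which platform tags this interpreter/pip accepts, so you can pick
# (or rename to) a wheel whose tag is actually in the list.
from packaging import tags

accepted = [str(t) for t in tags.sys_tags()]
print("manylinux_2_28 accepted:", any("manylinux_2_28" in t for t in accepted))
print("manylinux2014 accepted: ", any("manylinux2014" in t for t in accepted))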
Finally, install PyTorch:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
This completes the vLLM environment setup.
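Before moving on, it is worth a quick sanity check that the CUDA build of torch can see the V100 and that the renamed xformers wheel imports cleanly; a minimal sketch (python -m xformers.info prints a fuller report):

import torch
import xformers

# torch should report a CUDA build and see the V100
print("torch:", torch.__version__, "| cuda:", torch.version.cuda, "| gpu available:", torch.cuda.is_available())
# xformers should import without trying to rebuild anything
print("xformers:", xformers.__version__)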
Next, download the DeepSeek model. This time I download the smallest one, the 1.5B model, using snapshot_download.
Create a new Python script:
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', local_dir='/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', revision='master')
# local_dir is the local storage directory; it can be written as a relative folder path or as an absolute path
Run the script in the conda environment to download the model. The download speed is decent; it finished in about ten minutes.
After the download completes, the model files can be found in the target directory (they can be listed with the short script below). model.safetensors holds the model weights; the other json files are all configuration files.
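A quick way to inspect the downloaded directory and confirm the files are in place (the path is the one passed to snapshot_download above):

import os

model_dir = "/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# List the downloaded files: model.safetensors holds the weights,
# the *.json files (config.json, tokenizer config, etc.) are configuration.
for name in sorted(os.listdir(model_dir)):
    size_mb = os.path.getsize(os.path.join(model_dir, name)) / 1e6
    print(f"{name:40s} {size_mb:10.1f} MB")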
Running a large model requires an LLM inference engine. The steps above have already downloaded the model, so create a new .py script here:
# vllm_model.py
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import os
import json

# When letting vLLM download the model automatically, use modelscope; otherwise it downloads from HuggingFace
os.environ['VLLM_USE_MODELSCOPE'] = 'True'


def get_completion(prompts, model, tokenizer=None, max_tokens=8192, temperature=0.6, top_p=0.95, max_model_len=2048):
    # Token ids at which generation stops
    stop_token_ids = [151329, 151336, 151338]
    # Create the sampling parameters: temperature controls the diversity of the generated text, top_p controls nucleus sampling
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
    # Initialize the vLLM inference engine
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True, dtype="float16")  # the V100 does not support bfloat16, so use float16
    outputs = llm.generate(prompts, sampling_params)
    return outputs


if __name__ == "__main__":
    # Initialize the vLLM inference engine
    model = '/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'  # local model path
    # model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # or a model name, downloaded automatically
    tokenizer = None
    # Loading a tokenizer and passing it to vLLM is optional.
    # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)

    # A list of prompts can be passed in at once. Per DeepSeek's official advice, each prompt should end with <think>\n;
    # for math problems it is recommended to also include (in Chinese or English):
    # Please reason step by step, and put your final answer within \boxed{}.
    text = ["给我简单介绍一下清华大学<think>\n", ]  # "Give me a brief introduction to Tsinghua University"

    # Optional: build the prompt with a chat template instead.
    # messages = [
    #     {"role": "user", "content": prompt + "<think>\n"}
    # ]
    # text = tokenizer.apply_chat_template(
    #     messages,
    #     tokenize=False,
    #     add_generation_prompt=True
    # )

    # Reasoning needs more output tokens, so max_tokens is set to 8K; per DeepSeek's official advice,
    # temperature should be 0.5-0.7, with 0.6 recommended.
    outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=8192, temperature=0.6, top_p=0.95, max_model_len=2048)

    # The output is a list of RequestOutput objects containing the prompt, the generated text, and other info.
    # Print the output, splitting off the reasoning part at </think>.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        if r"</think>" in generated_text:
            think_content, answer_content = generated_text.split(r"</think>")
        else:
            think_content = ""
            answer_content = generated_text
        print(f"Prompt: {prompt!r}, Think: {think_content!r}, Answer: {answer_content!r}")
This inference script comes from the blogger 歌刎; there are some parameters in it that I have not yet tried tweaking, and everyone is welcome to discuss in the comments.
2025 最新 DeepSeek-R1-Distill-Qwen-7B vLLM 部署全攻略:从环境搭建到性能测试(V100-32GB)-CSDN博客