vLLM Installation Notes (Including an xformers Pitfall)
DeepSeek has been very popular lately, but the official site is often too overloaded to respond, so I installed the 1.5B DeepSeek model on a V100 GPU to run it locally.
Runtime environment:
CUDA 12.2
Python 3.12
Set up a virtual environment:
conda create -n deepseek python=3.12 -y
conda activate deepseek
Install the dependencies with pip:
pip install modelscope==1.22.3
pip install openai==1.61.0
pip install tqdm==4.67.1
pip install transformers==4.48.2
pip install vllm==0.7.1
While installing vllm, the following error appeared:
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no g++ version bounds defined for CUDA version 12.2
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'xformers._C' extension
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/autograd
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/cpu
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/cuda
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/sequence_parallel_fused
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/sparse24
creating build/temp.linux-x86_64-cpython-312/xformers/csrc/swiglu/cuda
g++ -pthread -B /root/.local/conda/envs/deepseek/compiler_compat -fno-strict-overflow -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/.local/conda/envs/deepseek/include -fPIC -O2 -isystem /root/.local/conda/envs/deepseek/include -fPIC -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/xformers/csrc -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/sputnik -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/include -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/tools/util/include -I/tmp/pip-install-j6b_53gv/xformers_133a815df79f4ec1b130e6a64df22894/third_party/cutlass/examples -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/TH -I/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/.local/conda/envs/deepseek/include/python3.12 -c xformers/csrc/attention/attention.cpp -o build/temp.linux-x86_64-cpython-312/xformers/csrc/attention/attention.o -O3 -std=c++17 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
In file included from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/c10/util/CallOnce.h:8:0,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/ATen/Context.h:22,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/ATen/ATen.h:7,
from /root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from xformers/csrc/attention/attention.cpp:8:
/root/.local/conda/envs/deepseek/lib/python3.12/site-packages/torch/include/c10/util/C++17.h:13:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
#error \
^~~~~
error: command '/usr/bin/g++' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for xformers
Running setup.py clean for xformers
Failed to build xformers
ERROR: Failed to build installable wheels for some pyproject.toml based projects (xformers)
These errors mean the gcc version is too old; a newer gcc/g++ has to be installed inside the conda environment:
conda install gxx_linux-64
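gxx_linux-64 ships a conda-packaged compiler and, when the environment is (re)activated, is expected to export it through the CXX environment variable; the following is a minimal sketch, under that assumption, to confirm the build will pick up a new-enough compiler:

import os
import subprocess

# gxx_linux-64 is assumed to export the conda compiler via $CXX on env activation;
# re-run `conda activate deepseek` after installing it, and fall back to g++ otherwise.
cxx = os.environ.get("CXX", "g++")
version_line = subprocess.run([cxx, "--version"], capture_output=True, text=True).stdout.splitlines()[0]
print(cxx, "->", version_line)  # the xformers build needs GCC 9 or later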
After installing it, running pip install xformers downloads xformers.tar.gz and then builds it with setup.py, but after several attempts it always got stuck at
Building wheels for collected packages: xformers
Building wheel for xformers (setup.py) /
So a prebuilt whl can be installed instead:
Links for xformers
Find a whl of a suitable version from the link above and install it.
But a new problem appeared here: the downloaded whl did not match the system. For example, I downloaded xformers-0.0.28.post3-cp312-cp312-manylinux_2_28_x86_64.whl, but my system does not accept this whl, so the file has to be renamed to xformers-0.0.28.post3-cp312-cp312-manylinux2014_x86_64.whl (the rename only changes the platform tag that pip checks; the wheel contents stay the same).
Then install it with pip install xformers-0.0.28.post3-cp312-cp312-manylinux2014_x86_64.whl.
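The manylinux_2_28 tag is only accepted on systems with glibc 2.28 or newer, which is why pip rejected the original filename. Before renaming a wheel, you can check which platform tags your interpreter actually accepts; a small sketch using the packaging library (install it with pip if it is missing; pip debug --verbose prints a similar list):

# Show which platform tags this interpreter/pip accepts, so you can pick
# (or rename to) a wheel whose tag is actually in the list.
from packaging import tags

accepted = [str(t) for t in tags.sys_tags()]
print("manylinux_2_28 accepted:", any("manylinux_2_28" in t for t in accepted))
print("manylinux2014 accepted: ", any("manylinux2014" in t for t in accepted))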
Finally, install PyTorch:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
This completes the vLLM environment setup.
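Before moving on, it is worth a quick sanity check that the CUDA build of torch can see the V100 and that the renamed xformers wheel imports cleanly; a minimal sketch (python -m xformers.info prints a fuller report):

import torch
import xformers

# torch should report a CUDA build and see the V100
print("torch:", torch.__version__, "| cuda:", torch.version.cuda, "| gpu available:", torch.cuda.is_available())
# xformers should import without trying to rebuild anything
print("xformers:", xformers.__version__)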
Next, download the DeepSeek model. This time I download the smallest one, the 1.5B model, using snapshot_download.
Create a new Python script:
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', local_dir='/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', revision='master')
# local_dir is the local storage directory; it can be written as a relative folder path or as an absolute path
Run the script in the conda environment to download the model. The download speed is decent; it finished in about ten minutes.
After the download completes, the model files can be found in the target directory (they can be listed with the short script below). model.safetensors holds the model weights; the other json files are all configuration files.
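A quick way to inspect the downloaded directory and confirm the files are in place (the path is the one passed to snapshot_download above):

import os

model_dir = "/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
# List the downloaded files: model.safetensors holds the weights,
# the *.json files (config.json, tokenizer config, etc.) are configuration.
for name in sorted(os.listdir(model_dir)):
    size_mb = os.path.getsize(os.path.join(model_dir, name)) / 1e6
    print(f"{name:40s} {size_mb:10.1f} MB")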
Running a large model requires an LLM inference engine. The steps above have already downloaded the model, so create a new .py script here:
# vllm_model.py
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import os
import json

# When letting vLLM download the model automatically, use modelscope; otherwise it downloads from HuggingFace
os.environ['VLLM_USE_MODELSCOPE'] = 'True'


def get_completion(prompts, model, tokenizer=None, max_tokens=8192, temperature=0.6, top_p=0.95, max_model_len=2048):
    # Token ids at which generation stops
    stop_token_ids = [151329, 151336, 151338]
    # Create the sampling parameters: temperature controls the diversity of the generated text, top_p controls nucleus sampling
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
    # Initialize the vLLM inference engine
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True, dtype="float16")  # the V100 does not support bfloat16, so use float16
    outputs = llm.generate(prompts, sampling_params)
    return outputs


if __name__ == "__main__":
    # Initialize the vLLM inference engine
    model = '/root/code/deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'  # local model path
    # model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # or a model name, downloaded automatically
    tokenizer = None
    # Loading a tokenizer and passing it to vLLM is optional.
    # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)

    # A list of prompts can be passed in at once. Per DeepSeek's official advice, each prompt should end with <think>\n;
    # for math problems it is recommended to also include (in Chinese or English):
    # Please reason step by step, and put your final answer within \boxed{}.
    text = ["给我简单介绍一下清华大学<think>\n", ]  # "Give me a brief introduction to Tsinghua University"

    # Optional: build the prompt with a chat template instead.
    # messages = [
    #     {"role": "user", "content": prompt + "<think>\n"}
    # ]
    # text = tokenizer.apply_chat_template(
    #     messages,
    #     tokenize=False,
    #     add_generation_prompt=True
    # )

    # Reasoning needs more output tokens, so max_tokens is set to 8K; per DeepSeek's official advice,
    # temperature should be 0.5-0.7, with 0.6 recommended.
    outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=8192, temperature=0.6, top_p=0.95, max_model_len=2048)

    # The output is a list of RequestOutput objects containing the prompt, the generated text, and other info.
    # Print the output, splitting off the reasoning part at </think>.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        if r"</think>" in generated_text:
            think_content, answer_content = generated_text.split(r"</think>")
        else:
            think_content = ""
            answer_content = generated_text
        print(f"Prompt: {prompt!r}, Think: {think_content!r}, Answer: {answer_content!r}")
This inference script comes from the blogger 歌刎; there are some parameters in it that I have not yet tried tweaking, and everyone is welcome to discuss in the comments.
2025 最新 DeepSeek-R1-Distill-Qwen-7B vLLM 部署全攻略:从环境搭建到性能测试(V100-32GB)-CSDN博客