[Notes] vLLM Multi-GPU Inference: Pitfalls and Fixes
Table of Contents
- Preface
- Setup
- Problem 1
- Solution
- Problem 2
- Solution
- Problem 3
- Solution
- Problem 4
- Solution
Preface
For personal study and note-keeping only. This post records the problems I ran into while running multi-GPU inference with vLLM.
Setup
vllm version: 0.5.1
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB
Problem 1
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
Solution
See the official documentation:
dtype – The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32, we will use float16 instead.
So add the following to the code:
llm = LLM(
    ...
    dtype='float16',
    ...
)
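For context, a minimal end-to-end sketch (the model name and prompt are placeholders, not from the original setup); V100 GPUs have compute capability 7.0, so float16 stands in for bfloat16:

from vllm import LLM, SamplingParams

# Hypothetical model and prompt, for illustration only.
# V100 (compute capability 7.0) does not support bfloat16, so force float16.
llm = LLM(model="facebook/opt-125m", dtype="float16")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)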
Problem 2
After fixing Problem 1, setting the tensor_parallel_size argument of vllm.LLM to a value greater than 1 (i.e., enabling multi-GPU inference) makes the program hang and raise a RuntimeError:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
Solution
Following the replies in https://github.com/vllm-project/vllm/issues/6152, set:
export VLLM_WORKER_MULTIPROC_METHOD=spawn
or add the following in your code:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
Problem 3
After fixing Problem 2, the program may still hang and raise a RuntimeError related to the bootstrapping phase of the newly started processes:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
ERROR 11-28 20:12:42 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3449547 died, exit code: 1
INFO 11-28 20:12:42 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Solution
Following https://github.com/vllm-project/vllm/issues/5637 and changing spawn back to fork may not work, because:
It seems some tests will initialize cuda before launching vllm worker, which makes fork not possible.
In that case, try setting the following before any command that might touch CUDA:
VLLM_WORKER_MULTIPROC_METHOD=spawn
or add, at the very top of your script:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
Problem 4
After fixing Problem 3 and rerunning, the log hangs at:
(VllmWorkerProcess pid=166876) INFO 11-25 20:57:27 pynccl.py:63] vLLM is using nccl==2.20.5
Solution
Following https://docs.vllm.ai/en/stable/getting_started/debugging.html, turn on more verbose vLLM logging:
- export VLLM_LOGGING_LEVEL=DEBUG to turn on more logging.
- export CUDA_LAUNCH_BLOCKING=1 to identify which CUDA kernel is causing the problem.
- export NCCL_DEBUG=TRACE to turn on more logging for NCCL.
- export VLLM_TRACE_FUNCTION=1 to record all function calls for inspection in the log files to tell which function crashes or hangs.
Alternatively, set them in your code:
os.environ["VLLM_LOGGING_LEVEL"]="DEBUG"
os.environ["NCCL_DEBUG"]="TRACE"
os.environ["VLLM_TRACE_FUNCTION"]="1"
This should surface more detailed error information.
If no further errors show up and your symptom matches the one above exactly, the cause may be that P2P (peer-to-peer) communication between the GPUs is not working properly.
As described in https://github.com/NVIDIA/nccl/issues/631, you can try adding:
export NCCL_P2P_DISABLE=1
or, at the very top of your script:
os.environ["NCCL_P2P_DISABLE"]="1"