In Depth: Multi-Node Deployment of the DeepSeek-R1 671B Model with AIBrix

article 2025/3/19 12:00:39

Original article: https://aibrix.github.io/posts/2025-03-10-deepseek-r1/

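This post walks through deploying DeepSeek-R1 671B across multiple nodes with AIBrix, a distributed inference platform. DeepSeek-R1 shows strong logical reasoning thanks to its progressive training framework: of its 671 billion total parameters, 37 billion are dynamically activated, and it supports a 128k context window, which makes it effective on complex tasks. A model of this size, however, is demanding to deploy, particularly when it comes to resource scheduling and performance tuning for distributed inference. AIBrix's in-house container orchestration provides:

Intelligent allocation of GPU resources across nodes
Seamless management of distributed inference services
High-performance network communication over RDMA
Automated elastic scaling policies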

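With AIBrix, a very-large-model deployment that previously took weeks can be completed in hours. The sections below cover cluster configuration, image customization, storage options, and the other key steps of the full implementation path.

[Figure: DeepSeek-R1 benchmark results]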

Reference: https://huggingface.co/deepseek-ai/DeepSeek-R1/resolve/main/figures/benchmark.jpg

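Prerequisites

Before deploying DeepSeek-R1 on AIBrix, complete the following preparation: download the model weights to object storage or a shared file system, and set up a customized container image. This post focuses on the key steps; for more detail, see our code samples and tutorial: https://github.com/vllm-project/aibrix/tree/main/samples/deepseek-r1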

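Cluster Configuration

DeepSeek-R1 671B requires 16 GPUs with 80 GB of memory each. The instance specification used in our tests is listed below; choose a comparable configuration for your own environment.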

Cloud: Volcano Engine
Instance: ecs.ebmhpcpni3l.48xlarge * 2
CPU: 192 vCPU
Memory: 2048 GiB DRAM
GPU: NVIDIA H20-SXM5-96GB * 8
Network: 400 Gbps * 8 RDMA + 96 Gbps
Disk: local NVMe 3576 GiB * 4
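
A rough back-of-the-envelope check helps explain the 16-GPU requirement. The sketch below assumes about one byte per parameter (FP8 weights), which is an assumption not stated in this article, and it ignores KV cache, activations, and runtime overhead, so the figures are indicative only. The helper names are illustrative.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def aggregate_gpu_memory_gb(nodes: int, gpus_per_node: int, gpu_mem_gb: float) -> float:
    """Total GPU memory across the cluster in GB."""
    return nodes * gpus_per_node * gpu_mem_gb


TOTAL_PARAMS = 671e9      # total parameter count from the article
BYTES_PER_PARAM = 1.0     # ASSUMPTION: FP8 weights, roughly 1 byte per parameter

weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
one_node_80 = aggregate_gpu_memory_gb(nodes=1, gpus_per_node=8, gpu_mem_gb=80)
two_nodes_80 = aggregate_gpu_memory_gb(nodes=2, gpus_per_node=8, gpu_mem_gb=80)
two_nodes_96 = aggregate_gpu_memory_gb(nodes=2, gpus_per_node=8, gpu_mem_gb=96)

print(f"weights:            ~{weights_gb:.0f} GB")
print(f"1 node  x 8 x 80GB:  {one_node_80:.0f} GB")
print(f"2 nodes x 8 x 80GB:  {two_nodes_80:.0f} GB")
print(f"2 nodes x 8 x 96GB:  {two_nodes_96:.0f} GB (the test cluster above)")
print(f"headroom on the test cluster: ~{two_nodes_96 - weights_gb:.0f} GB")
```

Under these assumptions a single node of eight 80 GB GPUs cannot hold the weights, while two nodes leave room for KV cache and activations.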

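vLLM Image

This deployment uses the image aibrix/vllm-openai:v0.7.3.self.post1, a custom build maintained by AIBrix. There are two main reasons for choosing it:

Upstream v0.7.3 has an issue related to an older NCCL version (https://aibrix.github.io/posts/2025-03-10-deepseek-r1/) that occasionally causes the system to hang. We upgrade to nvidia-nccl-cu12==2.25.1 to make communication more stable.
v0.7.3 also introduced a Ray-related regression in which our earlier change to vLLM was overwritten. To mitigate it, we reinstall ray[default,adag], which gives better probe support for high availability and failure detection.

If you want to build the image yourself, you can use the following Dockerfile.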

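# Start from the upstream vLLM image
FROM vllm/vllm-openai:v0.7.3
# Reinstall Ray with the default and adag extras (quoted so the shell does not glob the brackets)
RUN pip3 install -U "ray[default,adag]==2.40.0"
# Upgrade NCCL to the version named above
RUN pip3 install -U nvidia-nccl-cu12==2.25.1
ENTRYPOINT [""]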

Note: users in China may need to prefix the image name with aibrix-container-registry-cn-beijing.cr.volces.com/ when pulling from our registry. For example, instead of aibrix/vllm-openai:v0.7.3.self.post1, use aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.7.3.self.post1. The same rule applies to aibrix/runtime:v0.2.1.

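Model Weights

Depending on your cloud provider, you can choose among several storage options for the model weights (https://huggingface.co/deepseek-ai/DeepSeek-R1). We discuss four common scenarios here:

HuggingFace: A Pod can pull the model weights directly from HuggingFace. This is not recommended for DeepSeek R1, however: the tensors vary in size, which leads to a large number of random reads and significantly reduces network and I/O efficiency.
Persistent Volume: Cloud providers such as AWS (with Lustre) or Google Cloud offer persistent disks through their Container Storage Interface (CSI) drivers (https://kubernetes-csi.github.io/docs/). You can mount a PersistentVolumeClaim (PVC, https://kubernetes.io/docs/concepts/storage/persistent-volumes/) on the Pod to access the model weights stored on those disks.
Object storage with the AI Runtime: You can store the model weights in an object storage service such as Amazon S3 or Google Cloud Storage (GCS). In this case, the AIBrix AI Runtime automatically downloads the model to a host volume. This approach benefits from the flexibility and scalability of object storage for large volumes of data.
Local disk: With local disk storage, an extra step is needed to download the model weights onto the disk. We assume a local volume is available and can be mounted on the Pod. This option suits environments where local storage offers a performance advantage, or where there are specific security or latency requirements.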

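High-Performance Networking

To get the best network performance from RDMA, the Pod must be configured to use it. Set the k8s.volcengine.com/pod-networks annotation as shown below, and request vke.volcengine.com/rdma: "8" at the Pod resource level. This is only an example for Volcano Engine; adjust it to match your own cloud environment.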

  k8s.volcengine.com/pod-networks: |
    [
      {
        "cniConf":{
            "name":"rdma"
        }
      },
      ....
      {
        "cniConf":{
            "name":"rdma"
        }
      }
    ]

In addition, the Pod needs the IPC_LOCK capability and shared memory support.

  securityContext:
    capabilities:
      add:
      - IPC_LOCK

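Component Installation

Multi-node deployment requires AIBrix v0.2.1 (https://github.com/vllm-project/aibrix/releases/tag/v0.2.1). Note that AIBrix images are hosted primarily on Docker Hub, so deploying in an environment with restricted Docker Hub access can be difficult. To work around this, see our tutorial on overriding the control-plane images with your own registry so that a customized AIBrix deployment goes smoothly.

Some aspects are specific to the cloud environment, such as ReadWriteMany PersistentVolumes and local disks. We use Volcano Engine (https://www.volcengine.com/) as the example here, but you should check your own cloud infrastructure. We can offer general guidance, but with limited resources we cannot test every cloud platform thoroughly. We welcome pull requests (PRs) from the community that improve support for other cloud platforms.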

kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml

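How AIBrix Supports DeepSeek-R1

AIBrix plays a central role in deploying the DeepSeek-R1 671B model. It provides a platform for distributed orchestration, efficient traffic routing, and intelligent scaling, all of which are needed to handle a model this large and resource-intensive.

Before starting the deployment, we briefly introduce what RayClusterFleet, Gateway-Plugin, and Autoscaler do in this use case.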

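RayClusterFleet manages the orchestration of distributed inference. It provisions Pods and builds a Ray cluster, in which the vLLM server is launched. Each small Ray cluster therefore constitutes one inference replica.

In a multi-node setup, the vLLM HTTP server starts on the head node. The remaining GPU nodes act as workers and run no HTTP service. Accordingly, the AIBrix router sends requests only to the head node, and the autoscaler likewise collects metrics only from the serving Pod. This arrangement lets orchestration, routing, and autoscaling work effectively: by managing a multi-node setup in much the same way as a single-node one, it simplifies the overall deployment of very large models such as DeepSeek-R1.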
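
The head-only routing and metrics collection described here can be pictured as a simple label filter. The sketch below is illustrative only and is not AIBrix code: it assumes Pods are represented as plain dictionaries, that head Pods carry the label ray.io/node-type: head (the label matched by the ServiceMonitor in the monitoring section), and that workers carry ray.io/node-type: worker. select_serving_pods is a hypothetical helper.

```python
from typing import Dict, List

HEAD_LABEL_KEY = "ray.io/node-type"
HEAD_LABEL_VALUE = "head"


def select_serving_pods(pods: List[Dict]) -> List[Dict]:
    """Keep only Pods labeled as Ray head nodes, i.e. the ones running the HTTP server."""
    return [
        pod for pod in pods
        if pod.get("labels", {}).get(HEAD_LABEL_KEY) == HEAD_LABEL_VALUE
    ]


# One inference replica: a head Pod plus a worker Pod (names taken from the
# kubectl output in this article; the worker label value is an assumption).
replica_pods = [
    {
        "name": "deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q",
        "labels": {"ray.io/node-type": "head"},
    },
    {
        "name": "deepseek-r1-671b-7ffb754f75-ggnzf-worker-group-worker-gj456",
        "labels": {"ray.io/node-type": "worker"},
    },
]

serving = select_serving_pods(replica_pods)
print([pod["name"] for pod in serving])
```

Treating the head Pod as the single serving endpoint is what lets the router and autoscaler handle a multi-node replica like a single-node one.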

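Model Deployment

First, make sure you update the network and object storage configuration, for example to use S3 (https://aibrix.readthedocs.io/latest/features/runtime.html#download-from-s3). For DeepSeek-R1, DOWNLOADER_ALLOW_FILE_SUFFIX must be set to json, safetensors, and py.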
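
To make the suffix requirement concrete, the sketch below shows one way a suffix allow-list could filter the files in a model directory. It is not the AIBrix runtime's implementation; the comma-separated format, the example file names, and the helper allowed_files are assumptions for illustration.

```python
import os
from typing import Iterable, List


def allowed_files(names: Iterable[str], allow_suffixes: str) -> List[str]:
    """Return the file names whose extension is in a comma-separated allow-list."""
    suffixes = {s.strip().lower() for s in allow_suffixes.split(",") if s.strip()}
    kept = []
    for name in names:
        # splitext returns (".ext" or ""); drop the leading dot before comparing
        ext = os.path.splitext(name)[1].lstrip(".").lower()
        if ext in suffixes:
            kept.append(name)
    return kept


# Illustrative listing of a model directory (names are examples only)
files = [
    "config.json",
    "model-00001-of-000163.safetensors",
    "modeling_deepseek.py",
    "README.md",
    "figures/benchmark.jpg",
]

selected = allowed_files(files, "json, safetensors, py")
print(selected)
```

With this allow-list, configuration files, weight shards, and Python model code are kept, while documentation and images are skipped.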

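Then run the following commands to deploy the model together with its KV-cache-based autoscaling policy. Deployment time depends on the network speed between the compute nodes and object storage and can take up to 20 minutes.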

kubectl apply -f deepseek-r1-ai-runtime.yaml
kubectl apply -f deepseek-r1-autoscaling.yaml

After a while, you should see running Pods similar to the following.

kubectl get pods
NAME                                                          READY   STATUS              RESTARTS          AGE
deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q                  1/1     Running             0                 25m
deepseek-r1-671b-7ffb754f75-ggnzf-worker-group-worker-gj456   1/1     Running             0                 25m
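
If you want to check this state from a script, the listing above can be parsed as plain text. The following is a minimal sketch that assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...); parse_get_pods and replica_is_up are hypothetical helpers, and the sample text is copied from the output above.

```python
from typing import List, NamedTuple


class PodStatus(NamedTuple):
    name: str
    ready: bool   # True when READY shows n/n
    status: str


def parse_get_pods(output: str) -> List[PodStatus]:
    """Parse the default table printed by `kubectl get pods`."""
    rows = [line.split() for line in output.strip().splitlines()]
    pods = []
    for cols in rows[1:]:  # skip the header row
        if len(cols) < 3:
            continue
        have, want = cols[1].split("/")
        pods.append(PodStatus(name=cols[0], ready=(have == want), status=cols[2]))
    return pods


def replica_is_up(pods: List[PodStatus]) -> bool:
    """A replica counts as up only if every Pod is Running and fully ready."""
    return bool(pods) and all(p.ready and p.status == "Running" for p in pods)


sample = """
NAME                                                          READY   STATUS    RESTARTS   AGE
deepseek-r1-671b-7ffb754f75-ggnzf-head-7xr6q                  1/1     Running   0          25m
deepseek-r1-671b-7ffb754f75-ggnzf-worker-group-worker-gj456   1/1     Running   0          25m
"""

pods = parse_get_pods(sample)
print(replica_is_up(pods))
```

Both the head and the worker Pod need to be ready before the replica can serve requests, which is why the check covers every row.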

Sending Requests

Expose the endpoint with one of the following options, then send a request.

# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Option 2: Dev environment without LoadBalancer support. Use port forwarding instead
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"

curl http://${ENDPOINT}/v1/chat/completions \
    -H "Content-Type: application/json" -H "routing-strategy: least-request" \
    -d '{
        "model": "deepseek-r1-671b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Note: to use the default Kubernetes routing strategy, remove the -H "routing-strategy: least-request" header.
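
For clients that prefer Python over curl, the same request can be built with the standard library. This is a minimal sketch assuming the port-forwarded endpoint from Option 2 (localhost:8888); build_chat_request is a hypothetical helper, and the actual send is left commented out so the snippet runs without a live cluster.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    endpoint: str,
    model: str,
    user_message: str,
    routing_strategy: Optional[str] = "least-request",
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    headers = {"Content-Type": "application/json"}
    if routing_strategy:
        # Omit this header to fall back to the default Kubernetes routing
        headers["routing-strategy"] = routing_strategy
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "localhost:8888", "deepseek-r1-671b", "Who won the world series in 2020?"
)
print(req.get_method(), req.full_url)

# Uncomment to send the request against a running deployment:
# with urllib.request.urlopen(req, timeout=600) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```

Passing routing_strategy=None drops the header, which mirrors the note above about using the default routing.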

You should see a response similar to the following:

{"id":"chatcmpl-d26583d2-96e5-42c4-a322-133c7d0e505d","object":"chat.completion","created":1740967604,"model":"deepseek-r1-671b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user is asking which team won the World Series in 2020. Let me recall, the World Series is the championship series of Major League Baseball (MLB) in the United States. I remember that 2020 was a unique year because of the COVID-19 pandemic, which affected the schedule and format of the season. The season was shortened, and there were some changes to the playoff structure.\n\nI think the Los Angeles Dodgers won the World Series around that time. Let me verify. The 2020 World Series was held at a neutral site, which was Globe Life Field in Arlington, Texas, to minimize travel and reduce the risk of COVID-19 spread. The Dodgers faced the Tampa Bay Rays. The Dodgers were led by players like Mookie Betts, Corey Seager, and Clayton Kershaw. They won the series in six games. The clinching game was Game 6, where the Dodgers beat the Rays 3-1. That victory gave the Dodgers their first title since 1988, ending a long drought.\n\nWait, let me make sure I got the opponent right. Was it the Rays or another team? Yes, I'm pretty confident it was the Rays because earlier in the playoffs, teams like the Braves and Dodgers were in the National League, while the Rays were the American League champions. The Rays had a strong team with players like Randy Arozarena, who had a standout postseason. But the Dodgers ultimately triumphed. So the answer should be the Los Angeles Dodgers. Let me double-check a reliable source if I'm unsure. Confirming now... yes, the Dodgers won the 2020 World Series against the Tampa Bay Rays in six games. So the user needs to know both the winner and maybe a bit of context, like it being in a neutral location. Okay, ready to provide a concise answer with those details.\n</think>\n\nThe Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games. This championship marked the Dodgers' first title since 1988. Notably, the 2020 series was held at Globe Life Field in Arlington, Texas—a neutral site—due to COVID-19 health and safety protocols.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":19,"total_tokens":472,"completion_tokens":453,"prompt_tokens_details":null},"prompt_logprobs":null}
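
In this response reasoning_content is null and the reasoning trace is embedded in content between <think> and </think>, so a client may want to separate the two. The following is a minimal sketch; split_reasoning is a hypothetical helper and assumes at most one <think> block per message.

```python
import re
from typing import Tuple

# DOTALL lets "." match newlines, since the reasoning trace spans several lines
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> Tuple[str, str]:
    """Split a message into (reasoning, answer); reasoning is empty if no <think> block."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think>...</think> block is treated as the answer
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


# Shortened stand-in for the "content" field of the response above
sample = (
    "<think>\nOkay, the user is asking which team won.\n</think>\n\n"
    "The Los Angeles Dodgers won the 2020 World Series."
)

reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the two parts separate makes it easier to show only the final answer to end users while logging the reasoning for debugging.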

Monitoring

Assuming Prometheus (https://prometheus.io/) is already set up in your cluster, you can deploy a ServiceMonitor so that it scrapes metrics from the DeepSeek deployment.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-r1-svc-discover
  namespace: default
  labels:
    volcengine.vmp: "true"
spec:
  endpoints:
  - port: service
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: head

You can use the dashboard we built (https://github.com/vllm-project/aibrix/blob/main/samples/deepseek-r1/static/AIBrix%20Engine%20Dashboard%20(vLLM)-1741078999667.json) to view the model's performance.