Scale AI applications with Ray
April 1, 2024, by Vicky Tsang, Logan Grado, Eliot Li
Most machine learning (ML) workloads today require multiple GPUs or nodes to achieve the performance or scale that applications need. However, scaling a workload beyond a single node or a single GPU is hard and requires some expertise in distributed processing.
Ray has developed a platform that lets AI practitioners scale their code from a laptop to a cluster of nodes with just a few lines of code. The platform is designed to support a variety of common ML use cases (described in the next section). Ray is an open-source project under the Apache 2.0 license.
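To give a feel for the API, here is a minimal sketch of Ray Core (our own toy example, not one of this blog's scripts): a single decorator turns an ordinary Python function into a task that Ray can schedule anywhere in the cluster.

```python
import ray

ray.init()  # start a local Ray instance, or connect to an existing cluster


@ray.remote
def square(x: int) -> int:
    # An ordinary function, now schedulable on any worker in the cluster
    return x * x


# Launch four tasks in parallel and gather their results
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```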
AMD has been working with Ray to provide support on the ROCm platform. In this blog, we describe how to use Ray to easily scale your AI applications from your laptop to multiple AMD GPUs. You can find the files related to this blog in the GitHub folder.
Use cases
Ray can be applied to many use cases for scaling ML applications, such as:
- Large language models (LLMs) and generative AI
- Batch inference
- Model serving
- Hyperparameter tuning
- Distributed training
- Reinforcement learning
- ML platforms
- End-to-end ML workflows
- Large-scale workload orchestration

See the Ray documentation for detailed tutorials on these use cases. We will explore a few of the common ones using AMD GPUs and ROCm.
Install Ray with ROCm support
Below are brief instructions on how to install Ray with ROCm support on a single node.
Prerequisites
- An AMD GPU node supported by ROCm; see the supported system requirements.
- A supported Linux distribution; see the supported operating systems.
- ROCm; see the installation instructions.
Setup for running the examples in this blog
To make it easy to run the examples in this blog, we also provide a dockerfile that installs the prerequisites needed to run them. It will:
- Start from a ROCm base image
- Create a Python virtual environment
- Install Ray with ROCm support
- Install other Python dependencies (PyTorch, etc.)
The `dockerfile` and `docker-compose.yaml` files can be found in this GitHub folder. For your convenience, we have also included them in the appendix. Simply create a `docker` folder and add the two files. You can then build and start the Docker container, with this blog's directory mounted under `/root`, using the following commands:
```shell
cd docker
docker compose run ray-blog
```
Tip
Because the blog directory is mounted into the container, any files you create or edit in the blog directory are immediately available inside the container.
Alternatively, you can build your own Python environment (see Running on host in the appendix).
Installing only Ray
If you only want to install `ray` with AMD support and test it with your own code, you can `pip install` the appropriate pre-built wheel below. These pre-built packages have ROCm support added.
pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
Tip
Once Ray publishes a release with official ROCm support, you will be able to follow the standard Ray installation instructions.
Examples
Now that we have a node ready to use Ray to scale our applications, we can illustrate how it works for these use cases:
- Use Ray Train to fine-tune a transformer model
- Convert an LLM model into a Ray Serve application
- Use Ray Serve to serve a Stable Diffusion model
- Use Ray Tune to tune an XGBoost classifier
Use Ray Train to fine-tune a transformer model
Download the transformers_torch_trainer_basic.py script, which uses the Ray Train library to scale up fine-tuning of the BERT base model on the Yelp review dataset from Hugging Face.
You can quickly download the script using `curl`:
```shell
curl https://raw.githubusercontent.com/ROCm/ray/master/python/ray/train/examples/transformers/transformers_torch_trainer_basic.py > transformers_torch_trainer_basic.py
```
Let's fine-tune the model on two GPUs by setting `num_workers=2` in the last part of the transformers_torch_trainer_basic.py script:
```python
# [4] Build a Ray TorchTrainer to launch `train_func` on all workers
# ==================================================================
trainer = TorchTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)
)
trainer.fit()
```
We are now ready to run the script:
```shell
python transformers_torch_trainer_basic.py
```
The output should look like this:
```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-06 23:34:17,106 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-06 23:34:18,177 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-03-06 23:34:18,178 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /root/ray_results/TorchTrainer_2024-03-06_23-34-14
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/TorchTrainer_2024-03-06_23-34-14`

Training started without custom configuration.
(RayTrainWorker pid=75298) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=71127) Started distributed worker processes:
(TorchTrainer pid=71127) - (ip=10.216.70.82, pid=75298) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=71127) - (ip=10.216.70.82, pid=75299) world_rank=1, local_rank=1, node_rank=0
Downloading readme: 100%|██████████| 6.72k/6.72k [00:00<00:00, 39.9MB/s]
Downloading data:   0%|          | 0.00/299M [00:00<?, ?B/s]
Downloading data:   1%|▏         | 4.19M/299M [00:00<00:39, 7.45MB/s]
Downloading data:   4%|▍         | 12.6M/299M [00:00<00:20, 14.0MB/s]
Downloading data:   7%|▋         | 21.0M/299M [00:01<00:14, 19.3MB/s]
...
...
...
(RayTrainWorker pid=75298) {'eval_loss': 0.9538511633872986, 'eval_accuracy': 0.589, 'eval_runtime': 3.1428, 'eval_samples_per_second': 318.185, 'eval_steps_per_second': 20.046, 'epoch': 3.0}
(RayTrainWorker pid=75298) {'train_runtime': 39.3335, 'train_samples_per_second': 76.271, 'train_steps_per_second': 4.805, 'train_loss': 1.1883205837673612, 'epoch': 3.0}
100%|██████████| 189/189 [00:39<00:00,  4.81it/s]

Training completed after 0 iterations at 2024-03-06 23:38:20. Total running time: 4min 2s
```
With two GPUs, the fine-tuning takes about 4 minutes. Let's try running the same job with four GPUs (assuming your system has at least four GPUs). Do this by changing `num_workers` from `2` to `4`:
```python
# [4] Build a Ray TorchTrainer to launch `train_func` on all workers
# ==================================================================
trainer = TorchTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
trainer.fit()
```
Running the script again should produce the following output:
```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-06 23:49:10,338 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-06 23:49:11,468 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-03-06 23:49:11,469 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /root/ray_results/TorchTrainer_2024-03-06_23-49-08
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/TorchTrainer_2024-03-06_23-49-08`

Training started without custom configuration.
(RayTrainWorker pid=83857) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=83721) Started distributed worker processes:
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83857) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83858) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83859) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83860) world_rank=3, local_rank=3, node_rank=0
Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
Map:   2%|▏         | 1000/50000 [00:00<00:13, 3510.56 examples/s]
Map:   0%|          | 0/50000 [00:00<?, ? examples/s] [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Map:  40%|████      | 20000/50000 [00:05<00:07, 3945.42 examples/s] [repeated 75x across cluster]
Map:  80%|████████  | 40000/50000 [00:10<00:02, 4170.23 examples/s] [repeated 82x across cluster]
Map:  92%|█████████▏| 46000/50000 [00:11<00:00, 4377.90 examples/s]
Map: 100%|██████████| 50000/50000 [00:12<00:00, 3899.61 examples/s]
...
...
...
(RayTrainWorker pid=83857) 01<00:00, 20.58it/s]
100%|██████████| 96/96 [00:20<00:00,  4.59it/s]
(RayTrainWorker pid=83857) {'eval_loss': 0.9640204906463623, 'eval_accuracy': 0.578, 'eval_runtime': 1.7283, 'eval_samples_per_second': 578.6, 'eval_steps_per_second': 18.515, 'epoch': 3.0}
(RayTrainWorker pid=83857) {'train_runtime': 20.9123, 'train_samples_per_second': 143.456, 'train_steps_per_second': 4.591, 'train_loss': 1.1330984433492024, 'epoch': 3.0}

Training completed after 0 iterations at 2024-03-06 23:49:59. Total running time: 47s
```
With the two extra GPUs, the same job took 47 seconds. With Ray, scaling the resources for a job is as easy as changing a single parameter in your code.
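Because the worker count is just a Python value, you could go a step further and expose it as a command-line flag. The following is a small illustrative sketch, not part of the original script; `train_func` here is a stub standing in for the training loop defined in transformers_torch_trainer_basic.py:

```python
# Hypothetical wrapper: choose the number of GPU workers at launch time.
import argparse

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Placeholder for the training loop from transformers_torch_trainer_basic.py
    pass


parser = argparse.ArgumentParser()
parser.add_argument("--num-workers", type=int, default=2)
args = parser.parse_args()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=args.num_workers, use_gpu=True),
)
trainer.fit()
```

With a wrapper like this, the two runs above become `--num-workers 2` and `--num-workers 4` on the same script.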
Convert an LLM model into a Ray Serve application
We can develop a Ray Serve application locally and deploy it on a production cluster (with AMD GPUs) in just a few lines of code. See the official Ray documentation page for detailed instructions.
As an example, we deploy an ML application that translates English to French. First, we create the Python script RayServe_En2Fr_translation_local.py, based on the model.py script from the Ray documentation page, to translate English text into French.
```python
# File name: RayServe_En2Fr_translation_local.py
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process the output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator = Translator()

translation = translator.translate("Hello world!")
print(translation)
```
We can test the script locally by running:
```shell
python RayServe_En2Fr_translation_local.py
```
```text
config.json: 100%|██████████| 1.21k/1.21k [00:00<00:00, 612kB/s]
model.safetensors: 100%|██████████| 242M/242M [00:02<00:00, 115MB/s]
generation_config.json: 100%|██████████| 147/147 [00:00<00:00, 93.6kB/s]
tokenizer_config.json: 100%|██████████| 2.32k/2.32k [00:00<00:00, 2.76MB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 11.7MB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:00<00:00, 5.76MB/s]
Bonjour monde!
```
Next, we convert this script into RayServe_En2Fr_translation.py, which supports a Ray Serve application using FastAPI, based on the instructions on the Ray documentation page.
```python
# File name: RayServe_En2Fr_translation.py
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
@serve.ingress(app)
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    @app.post("/")
    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process the output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator_app = Translator.bind()
```
We launch the `translator_app` application in the background to serve the LLM model that translates English to French. We run the script with the `serve run` CLI command, which takes an import path in the format `<module>:<application>`.
Run the command in the directory containing the RayServe_En2Fr_translation.py script so that it can import the application:
```shell
serve run RayServe_En2Fr_translation:translator_app &
```
The expected output is:
```text
2024-02-01 16:30:33,699 INFO scripts.py:413 -- Running import path: 'RayServe_En2Fr_translation:translator_app'.
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-02-01 16:30:37,719 INFO worker.py:1753 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
(ProxyActor pid=333236) INFO 2024-02-01 16:30:41,275 proxy 10.216.70.84 proxy.py:1145 - Proxy actor d37bfb4a9e388b83347f768101000000 starting on node a5c7469f3d4184bfb971897878429cc42cfd73d4fb484fab478f6215.
(ProxyActor pid=333236) INFO 2024-02-01 16:30:41,278 proxy 10.216.70.84 proxy.py:1357 - Starting HTTP server on node: a5c7469f3d4184bfb971897878429cc42cfd73d4fb484fab478f6215 listening on port 8000
(ProxyActor pid=333236) INFO:     Started server process [333236]
(ServeController pid=333144) INFO 2024-02-01 16:30:41,379 controller 333144 deployment_state.py:1580 - Deploying new version of deployment Translator in application 'default'. Setting initial target number of replicas to 2.
(ServeController pid=333144) INFO 2024-02-01 16:30:41,481 controller 333144 deployment_state.py:1865 - Adding 2 replicas to deployment Translator in application 'default'.
2024-02-01 16:30:45,323 SUCC scripts.py:457 -- Deployed app successfully.
```
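If you prefer to stay in Python rather than use the CLI, Ray Serve also has a programmatic entry point. Here is a minimal sketch, assuming the RayServe_En2Fr_translation.py module from above is importable:

```python
# Deploy the same application from Python instead of the `serve run` CLI.
from ray import serve

from RayServe_En2Fr_translation import translator_app

serve.run(translator_app)  # deploys the app; HTTP is served on port 8000
```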
Once the server is up on our cluster, we can test the application locally with the RayServe_En2Fr_translation_client.py script. This is the model_client.py script from the Ray documentation page, renamed to RayServe_En2Fr_translation_client.py. It sends a `POST` request (in JSON format) containing the English text.
```python
# File name: RayServe_En2Fr_translation_client.py
import requests

response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"})
french_text = response.json()

print(french_text)
```
This client script requests a translation of the phrase "Hello world!":
```shell
python RayServe_En2Fr_translation_client.py
```
The expected output is:
```text
Bonjour monde!
(ServeReplica:default:Translator pid=333328) INFO 2024-02-01 16:38:02,251 default_Translator hxagjhct 6625dbbe-cbba-40cd-ba86-5cbff9cb6aa2 / replica.py:380 - __CALL__ OK 192.1ms
```
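Because the deployment was declared with `num_replicas=2`, two copies of the model can serve traffic at once. As a quick sketch (a hypothetical load test against the same endpoint), we can send several requests in parallel and let Serve spread them across the replicas:

```python
# Send several translation requests concurrently; Serve load-balances them
# across the two Translator replicas.
from concurrent.futures import ThreadPoolExecutor

import requests


def translate(text: str) -> str:
    response = requests.post("http://127.0.0.1:8000/", params={"text": text})
    return response.json()


phrases = ["Hello world!", "Good morning.", "See you tomorrow."]
with ThreadPoolExecutor(max_workers=3) as pool:
    for translation in pool.map(translate, phrases):
        print(translation)
```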
Use Ray Serve to serve a Stable Diffusion model
Stable Diffusion is one of the most popular image-generation models. It takes a text prompt and generates an image matching the meaning of the prompt.
In this example, we use Ray to launch a server for the stabilityai/stable-diffusion-2-1-base model, with an API powered by FastAPI.
To run this example, install the following:
```shell
pip install requests diffusers==0.12.1 transformers
```
Create a Python script with the `Serve` code and save it as RayServe_StableDiffusion.py, following the Ray documentation.
```python
# File name: RayServe_StableDiffusion.py
from io import BytesIO

from fastapi import FastAPI
from fastapi.responses import Response
import torch

from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, diffusion_model_handle: DeploymentHandle) -> None:
        self.handle = diffusion_model_handle

    @app.get(
        "/imagine",
        responses={200: {"content": {"image/png": {}}}},
        response_class=Response,
    )
    async def generate(self, prompt: str, img_size: int = 512):
        assert len(prompt), "prompt parameter cannot be empty"

        image = await self.handle.generate.remote(prompt, img_size=img_size)
        file_stream = BytesIO()
        image.save(file_stream, "PNG")
        return Response(content=file_stream.getvalue(), media_type="image/png")


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 0, "max_replicas": 2},
)
class StableDiffusionV2:
    def __init__(self):
        from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

        model_id = "stabilityai/stable-diffusion-2"

        scheduler = EulerDiscreteScheduler.from_pretrained(
            model_id, subfolder="scheduler"
        )
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
        )
        self.pipe = self.pipe.to("cuda")

    def generate(self, prompt: str, img_size: int = 512):
        assert len(prompt), "prompt parameter cannot be empty"

        with torch.autocast("cuda"):
            image = self.pipe(prompt, height=img_size, width=img_size).images[0]
            return image


entrypoint = APIIngress.bind(StableDiffusionV2.bind())
```
Start the Serve application with the following command:
```shell
serve run RayServe_StableDiffusion:entrypoint &
```
The expected output is:
```text
2024-03-12 23:00:21,290 INFO scripts.py:502 -- Running import path: 'RayServe_StableDiffusion:entrypoint'.
2024-03-12 23:00:22,443 WARNING api.py:424 -- The default value for `max_ongoing_requests` is currently 100, but will change to 5 in the next upcoming release.
2024-03-12 23:00:22,453 WARNING api.py:364 -- The default value for `target_ongoing_requests` is currently 1.0, but will change to 2.0 in an upcoming release.
2024-03-12 23:00:22,453 WARNING api.py:424 -- The default value for `max_ongoing_requests` is currently 100, but will change to 5 in the next upcoming release.
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-12 23:00:24,606 INFO worker.py:1743 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=121115) INFO 2024-03-12 23:00:28,216 proxy 10.216.70.82 proxy.py:1160 - Proxy starting on node 724ae23b5a7bfda10da26d36de8efc5af99cd7d7b1ddbb4379810ff5 (HTTP port: 8000).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,316 controller 121023 deployment_state.py:1581 - Deploying new version of Deployment(name='StableDiffusionV2', app='default') (initial target replicas: 0).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,317 controller 121023 deployment_state.py:1581 - Deploying new version of Deployment(name='APIIngress', app='default') (initial target replicas: 1).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,419 controller 121023 deployment_state.py:1883 - Adding 1 replica to Deployment(name='APIIngress', app='default').
2024-03-12 23:00:30,262 INFO api.py:601 -- Deployed app 'default' successfully.
```
We can now send requests to the server through the API. Create the script RayServe_StableDiffusion_client.py using the client code from the Ray documentation.
```python
# File name: RayServe_StableDiffusion_client.py
import requests

prompt = "a cute cat is dancing on the grass."
input = "%20".join(prompt.split(" "))

resp = requests.get(f"http://127.0.0.1:8000/imagine?prompt={input}")
with open("output.png", "wb") as f:
    f.write(resp.content)
```
Run the RayServe_StableDiffusion_client.py script to send this application a request with the prompt "a cute cat is dancing on the grass.":
```shell
python RayServe_StableDiffusion_client.py
```
The generated image is saved as output.png.
The expected output is shown below. Note that, because the deployment sets `min_replicas: 0`, the autoscaler first has to add a StableDiffusionV2 replica to handle the incoming request:
```text
(ServeController pid=108630) INFO 2024-03-12 22:58:24,938 controller 108630 deployment_state.py:1648 - Autoscaling Deployment(name='StableDiffusionV2', app='default') to 1 replicas. Current num requests: 1, current num running replicas: 0.
(ServeController pid=108630) INFO 2024-03-12 22:58:24,939 controller 108630 deployment_state.py:1883 - Adding 1 replica to Deployment(name='StableDiffusionV2', app='default').
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 158774.91it/s]
(ServeReplica:default:StableDiffusionV2 pid=113302) Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
(ServeReplica:default:StableDiffusionV2 pid=113302) ```
(ServeReplica:default:StableDiffusionV2 pid=113302) pip install accelerate
(ServeReplica:default:StableDiffusionV2 pid=113302) ```
(ServeReplica:default:StableDiffusionV2 pid=113302) .
(ServeReplica:default:StableDiffusionV2 pid=113302) /opt/conda/envs/ray_py3.8/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
(ServeReplica:default:StableDiffusionV2 pid=113302) return self.fget.__get__(instance, owner)()
  0%|          | 0/50 [00:00<?, ?it/s]
  2%|▏         | 1/50 [00:00<00:13,  3.54it/s]
...
 98%|█████████▊| 49/50 [00:04<00:00, 14.81it/s]
100%|██████████| 50/50 [00:04<00:00, 10.03it/s]
(ServeReplica:default:StableDiffusionV2 pid=113302) INFO 2024-03-12 22:58:41,668 default_StableDiffusionV2 jmk29epw 6178f374-a0db-4f82-890e-d97ac108a456 /imagine replica.py:366 - GENERATE OK 5452.7ms
(ServeReplica:default:APIIngress pid=108814) INFO 2024-03-12 22:58:41,775 default_APIIngress aj3gvtvh 6178f374-a0db-4f82-890e-d97ac108a456 /imagine replica.py:366 - __CALL__ OK 16842.9ms
```
Here is the generated image:
Use Ray Tune to tune an XGBoost classifier
In this section, we use XGBoost on Ray to train a classifier. XGBoost is an optimized library for distributed gradient boosting. It has become the leading ML library for solving regression and classification problems. For a deeper dive into how gradient boosting works, we recommend reading Introduction to Boosted Trees.
In this example, the xgboost_example.py script uses an XGBoost classifier to detect breast cancer. Ray Tune samples 10 different hyperparameter settings and trains an XGBoost classifier on each of them. A `TrialScheduler` can stop poorly performing trials early to cut training time, concentrating all resources on the better-performing trials. See the official Ray documentation for details.
You can quickly download the script using `curl`:
```shell
curl https://raw.githubusercontent.com/ROCm/ray/master/python/ray/tune/examples/xgboost_example.py > xgboost_example.py
```
Install `scikit-learn` and `xgboost`, then run the script:
```shell
pip install scikit-learn
pip install xgboost
python xgboost_example.py
```
The output should look like this:

```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-07 00:31:55,362 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-07 00:31:56,477 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-07 00:31:56,478 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

╭────────────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     train_breast_cancer_2024-03-07_00-31-53   │
├────────────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator                     │
│ Scheduler                        AsyncHyperBandScheduler                   │
│ Number of trials                 10                                        │
╰────────────────────────────────────────────────────────────────────────────╯

View detailed results here: /root/ray_results/train_breast_cancer_2024-03-07_00-31-53
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/train_breast_cancer_2024-03-07_00-31-53`

Trial status: 10 PENDING
Current time: 2024-03-07 00:31:57. Total running time: 0s
Logical resource usage: 10.0/128 CPUs, 0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI210)
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                        status      max_depth   min_child_weight   subsample         eta │
├───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_breast_cancer_1a600_00000   PENDING             7                  1    0.98392   0.000278876 │
│ train_breast_cancer_1a600_00001   PENDING             6                  1   0.909173    0.00555103 │
│ train_breast_cancer_1a600_00002   PENDING             4                  2    0.78322     0.0216884 │
│ train_breast_cancer_1a600_00003   PENDING             4                  2   0.528893     0.0112016 │
│ train_breast_cancer_1a600_00004   PENDING             7                  2   0.541909     0.0597606 │
│ train_breast_cancer_1a600_00005   PENDING             6                  1   0.938674    0.00347829 │
│ train_breast_cancer_1a600_00006   PENDING             8                  1   0.883378    0.00682739 │
│ train_breast_cancer_1a600_00007   PENDING             1                  2   0.972819     0.0197333 │
│ train_breast_cancer_1a600_00008   PENDING             6                  2   0.576396    0.00416918 │
│ train_breast_cancer_1a600_00009   PENDING             4                  1   0.697624    0.00031904 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

Trial train_breast_cancer_1a600_00000 started with configuration:
╭───────────────────────────────────────────────────────────────────────╮
│ Trial train_breast_cancer_1a600_00000 config                          │
├───────────────────────────────────────────────────────────────────────┤
│ eta                                                           0.00028 │
│ eval_metric                                      ['logloss', 'error'] │
│ max_depth                                                           7 │
│ min_child_weight                                                    1 │
│ objective                                             binary:logistic │
│ subsample                                                     0.98392 │
╰───────────────────────────────────────────────────────────────────────╯
...
...
...

Trial train_breast_cancer_1a600_00004 completed after 10 iterations at 2024-03-07 00:31:59. Total running time: 2s
╭────────────────────────────────────────────────────────────────────╮
│ Trial train_breast_cancer_1a600_00004 result                       │
├────────────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                              checkpoint_000009 │
│ time_this_iter_s                                            0.0017 │
│ time_total_s                                               0.19301 │
│ training_iteration                                              10 │
│ test-error                                                 0.06993 │
│ test-logloss                                                0.3826 │
╰────────────────────────────────────────────────────────────────────╯

Trial status: 10 TERMINATED
Current time: 2024-03-07 00:31:59. Total running time: 2s
Logical resource usage: 1.0/128 CPUs, 0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI210)
Current best trial: 1a600_00004 with test-logloss=0.38260495725211563 and params={'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'max_depth': 7, 'min_child_weight': 2, 'subsample': 0.5419086005804928, 'eta': 0.05976055309805102}
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                        status       max_depth   min_child_weight   subsample           eta   iter   total time (s)   test-logloss   test-error │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_breast_cancer_1a600_00000   TERMINATED           7                  1    0.98392   0.000278876      1         0.158494        0.65572     0.363636 │
│ train_breast_cancer_1a600_00001   TERMINATED           6                  1   0.909173    0.00555103      1         0.164058       0.678485     0.412587 │
│ train_breast_cancer_1a600_00002   TERMINATED           4                  2    0.78322     0.0216884      1         0.141597       0.629168     0.335664 │
│ train_breast_cancer_1a600_00003   TERMINATED           4                  2   0.528893     0.0112016     10         0.141248       0.560342     0.300699 │
│ train_breast_cancer_1a600_00004   TERMINATED           7                  2   0.541909     0.0597606     10         0.193006       0.382605    0.0699301 │
│ train_breast_cancer_1a600_00005   TERMINATED           6                  1   0.938674    0.00347829      1         0.175483       0.703689     0.447552 │
│ train_breast_cancer_1a600_00006   TERMINATED           8                  1   0.883378    0.00682739      2         0.163225       0.639305      0.34965 │
│ train_breast_cancer_1a600_00007   TERMINATED           1                  2   0.972819     0.0197333      1         0.167176       0.678488     0.426573 │
│ train_breast_cancer_1a600_00008   TERMINATED           6                  2   0.576396    0.00416918      1        0.0118954       0.652763     0.363636 │
│ train_breast_cancer_1a600_00009   TERMINATED           4                  1   0.697624    0.00031904      1        0.0119591       0.691917     0.426573 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Best model parameters: {'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'max_depth': 7, 'min_child_weight': 2, 'subsample': 0.5419086005804928, 'eta': 0.05976055309805102}
Best model total accuracy: 0.9301
(train_breast_cancer pid=100759) [00:31:59] WARNING: /xgboost/src/c_api/c_api.cc:1348: Saving model in the UBJSON format as default. You can use file extension: `json`, `ubj` or `deprecated` to choose between formats. [repeated 27x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(train_breast_cancer pid=100759) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/train_breast_cancer_2024-03-07_00-31-53/train_breast_cancer_1a600_00006_6_eta=0.0068,max_depth=8,min_child_weight=1,subsample=0.8834_2024-03-07_00-31-57/checkpoint_000001) [repeated 27x across cluster]
```
Note that eight of the trials were stopped after only a few iterations instead of running all 10. Only the two best-performing trials completed all 10 iterations.
Summary
In this blog, we described how to use Ray to scale AI applications across multiple AMD GPUs. In future blogs, we will explore how to use Ray to scale AI applications to multi-node clusters.
Appendix
Docker files
docker-compose.yaml
```yaml
version: "3.7"

services:
  ray-blog:
    build:
      context: ..
      dockerfile: ./docker/dockerfile
    volumes:
      - ..:/root/
    devices:
      - /dev/kfd
      - /dev/dri
    command: /bin/bash
```
dockerfile
```dockerfile
FROM rocm/dev-ubuntu-22.04:5.7-complete

ARG PY_VERSION=3.8
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y python${PY_VERSION} python${PY_VERSION}-venv git

# Prepare the Python virtual environment and add it to the path
ENV PATH=/ray_venv/bin:$PATH
RUN python${PY_VERSION} -m venv ray_venv && python -m pip install --upgrade pip wheel

# Install Ray
RUN pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Install torch
RUN --mount=type=cache,target=/root/.cache pip3 install torch==2.0.1 torchvision==0.15.2 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-5.7/

# Install additional dependencies
RUN pip3 install evaluate==0.4.1 \
    transformers==4.39.3 \
    accelerate==0.28.0 \
    scikit-learn==1.3.2 \
    requests==2.31.0 \
    diffusers==0.12.1

# Build XGBoost
RUN git clone --depth=1 --recurse-submodules https://github.com/ROCmSoftwarePlatform/xgboost xgboost \
    && cd xgboost \
    && mkdir build && cd build \
    && export GFXARCH="$(rocm_agent_enumerator | tail -1)" \
    && export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/opt/rocm/lib/cmake:/opt/rocm/lib/cmake/AMDDeviceLibs/ \
    && cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=${GFXARCH} -DUSE_RCCL=1 ../ \
    && make -j \
    && pip3 install ../python-package/

WORKDIR /root
```
Running on host
If you prefer not to use Docker, you can also run this blog directly on your machine, although it takes a bit more work.
- Prerequisites:

  - Install ROCm 5.7.x
  - Ensure Python 3.8 is installed

- Create and activate a Python virtual environment:

  ```shell
  python3.8 -m venv venv
  source ./venv/bin/activate
  ```

- Install the Ray wheel:

  ```shell
  pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
  ```

- Install dependencies:

  ```shell
  # Install torch
  pip3 install torch==2.0.1 torchvision==0.15.2 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-5.7/

  # Install additional dependencies
  pip3 install evaluate==0.4.1 \
      transformers==4.39.3 \
      accelerate==0.28.0 \
      scikit-learn==1.3.2 \
      requests==2.31.0 \
      diffusers==0.12.1
  ```

- Install XGBoost for ROCm (it must be built from source):

  ```shell
  cd $HOME
  git clone --depth=1 --recurse-submodules https://github.com/ROCmSoftwarePlatform/xgboost
  cd xgboost
  mkdir build && cd build
  export GFXARCH="$(rocm_agent_enumerator | tail -1)"
  export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/opt/rocm/lib/cmake:/opt/rocm/lib/cmake/AMDDeviceLibs/
  cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=${GFXARCH} -DUSE_RCCL=1 ../
  make -j
  pip3 install ../python-package/
  ```