Scale AI applications with Ray
April 1, 2024, by Vicky Tsang, Logan Grado, Eliot Li
Most machine learning (ML) workloads today require multiple GPUs or nodes to achieve the performance or scale that applications need. However, scaling a workload beyond a single node or a single GPU is hard and requires some expertise in distributed processing.
Ray has developed a platform that lets AI practitioners scale their code from a laptop to a cluster of nodes with just a few lines of code. The platform is designed to support a variety of common ML use cases (described in the next section). Ray is an open-source project under the Apache 2.0 license.
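To give a feel for the API, here is a minimal sketch of Ray Core (our own toy example, not one of this blog's scripts): a single decorator turns an ordinary Python function into a task that Ray can schedule anywhere in the cluster.

```python
import ray

ray.init()  # start a local Ray instance, or connect to an existing cluster


@ray.remote
def square(x: int) -> int:
    # An ordinary function, now schedulable on any worker in the cluster
    return x * x


# Launch four tasks in parallel and gather their results
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```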
AMD has been working with Ray to provide support on the ROCm platform. In this blog, we describe how to use Ray to easily scale your AI applications from your laptop to multiple AMD GPUs. You can find the files related to this blog in the GitHub folder.
Use cases
Ray can be applied to many use cases for scaling ML applications, such as:
- Large language models (LLMs) and generative AI
- Batch inference
- Model serving
- Hyperparameter tuning
- Distributed training
- Reinforcement learning
- ML platforms
- End-to-end ML workflows
- Large-scale workload orchestration

See the Ray documentation for detailed tutorials on these use cases. We will explore a few of the common ones using AMD GPUs and ROCm.
Install Ray with ROCm support
Below are brief instructions on how to install Ray with ROCm support on a single node.
Prerequisites
- An AMD GPU node supported by ROCm; see the supported system requirements.
- A supported Linux distribution; see the supported operating systems.
- ROCm; see the installation instructions.
Setup for running the examples in this blog
To make it easy to run the examples in this blog, we also provide a dockerfile that installs the prerequisites needed to run them. It will:
- Start from a ROCm base image
- Create a Python virtual environment
- Install Ray with ROCm support
- Install other Python dependencies (PyTorch, etc.)
The `dockerfile` and `docker-compose.yaml` files can be found in this GitHub folder. For your convenience, we have also included them in the appendix. Simply create a `docker` folder and add the two files. You can then build and start the Docker container, with this blog's directory mounted under `/root`, using the following commands:
```shell
cd docker
docker compose run ray-blog
```
Tip
Because the blog directory is mounted into the container, any files you create or edit in the blog directory are immediately available inside the container.
Alternatively, you can build your own Python environment (see Running on host in the appendix).
Installing only Ray
If you only want to install `ray` with AMD support and test it with your own code, you can `pip install` the appropriate pre-built wheel below. These pre-built packages have ROCm support added.
pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
Tip
Once Ray publishes a release with official ROCm support, you will be able to follow the standard Ray installation instructions.
Examples
Now that we have a node ready to use Ray to scale our applications, we can illustrate how it works for these use cases:
- Use Ray Train to fine-tune a transformer model
- Convert an LLM model into a Ray Serve application
- Use Ray Serve to serve a Stable Diffusion model
- Use Ray Tune to tune an XGBoost classifier
Use Ray Train to fine-tune a transformer model
Download the transformers_torch_trainer_basic.py script, which uses the Ray Train library to scale up fine-tuning of the BERT base model on the Yelp review dataset from Hugging Face.
You can quickly download the script using `curl`:
```shell
curl https://raw.githubusercontent.com/ROCm/ray/master/python/ray/train/examples/transformers/transformers_torch_trainer_basic.py > transformers_torch_trainer_basic.py
```
Let's fine-tune the model on two GPUs by setting `num_workers=2` in the last part of the transformers_torch_trainer_basic.py script:
```python
# [4] Build a Ray TorchTrainer to launch `train_func` on all workers
# ==================================================================
trainer = TorchTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)
)
trainer.fit()
```
We are now ready to run the script:
```shell
python transformers_torch_trainer_basic.py
```
The output should look like this:
```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-06 23:34:17,106 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-06 23:34:18,177 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-03-06 23:34:18,178 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /root/ray_results/TorchTrainer_2024-03-06_23-34-14
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/TorchTrainer_2024-03-06_23-34-14`

Training started without custom configuration.
(RayTrainWorker pid=75298) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=71127) Started distributed worker processes:
(TorchTrainer pid=71127) - (ip=10.216.70.82, pid=75298) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=71127) - (ip=10.216.70.82, pid=75299) world_rank=1, local_rank=1, node_rank=0
Downloading readme: 100%|██████████| 6.72k/6.72k [00:00<00:00, 39.9MB/s]
Downloading data:   0%|          | 0.00/299M [00:00<?, ?B/s]
Downloading data:   1%|▏         | 4.19M/299M [00:00<00:39, 7.45MB/s]
Downloading data:   4%|▍         | 12.6M/299M [00:00<00:20, 14.0MB/s]
Downloading data:   7%|▋         | 21.0M/299M [00:01<00:14, 19.3MB/s]
...
...
...
(RayTrainWorker pid=75298) {'eval_loss': 0.9538511633872986, 'eval_accuracy': 0.589, 'eval_runtime': 3.1428, 'eval_samples_per_second': 318.185, 'eval_steps_per_second': 20.046, 'epoch': 3.0}
(RayTrainWorker pid=75298) {'train_runtime': 39.3335, 'train_samples_per_second': 76.271, 'train_steps_per_second': 4.805, 'train_loss': 1.1883205837673612, 'epoch': 3.0}
100%|██████████| 189/189 [00:39<00:00,  4.81it/s]

Training completed after 0 iterations at 2024-03-06 23:38:20. Total running time: 4min 2s
```
With two GPUs, the fine-tuning takes about 4 minutes. Let's try running the same job with four GPUs (assuming your system has at least four GPUs). Do this by changing `num_workers` from `2` to `4`:
```python
# [4] Build a Ray TorchTrainer to launch `train_func` on all workers
# ==================================================================
trainer = TorchTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
trainer.fit()
```
Running the script again should produce the following output:
```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-06 23:49:10,338 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-06 23:49:11,468 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2024-03-06 23:49:11,469 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /root/ray_results/TorchTrainer_2024-03-06_23-49-08
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/TorchTrainer_2024-03-06_23-49-08`

Training started without custom configuration.
(RayTrainWorker pid=83857) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=83721) Started distributed worker processes:
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83857) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83858) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83859) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=83721) - (ip=10.216.70.82, pid=83860) world_rank=3, local_rank=3, node_rank=0
Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
Map:   2%|▏         | 1000/50000 [00:00<00:13, 3510.56 examples/s]
Map:   0%|          | 0/50000 [00:00<?, ? examples/s] [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Map:  40%|████      | 20000/50000 [00:05<00:07, 3945.42 examples/s] [repeated 75x across cluster]
Map:  80%|████████  | 40000/50000 [00:10<00:02, 4170.23 examples/s] [repeated 82x across cluster]
Map:  92%|█████████▏| 46000/50000 [00:11<00:00, 4377.90 examples/s]
Map: 100%|██████████| 50000/50000 [00:12<00:00, 3899.61 examples/s]
...
...
...
(RayTrainWorker pid=83857) 01<00:00, 20.58it/s]
100%|██████████| 96/96 [00:20<00:00,  4.59it/s]
(RayTrainWorker pid=83857) {'eval_loss': 0.9640204906463623, 'eval_accuracy': 0.578, 'eval_runtime': 1.7283, 'eval_samples_per_second': 578.6, 'eval_steps_per_second': 18.515, 'epoch': 3.0}
(RayTrainWorker pid=83857) {'train_runtime': 20.9123, 'train_samples_per_second': 143.456, 'train_steps_per_second': 4.591, 'train_loss': 1.1330984433492024, 'epoch': 3.0}

Training completed after 0 iterations at 2024-03-06 23:49:59. Total running time: 47s
```
With the two extra GPUs, the same job took 47 seconds. With Ray, scaling the resources for a job is as easy as changing a single parameter in your code.
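Because the worker count is just a Python value, you could go a step further and expose it as a command-line flag. The following is a small illustrative sketch, not part of the original script; `train_func` here is a stub standing in for the training loop defined in transformers_torch_trainer_basic.py:

```python
# Hypothetical wrapper: choose the number of GPU workers at launch time.
import argparse

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Placeholder for the training loop from transformers_torch_trainer_basic.py
    pass


parser = argparse.ArgumentParser()
parser.add_argument("--num-workers", type=int, default=2)
args = parser.parse_args()

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=args.num_workers, use_gpu=True),
)
trainer.fit()
```

With a wrapper like this, the two runs above become `--num-workers 2` and `--num-workers 4` on the same script.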
Convert an LLM model into a Ray Serve application
We can develop a Ray Serve application locally and deploy it on a production cluster (with AMD GPUs) in just a few lines of code. See the official Ray documentation page for detailed instructions.
As an example, we deploy an ML application that translates English to French. First, we create the Python script RayServe_En2Fr_translation_local.py, based on the model.py script from the Ray documentation page, to translate English text into French.
```python
# File name: RayServe_En2Fr_translation_local.py
from transformers import pipeline


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process the output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator = Translator()

translation = translator.translate("Hello world!")
print(translation)
```
We can test the script locally by running:
```shell
python RayServe_En2Fr_translation_local.py
```
```text
config.json: 100%|██████████| 1.21k/1.21k [00:00<00:00, 612kB/s]
model.safetensors: 100%|██████████| 242M/242M [00:02<00:00, 115MB/s]
generation_config.json: 100%|██████████| 147/147 [00:00<00:00, 93.6kB/s]
tokenizer_config.json: 100%|██████████| 2.32k/2.32k [00:00<00:00, 2.76MB/s]
spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 11.7MB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:00<00:00, 5.76MB/s]
Bonjour monde!
```
Next, we convert this script into RayServe_En2Fr_translation.py, which supports a Ray Serve application using FastAPI, based on the instructions on the Ray documentation page.
```python
# File name: RayServe_En2Fr_translation.py
import ray
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0})
@serve.ingress(app)
class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small")

    @app.post("/")
    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process the output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator_app = Translator.bind()
```
We launch the `translator_app` application in the background to serve the LLM model that translates English to French. We run the script with the `serve run` CLI command, which takes an import path in the format `<module>:<application>`.
Run the command in the directory containing the RayServe_En2Fr_translation.py script so that it can import the application:
```shell
serve run RayServe_En2Fr_translation:translator_app &
```
The expected output is:
```text
2024-02-01 16:30:33,699 INFO scripts.py:413 -- Running import path: 'RayServe_En2Fr_translation:translator_app'.
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-02-01 16:30:37,719 INFO worker.py:1753 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
(ProxyActor pid=333236) INFO 2024-02-01 16:30:41,275 proxy 10.216.70.84 proxy.py:1145 - Proxy actor d37bfb4a9e388b83347f768101000000 starting on node a5c7469f3d4184bfb971897878429cc42cfd73d4fb484fab478f6215.
(ProxyActor pid=333236) INFO 2024-02-01 16:30:41,278 proxy 10.216.70.84 proxy.py:1357 - Starting HTTP server on node: a5c7469f3d4184bfb971897878429cc42cfd73d4fb484fab478f6215 listening on port 8000
(ProxyActor pid=333236) INFO:     Started server process [333236]
(ServeController pid=333144) INFO 2024-02-01 16:30:41,379 controller 333144 deployment_state.py:1580 - Deploying new version of deployment Translator in application 'default'. Setting initial target number of replicas to 2.
(ServeController pid=333144) INFO 2024-02-01 16:30:41,481 controller 333144 deployment_state.py:1865 - Adding 2 replicas to deployment Translator in application 'default'.
2024-02-01 16:30:45,323 SUCC scripts.py:457 -- Deployed app successfully.
```
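If you prefer to stay in Python rather than use the CLI, Ray Serve also has a programmatic entry point. Here is a minimal sketch, assuming the RayServe_En2Fr_translation.py module from above is importable:

```python
# Deploy the same application from Python instead of the `serve run` CLI.
from ray import serve

from RayServe_En2Fr_translation import translator_app

serve.run(translator_app)  # deploys the app; HTTP is served on port 8000
```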
Once the server is up on our cluster, we can test the application locally with the RayServe_En2Fr_translation_client.py script. This is the model_client.py script from the Ray documentation page, renamed to RayServe_En2Fr_translation_client.py. It sends a `POST` request (in JSON format) containing the English text.
```python
# File name: RayServe_En2Fr_translation_client.py
import requests

response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"})
french_text = response.json()

print(french_text)
```
This client script requests a translation of the phrase "Hello world!":
```shell
python RayServe_En2Fr_translation_client.py
```
The expected output is:
```text
Bonjour monde!
(ServeReplica:default:Translator pid=333328) INFO 2024-02-01 16:38:02,251 default_Translator hxagjhct 6625dbbe-cbba-40cd-ba86-5cbff9cb6aa2 / replica.py:380 - __CALL__ OK 192.1ms
```
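Because the deployment was declared with `num_replicas=2`, two copies of the model can serve traffic at once. As a quick sketch (a hypothetical load test against the same endpoint), we can send several requests in parallel and let Serve spread them across the replicas:

```python
# Send several translation requests concurrently; Serve load-balances them
# across the two Translator replicas.
from concurrent.futures import ThreadPoolExecutor

import requests


def translate(text: str) -> str:
    response = requests.post("http://127.0.0.1:8000/", params={"text": text})
    return response.json()


phrases = ["Hello world!", "Good morning.", "See you tomorrow."]
with ThreadPoolExecutor(max_workers=3) as pool:
    for translation in pool.map(translate, phrases):
        print(translation)
```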
Use Ray Serve to serve a Stable Diffusion model
Stable Diffusion is one of the most popular image-generation models. It takes a text prompt and generates an image matching the meaning of the prompt.
In this example, we use Ray to launch a server for the stabilityai/stable-diffusion-2-1-base model, with an API powered by FastAPI.
To run this example, install the following:
```shell
pip install requests diffusers==0.12.1 transformers
```
Create a Python script with the `Serve` code and save it as RayServe_StableDiffusion.py, following the Ray documentation.
```python
# File name: RayServe_StableDiffusion.py
from io import BytesIO

from fastapi import FastAPI
from fastapi.responses import Response
import torch

from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, diffusion_model_handle: DeploymentHandle) -> None:
        self.handle = diffusion_model_handle

    @app.get(
        "/imagine",
        responses={200: {"content": {"image/png": {}}}},
        response_class=Response,
    )
    async def generate(self, prompt: str, img_size: int = 512):
        assert len(prompt), "prompt parameter cannot be empty"

        image = await self.handle.generate.remote(prompt, img_size=img_size)
        file_stream = BytesIO()
        image.save(file_stream, "PNG")
        return Response(content=file_stream.getvalue(), media_type="image/png")


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 0, "max_replicas": 2},
)
class StableDiffusionV2:
    def __init__(self):
        from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

        model_id = "stabilityai/stable-diffusion-2"

        scheduler = EulerDiscreteScheduler.from_pretrained(
            model_id, subfolder="scheduler"
        )
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
        )
        self.pipe = self.pipe.to("cuda")

    def generate(self, prompt: str, img_size: int = 512):
        assert len(prompt), "prompt parameter cannot be empty"

        with torch.autocast("cuda"):
            image = self.pipe(prompt, height=img_size, width=img_size).images[0]
            return image


entrypoint = APIIngress.bind(StableDiffusionV2.bind())
```
Start the Serve application with the following command:
```shell
serve run RayServe_StableDiffusion:entrypoint &
```
The expected output is:
```text
2024-03-12 23:00:21,290 INFO scripts.py:502 -- Running import path: 'RayServe_StableDiffusion:entrypoint'.
2024-03-12 23:00:22,443 WARNING api.py:424 -- The default value for `max_ongoing_requests` is currently 100, but will change to 5 in the next upcoming release.
2024-03-12 23:00:22,453 WARNING api.py:364 -- The default value for `target_ongoing_requests` is currently 1.0, but will change to 2.0 in an upcoming release.
2024-03-12 23:00:22,453 WARNING api.py:424 -- The default value for `max_ongoing_requests` is currently 100, but will change to 5 in the next upcoming release.
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-12 23:00:24,606 INFO worker.py:1743 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=121115) INFO 2024-03-12 23:00:28,216 proxy 10.216.70.82 proxy.py:1160 - Proxy starting on node 724ae23b5a7bfda10da26d36de8efc5af99cd7d7b1ddbb4379810ff5 (HTTP port: 8000).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,316 controller 121023 deployment_state.py:1581 - Deploying new version of Deployment(name='StableDiffusionV2', app='default') (initial target replicas: 0).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,317 controller 121023 deployment_state.py:1581 - Deploying new version of Deployment(name='APIIngress', app='default') (initial target replicas: 1).
(ServeController pid=121023) INFO 2024-03-12 23:00:28,419 controller 121023 deployment_state.py:1883 - Adding 1 replica to Deployment(name='APIIngress', app='default').
2024-03-12 23:00:30,262 INFO api.py:601 -- Deployed app 'default' successfully.
```
We can now send requests to the server through the API. Create the script RayServe_StableDiffusion_client.py using the client code from the Ray documentation.
```python
# File name: RayServe_StableDiffusion_client.py
import requests

prompt = "a cute cat is dancing on the grass."
input = "%20".join(prompt.split(" "))

resp = requests.get(f"http://127.0.0.1:8000/imagine?prompt={input}")
with open("output.png", "wb") as f:
    f.write(resp.content)
```
Run the RayServe_StableDiffusion_client.py script to send this application a request with the prompt "a cute cat is dancing on the grass.":
```shell
python RayServe_StableDiffusion_client.py
```
The generated image is saved as output.png.
The expected output is shown below. Note that, because the deployment sets `min_replicas: 0`, the autoscaler first has to add a StableDiffusionV2 replica to handle the incoming request:
```text
(ServeController pid=108630) INFO 2024-03-12 22:58:24,938 controller 108630 deployment_state.py:1648 - Autoscaling Deployment(name='StableDiffusionV2', app='default') to 1 replicas. Current num requests: 1, current num running replicas: 0.
(ServeController pid=108630) INFO 2024-03-12 22:58:24,939 controller 108630 deployment_state.py:1883 - Adding 1 replica to Deployment(name='StableDiffusionV2', app='default').
Fetching 12 files: 100%|██████████| 12/12 [00:00<00:00, 158774.91it/s]
(ServeReplica:default:StableDiffusionV2 pid=113302) Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with:
(ServeReplica:default:StableDiffusionV2 pid=113302) ```
(ServeReplica:default:StableDiffusionV2 pid=113302) pip install accelerate
(ServeReplica:default:StableDiffusionV2 pid=113302) ```
(ServeReplica:default:StableDiffusionV2 pid=113302) .
(ServeReplica:default:StableDiffusionV2 pid=113302) /opt/conda/envs/ray_py3.8/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
(ServeReplica:default:StableDiffusionV2 pid=113302) return self.fget.__get__(instance, owner)()
  0%|          | 0/50 [00:00<?, ?it/s]
  2%|▏         | 1/50 [00:00<00:13,  3.54it/s]
...
 98%|█████████▊| 49/50 [00:04<00:00, 14.81it/s]
100%|██████████| 50/50 [00:04<00:00, 10.03it/s]
(ServeReplica:default:StableDiffusionV2 pid=113302) INFO 2024-03-12 22:58:41,668 default_StableDiffusionV2 jmk29epw 6178f374-a0db-4f82-890e-d97ac108a456 /imagine replica.py:366 - GENERATE OK 5452.7ms
(ServeReplica:default:APIIngress pid=108814) INFO 2024-03-12 22:58:41,775 default_APIIngress aj3gvtvh 6178f374-a0db-4f82-890e-d97ac108a456 /imagine replica.py:366 - __CALL__ OK 16842.9ms
```
Here is the generated image:
Use Ray Tune to tune an XGBoost classifier
In this section, we use XGBoost on Ray to train a classifier. XGBoost is an optimized library for distributed gradient boosting. It has become the leading ML library for solving regression and classification problems. For a deeper dive into how gradient boosting works, we recommend reading Introduction to Boosted Trees.
In this example, the xgboost_example.py script uses an XGBoost classifier to detect breast cancer. Ray Tune samples 10 different hyperparameter settings and trains an XGBoost classifier on each of them. A `TrialScheduler` can stop poorly performing trials early to cut training time, concentrating all resources on the better-performing trials. See the official Ray documentation for details.
You can quickly download the script using `curl`:
```shell
curl https://raw.githubusercontent.com/ROCm/ray/master/python/ray/tune/examples/xgboost_example.py > xgboost_example.py
```
Install `scikit-learn` and `xgboost`, then run the script:
```shell
pip install scikit-learn
pip install xgboost
python xgboost_example.py
```
The output should look like this:

```text
Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-03-07 00:31:55,362 INFO worker.py:1754 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2024-03-07 00:31:56,477 INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-03-07 00:31:56,478 INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

╭────────────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     train_breast_cancer_2024-03-07_00-31-53   │
├────────────────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator                     │
│ Scheduler                        AsyncHyperBandScheduler                   │
│ Number of trials                 10                                        │
╰────────────────────────────────────────────────────────────────────────────╯

View detailed results here: /root/ray_results/train_breast_cancer_2024-03-07_00-31-53
To visualize your results with TensorBoard, run: `tensorboard --logdir /root/ray_results/train_breast_cancer_2024-03-07_00-31-53`

Trial status: 10 PENDING
Current time: 2024-03-07 00:31:57. Total running time: 0s
Logical resource usage: 10.0/128 CPUs, 0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI210)
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                        status      max_depth   min_child_weight   subsample         eta │
├───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_breast_cancer_1a600_00000   PENDING             7                  1    0.98392   0.000278876 │
│ train_breast_cancer_1a600_00001   PENDING             6                  1   0.909173    0.00555103 │
│ train_breast_cancer_1a600_00002   PENDING             4                  2    0.78322     0.0216884 │
│ train_breast_cancer_1a600_00003   PENDING             4                  2   0.528893     0.0112016 │
│ train_breast_cancer_1a600_00004   PENDING             7                  2   0.541909     0.0597606 │
│ train_breast_cancer_1a600_00005   PENDING             6                  1   0.938674    0.00347829 │
│ train_breast_cancer_1a600_00006   PENDING             8                  1   0.883378    0.00682739 │
│ train_breast_cancer_1a600_00007   PENDING             1                  2   0.972819     0.0197333 │
│ train_breast_cancer_1a600_00008   PENDING             6                  2   0.576396    0.00416918 │
│ train_breast_cancer_1a600_00009   PENDING             4                  1   0.697624    0.00031904 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

Trial train_breast_cancer_1a600_00000 started with configuration:
╭───────────────────────────────────────────────────────────────────────╮
│ Trial train_breast_cancer_1a600_00000 config                          │
├───────────────────────────────────────────────────────────────────────┤
│ eta                                                           0.00028 │
│ eval_metric                                      ['logloss', 'error'] │
│ max_depth                                                           7 │
│ min_child_weight                                                    1 │
│ objective                                             binary:logistic │
│ subsample                                                     0.98392 │
╰───────────────────────────────────────────────────────────────────────╯
...
...
...

Trial train_breast_cancer_1a600_00004 completed after 10 iterations at 2024-03-07 00:31:59. Total running time: 2s
╭────────────────────────────────────────────────────────────────────╮
│ Trial train_breast_cancer_1a600_00004 result                       │
├────────────────────────────────────────────────────────────────────┤
│ checkpoint_dir_name                              checkpoint_000009 │
│ time_this_iter_s                                            0.0017 │
│ time_total_s                                               0.19301 │
│ training_iteration                                              10 │
│ test-error                                                 0.06993 │
│ test-logloss                                                0.3826 │
╰────────────────────────────────────────────────────────────────────╯

Trial status: 10 TERMINATED
Current time: 2024-03-07 00:31:59. Total running time: 2s
Logical resource usage: 1.0/128 CPUs, 0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI210)
Current best trial: 1a600_00004 with test-logloss=0.38260495725211563 and params={'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'max_depth': 7, 'min_child_weight': 2, 'subsample': 0.5419086005804928, 'eta': 0.05976055309805102}
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                        status       max_depth   min_child_weight   subsample           eta   iter   total time (s)   test-logloss   test-error │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ train_breast_cancer_1a600_00000   TERMINATED           7                  1    0.98392   0.000278876      1         0.158494        0.65572     0.363636 │
│ train_breast_cancer_1a600_00001   TERMINATED           6                  1   0.909173    0.00555103      1         0.164058       0.678485     0.412587 │
│ train_breast_cancer_1a600_00002   TERMINATED           4                  2    0.78322     0.0216884      1         0.141597       0.629168     0.335664 │
│ train_breast_cancer_1a600_00003   TERMINATED           4                  2   0.528893     0.0112016     10         0.141248       0.560342     0.300699 │
│ train_breast_cancer_1a600_00004   TERMINATED           7                  2   0.541909     0.0597606     10         0.193006       0.382605    0.0699301 │
│ train_breast_cancer_1a600_00005   TERMINATED           6                  1   0.938674    0.00347829      1         0.175483       0.703689     0.447552 │
│ train_breast_cancer_1a600_00006   TERMINATED           8                  1   0.883378    0.00682739      2         0.163225       0.639305      0.34965 │
│ train_breast_cancer_1a600_00007   TERMINATED           1                  2   0.972819     0.0197333      1         0.167176       0.678488     0.426573 │
│ train_breast_cancer_1a600_00008   TERMINATED           6                  2   0.576396    0.00416918      1        0.0118954       0.652763     0.363636 │
│ train_breast_cancer_1a600_00009   TERMINATED           4                  1   0.697624    0.00031904      1        0.0119591       0.691917     0.426573 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Best model parameters: {'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'max_depth': 7, 'min_child_weight': 2, 'subsample': 0.5419086005804928, 'eta': 0.05976055309805102}
Best model total accuracy: 0.9301
(train_breast_cancer pid=100759) [00:31:59] WARNING: /xgboost/src/c_api/c_api.cc:1348: Saving model in the UBJSON format as default. You can use file extension: `json`, `ubj` or `deprecated` to choose between formats. [repeated 27x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(train_breast_cancer pid=100759) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/train_breast_cancer_2024-03-07_00-31-53/train_breast_cancer_1a600_00006_6_eta=0.0068,max_depth=8,min_child_weight=1,subsample=0.8834_2024-03-07_00-31-57/checkpoint_000001) [repeated 27x across cluster]
```
Note that eight of the trials were stopped after only a few iterations instead of running all 10. Only the two best-performing trials completed all 10 iterations.
Summary
In this blog, we described how to use Ray to scale AI applications across multiple AMD GPUs. In future blogs, we will explore how to use Ray to scale AI applications to multi-node clusters.
Appendix
Docker files
docker-compose.yaml
```yaml
version: "3.7"

services:
  ray-blog:
    build:
      context: ..
      dockerfile: ./docker/dockerfile
    volumes:
      - ..:/root/
    devices:
      - /dev/kfd
      - /dev/dri
    command: /bin/bash
```
dockerfile
```dockerfile
FROM rocm/dev-ubuntu-22.04:5.7-complete

ARG PY_VERSION=3.8
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y python${PY_VERSION} python${PY_VERSION}-venv git

# Prepare the Python virtual environment and add it to the path
ENV PATH=/ray_venv/bin:$PATH
RUN python${PY_VERSION} -m venv ray_venv && python -m pip install --upgrade pip wheel

# Install Ray
RUN pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Install torch
RUN --mount=type=cache,target=/root/.cache pip3 install torch==2.0.1 torchvision==0.15.2 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-5.7/

# Install additional dependencies
RUN pip3 install evaluate==0.4.1 \
    transformers==4.39.3 \
    accelerate==0.28.0 \
    scikit-learn==1.3.2 \
    requests==2.31.0 \
    diffusers==0.12.1

# Build XGBoost
RUN git clone --depth=1 --recurse-submodules https://github.com/ROCmSoftwarePlatform/xgboost xgboost \
    && cd xgboost \
    && mkdir build && cd build \
    && export GFXARCH="$(rocm_agent_enumerator | tail -1)" \
    && export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/opt/rocm/lib/cmake:/opt/rocm/lib/cmake/AMDDeviceLibs/ \
    && cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=${GFXARCH} -DUSE_RCCL=1 ../ \
    && make -j \
    && pip3 install ../python-package/

WORKDIR /root
```
Running on host
If you prefer not to use Docker, you can also run this blog directly on your machine, although it takes a bit more work.
- Prerequisites:

  - Install ROCm 5.7.x
  - Ensure Python 3.8 is installed

- Create and activate a Python virtual environment:

  ```shell
  python3.8 -m venv venv
  source ./venv/bin/activate
  ```

- Install the Ray wheel:

  ```shell
  pip install "ray[data,train,tune,serve] @ https://github.com/ROCm/ray/releases/download/v3.0.0-dev0%2Brocm/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
  ```

- Install dependencies:

  ```shell
  # Install torch
  pip3 install torch==2.0.1 torchvision==0.15.2 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-5.7/

  # Install additional dependencies
  pip3 install evaluate==0.4.1 \
      transformers==4.39.3 \
      accelerate==0.28.0 \
      scikit-learn==1.3.2 \
      requests==2.31.0 \
      diffusers==0.12.1
  ```

- Install XGBoost for ROCm (it must be built from source):

  ```shell
  cd $HOME
  git clone --depth=1 --recurse-submodules https://github.com/ROCmSoftwarePlatform/xgboost
  cd xgboost
  mkdir build && cd build
  export GFXARCH="$(rocm_agent_enumerator | tail -1)"
  export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:/opt/rocm/lib/cmake:/opt/rocm/lib/cmake/AMDDeviceLibs/
  cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=${GFXARCH} -DUSE_RCCL=1 ../
  make -j
  pip3 install ../python-package/
  ```