
[diffusers Quick Start (8)] Saving GPU VRAM: A Summary of Memory-Reduction Techniques

article 2025/2/8 15:28:20

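Series Index

[diffusers Quick Start (1)] What does a pipeline actually call? The __call__ method!
[diffusers Quick Start (2)] How do you get the intermediate denoising results? Pipeline callbacks
[diffusers Quick Start (3)] How the generated image size relates to the UNet and the VAE
[diffusers Quick Start (4)] What is the EMA operation?
[diffusers Quick Start (5)] What does the Scheduler (noise_scheduler) do in a diffusion model?
[diffusers Quick Start (6)] Caching gradients and automatically scaling the learning rate, with a code walkthrough
[diffusers Quick Start (7)] An intuitive look at Classifier-Free Guidance (CFG) and the corresponding code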

Contents

Series Index
Summary
Reducing memory usage
- Sliced VAE
- Tiled VAE
- CPU offloading
- Model offloading
- Channels-last memory format
- Tracing
- Memory-efficient attention

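Summary

Function	Purpose
`enable_vae_slicing()`	Decodes large batches of images when VRAM is limited by decoding the latents one image at a time; combine with `enable_xformers_memory_efficient_attention()` to reduce memory use further
`enable_vae_tiling()`	Handles large images when VRAM is limited by splitting the image into overlapping tiles, decoding each tile separately, and blending the results; combine with `enable_xformers_memory_efficient_attention()` to reduce memory use
`enable_sequential_cpu_offload()`	Offloads the model weights to the CPU and loads them onto the GPU only for the forward pass; saves memory at the cost of slower inference
`enable_model_cpu_offload()`	Moves one whole model component to the GPU at a time and keeps the others on the CPU; affects inference time far less than sequential CPU offloading while still saving some memory
`enable_xformers_memory_efficient_attention()`	Applies memory-efficient attention, which increases speed and reduces GPU memory use; requires PyTorch > 1.12, an available CUDA device, and xFormers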

This article is a translated summary of the official documentation. Original 🔗: https://huggingface.co/docs/diffusers/en/optimization/memory

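Reducing memory usage

A major obstacle to using diffusion models is the amount of memory they require. Fortunately, several techniques reduce memory usage enough to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can also be combined to reduce memory usage further.

In many cases, optimizing for memory or for speed also improves the other, so it is worth optimizing for both whenever you can. This guide focuses on minimizing memory usage; related material on speeding up inference is also available.

The results below were obtained on an NVIDIA Titan RTX by generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps. They show the inference speed-up that comes with reduced memory consumption.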

Method	Latency	Speed-up
Original	9.50 s	x1
fp16	3.61 s	x2.63
Channels last	3.30 s	x2.88
Traced UNet	3.21 s	x2.96
Memory-efficient attention	2.63 s	x3.61
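
The speed-up column is the baseline latency divided by each method's latency. A minimal sketch that recomputes it from the latencies in the table; the dictionary keys are shorthand labels chosen here for illustration, not identifiers from diffusers:

```python
# Latencies in seconds, copied from the table above.
latencies = {
    "original": 9.50,
    "fp16": 3.61,
    "channels_last": 3.30,
    "traced_unet": 3.21,
    "memory_efficient_attention": 2.63,
}

baseline = latencies["original"]

# Speed-up = baseline latency / method latency, rounded to two decimals.
speedups = {name: round(baseline / t, 2) for name, t in latencies.items()}

for name, factor in speedups.items():
    print(f"{name}: x{factor:.2f}")
```

The recomputed factors agree with the published column, so the table is internally consistent.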

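Sliced VAE

Sliced VAE makes it possible to decode large batches of images with limited VRAM, even batches of 32 images or more. It works by decoding the latents one image at a time. If xFormers is installed, it is best combined with enable_xformers_memory_efficient_attention() to reduce memory use further.

To use sliced VAE, call enable_vae_slicing() on your pipeline before inference: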

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_vae_slicing()
#pipe.enable_xformers_memory_efficient_attention()
images = pipe([prompt] * 32).images

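You may see a small performance boost in VAE decoding on multi-image batches, and there should be essentially no impact on single-image batches.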
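
The idea behind slicing can be sketched without a GPU. In the toy example below, decode_batch is a made-up stand-in for a VAE decoder (it is not the diffusers implementation), and decode_sliced is a hypothetical helper that feeds it one latent at a time. The point is only that the result is the same while the decoder never sees more than one latent per call:

```python
def decode_batch(latents):
    """Toy stand-in for a VAE decoder: maps each latent to a fake 'image'."""
    return [[value * 2.0 for value in latent] for latent in latents]


def decode_sliced(latents):
    """Decode one latent at a time and concatenate the results."""
    images = []
    for latent in latents:
        # The decoder only ever receives a batch of size 1.
        images.extend(decode_batch([latent]))
    return images


latents = [[float(i), float(i) + 0.5] for i in range(32)]
full = decode_batch(latents)     # one call on all 32 latents
sliced = decode_sliced(latents)  # 32 calls on one latent each
print(sliced == full)
```

Because each call handles a single latent, the decoder's working memory no longer grows with the batch size, which is where the VRAM saving comes from.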

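Tiled VAE

Tiled VAE processing likewise makes it possible to work with large images on limited VRAM, for example generating 4K images with 8GB of VRAM. It splits the image into overlapping tiles, decodes each tile separately, and blends the outputs into the final image. If xFormers is installed, combining it with enable_xformers_memory_efficient_attention() is again recommended to reduce memory use.

To use tiled VAE processing, call enable_vae_tiling() before inference: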

import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "a beautiful landscape photograph"
pipe.enable_vae_tiling()
#pipe.enable_xformers_memory_efficient_attention()

image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]

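Because the tiles are decoded separately, the output image may show some tone variation from tile to tile, but you should not see sharp seams between tiles. Tiling is turned off automatically for images that are 512x512 or smaller.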
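
To make the split, decode, and blend steps concrete, here is a one-dimensional sketch. It is a simplification under stated assumptions: the function names (tile_starts, ramp, decode_tiled) are hypothetical, the "decoder" is an arbitrary callable, and the weighted-average blending shown here is one common way to hide seams, not necessarily what diffusers does internally:

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles covering [0, length); needs overlap < tile."""
    stride = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # last tile sits flush with the end
    return starts


def ramp(n, overlap):
    """Blend weights that grow from each tile edge inward, capped at 1."""
    return [min(i + 1, n - i, overlap + 1) / (overlap + 1) for i in range(n)]


def decode_tiled(signal, decode, tile=4, overlap=2):
    """Decode each tile separately, then blend overlaps by weighted average."""
    total = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for start in tile_starts(len(signal), tile, overlap):
        piece = decode(signal[start:start + tile])
        for offset, (value, w) in enumerate(zip(piece, ramp(len(piece), overlap))):
            total[start + offset] += w * value
            weight[start + offset] += w
    return [t / w for t, w in zip(total, weight)]


signal = [float(i) for i in range(11)]
# With an identity "decoder", blending must reproduce the input.
blended = decode_tiled(signal, decode=lambda piece: list(piece))
print(tile_starts(len(signal), tile=4, overlap=2))
```

If each tile were decoded with a slightly different tone, the ramp weights would turn the hard step at a tile boundary into a gradual transition across the overlap, which matches the behaviour described above: tone variation is possible, sharp seams are not.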

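CPU offloading

Offloading the model weights to the CPU and loading them onto the GPU only for the forward pass also saves memory. This typically brings memory consumption below 3GB.

To perform CPU offloading, call enable_sequential_cpu_offload():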

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]

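CPU offloading operates on a model's submodules rather than on the whole model. It is an effective way to minimize memory use, but inference is much slower because the diffusion process is iterative. The pipeline's UNet component runs several times (as many as num_inference_steps), and on each run the UNet submodules are loaded and offloaded sequentially as needed, which results in a large number of memory transfers.

If speed matters more to you, consider model offloading: it saves less memory than CPU offloading but is much faster.

When using enable_sequential_cpu_offload(), do not move the pipeline to CUDA beforehand, or the memory savings will be minimal (the original documentation links to an issue with more information).

enable_sequential_cpu_offload() is a stateful operation that installs hooks on the models.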
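
The trade-off described above can be put into numbers with a small bookkeeping simulation. This is only a sketch: simulate_sequential_offload is a hypothetical helper, the submodule sizes are invented, and no real tensors or devices are involved.

```python
def simulate_sequential_offload(submodule_sizes_mb, num_inference_steps):
    """Track peak GPU residency and transfer count for per-submodule offloading."""
    peak_mb = 0
    transfers = 0
    for _ in range(num_inference_steps):
        for size in submodule_sizes_mb:
            transfers += 1                # CPU -> GPU before the forward pass
            peak_mb = max(peak_mb, size)  # only this submodule is resident
            transfers += 1                # GPU -> CPU once the next one is due
    return peak_mb, transfers


# Made-up submodule sizes in MB; real UNet submodules differ.
sizes = [120, 300, 450, 300, 120]
peak, transfers = simulate_sequential_offload(sizes, num_inference_steps=50)
print(f"peak resident: {peak} MB of {sum(sizes)} MB, transfers: {transfers}")
```

Peak residency is bounded by the largest submodule, while the number of transfers grows with both the number of submodules and the number of inference steps, which is why this mode is slow.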

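Model offloading

Model offloading requires 🤗 Accelerate version 0.17.0 or higher.

Sequential CPU offloading saves a lot of memory but slows inference down, because submodules are moved to the GPU as needed and are returned to the CPU as soon as a new module runs.

Full-model offloading is an alternative that moves whole models to the GPU instead of handling each model's submodules separately. Compared with moving the pipeline to cuda directly, it has a negligible impact on inference time while still saving some memory.

During model offloading, only one of the pipeline's main components (typically the text encoder, UNet, and VAE) is on the GPU at a time while the others wait on the CPU. Components that run for multiple iterations, such as the UNet, stay on the GPU until they are no longer needed.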
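
A rough way to see why this is cheaper is to count how often a component has to be brought onto the GPU. The sketch below is a pure bookkeeping simulation; simulate_model_offload is a hypothetical helper and the call order is an assumed text-to-image flow (encode the prompt, denoise, decode):

```python
def simulate_model_offload(call_sequence):
    """Count moves to the GPU when only one component may be resident at a time."""
    on_gpu = None
    moves_to_gpu = 0
    for component in call_sequence:
        if component != on_gpu:
            # The previous component returns to the CPU; the new one is loaded.
            on_gpu = component
            moves_to_gpu += 1
    return moves_to_gpu


num_inference_steps = 50
# Assumed call order: text encoder once, UNet once per step, VAE once.
calls = ["text_encoder"] + ["unet"] * num_inference_steps + ["vae"]
print(simulate_model_offload(calls))
```

The UNet is loaded once and reused for all 50 steps, so the number of moves depends on the number of components rather than on the number of inference steps.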

Enable model offloading by calling enable_model_cpu_offload() on the pipeline:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]

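For the models to be offloaded properly after they are called, the entire pipeline must be run and the models must be called in the order the pipeline expects. Take care if you reuse the models outside the pipeline after the hooks have been installed. See the "Removing Hooks" section for more information.

enable_model_cpu_offload() is a stateful operation that installs hooks on the models and sets state on the pipeline.

Channels-last memory format

The channels-last memory format is an alternative way of laying out NCHW tensors in memory that keeps the dimension ordering unchanged. In this format the channels become the most densely packed dimension (the image is stored pixel by pixel). Not all operators support the channels-last format yet, so using it may reduce performance, but it is still worth trying to see whether it helps your model.

For example, to make the pipeline's UNet use the channels-last format: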

print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation

print(
    pipe.unet.conv_out.state_dict()["weight"].stride()
)  # (2880, 1, 960, 320): a stride of 1 in the second dimension shows that it worked
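
The two stride tuples can be derived by hand. The sketch below uses plain Python rather than PyTorch, with hypothetical helper names. The shape (4, 320, 3, 3) is an assumption: the last three dimensions follow from the printed strides, while the leading 4 is assumed from the 4 latent channels of Stable Diffusion v1.5 and does not affect the strides.

```python
def contiguous_strides(shape):
    """Element strides for the default row-major NCHW layout."""
    n, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape):
    """Element strides when the same NCHW tensor is laid out as NHWC in memory."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


# Assumed weight shape of conv_out: (out_channels, in_channels, kH, kW).
shape = (4, 320, 3, 3)
print(contiguous_strides(shape))     # (2880, 9, 3, 1)
print(channels_last_strides(shape))  # (2880, 1, 960, 320)
```

In the channels-last layout, stepping to the next channel moves by a single element, so all channels of one pixel sit next to each other in memory.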

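Tracing

Tracing runs an example input tensor through the model and captures the operations performed on it as it passes through the model's layers. The returned executable or ScriptFunction is optimized with just-in-time compilation.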
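
The record-then-replay idea can be illustrated with a toy tracer in plain Python. This is not how torch.jit.trace is implemented; Traced, trace, and model are hypothetical names, and only addition and multiplication are supported. It merely shows that running one example input is enough to capture a reusable list of operations:

```python
class Traced:
    """Proxy value that records every arithmetic operation applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __add__(self, other):
        self.tape.append(("add", other))
        return Traced(self.value + other, self.tape)

    def __mul__(self, other):
        self.tape.append(("mul", other))
        return Traced(self.value * other, self.tape)


def trace(fn, example_input):
    """Run fn once on a proxy and return a function that replays the recording."""
    tape = []
    fn(Traced(example_input, tape))

    def replay(x):
        for op, arg in tape:
            x = x + arg if op == "add" else x * arg
        return x

    return replay, tape


def model(x):
    return (x * 3 + 1) * 2


traced_model, tape = trace(model, example_input=5.0)
print(tape)
print(traced_model(2.0), model(2.0))
```

The replayed function no longer executes the original Python code, only the recorded operations, which is what gives a compiler room to optimize them.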

The following code traces the UNet:

import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# Disable torch gradient computation
torch.set_grad_enabled(False)

# Set variables
n_experiments = 2
unet_runs_per_experiment = 50
# Build the example inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use the channels-last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # make return_dict=False the default
# Warm up
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# Trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# Warm up and optimize the graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# Benchmark
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# Save the model
unet_traced.save("unet_traced.pt")

Replace the pipeline's unet attribute with the traced model:

from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass


@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# Use the JIT-compiled UNet
unet_traced = torch.jit.load("unet_traced.pt")

# Wrapper that stands in for pipe.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

# Same prompt as in the earlier examples
prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    image = pipe([prompt] * 1, num_inference_steps=50).images[0]

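Memory-efficient attention

Recent work on optimizing bandwidth in the attention block has produced large speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is Flash Attention (the original code is available at HazyResearch/flash-attention).

If you have PyTorch 2.0 or later installed, you should not expect an inference speed-up from enabling xformers.

To use Flash Attention, install the following:

PyTorch > 1.12
CUDA available
xFormers

Then call enable_xformers_memory_efficient_attention() on the pipeline: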

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")

# Optional: you can disable it with the following line
# pipe.disable_xformers_memory_efficient_attention()
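
Since xFormers is an optional dependency, it can be convenient to enable it only when it is actually installed. The sketch below is one possible guard, not part of diffusers: xformers_installed and enable_memory_efficient_attention_if_possible are hypothetical helper names, and the only pipeline method used is enable_xformers_memory_efficient_attention() from the example above.

```python
import importlib.util


def xformers_installed():
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


def enable_memory_efficient_attention_if_possible(pipe):
    """Enable xFormers attention on `pipe` only when xformers is installed."""
    if xformers_installed():
        pipe.enable_xformers_memory_efficient_attention()
        return True
    return False
```

After loading a pipeline, calling enable_memory_efficient_attention_if_possible(pipe) returns whether the optimization was switched on, so the same script can run on machines with and without xFormers. The guard only checks that the package is present; it does not verify the PyTorch version or CUDA availability.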