
[AIGC] ICCV 2023: Adding Conditional Control to Text-to-Image Diffusion Models

article 2024/10/25 10:41:57

2023-ICCV-Adding Conditional Control to Text-to-Image Diffusion Models

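Adding Conditional Control to Text-to-Image Diffusion Models
- Abstract
- 1. Introduction
- 2. Related Work
  - 2.1 Finetuning Neural Networks
  - 2.2 Image Diffusion
  - 2.3 Image-to-Image Translation
- 3. Method
  - 3.1 ControlNet
  - 3.2 ControlNet for Text-to-Image Diffusion
  - 3.3 Training
  - 3.4 Inference
- 4. Experiments
  - 4.1 Qualitative Results
  - 4.2 Ablative Study
  - 4.3 Quantitative Evaluation
  - 4.4 Comparison to Previous Methods
  - 4.5 Discussion
- 5. Conclusion
- Acknowledgment
- References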

Adding Conditional Control to Text-to-Image Diffusion Models

Authors: Lvmin Zhang, Anyi Rao, and Maneesh Agrawala
Affiliation: Stanford University
Paper: 2023-ICCV-Adding Conditional Control to Text-to-Image Diffusion Models

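Figure 1

Figure 1: Controlling Stable Diffusion with learned conditions. ControlNet allows users to add conditions such as Canny edges (top) or human pose (bottom) to control the image generation of large pretrained diffusion models. The default results use the prompt "a high-quality, detailed, and professional image". Users can optionally give a prompt such as "a chef in the kitchen".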

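Abstract

We present ControlNet, a neural network architecture for adding spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models and reuses their deep and robust encoding layers, pretrained with billions of images, as a strong backbone for learning a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise affects the finetuning. We test various conditioning controls with Stable Diffusion, e.g., edges, depth, segmentation, and human pose, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with both small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications for controlling image diffusion models.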

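1. Introduction

Many of us have experienced flashes of visual inspiration that we wish to capture in a unique image. With the advent of text-to-image diffusion models [54, 61, 71], we can now create visually stunning images by typing in a text prompt. Yet text-to-image models are limited in the control they provide over the spatial composition of the image; precisely expressing complex layouts, poses, shapes, and forms can be difficult via text prompts alone. Generating an image that accurately matches our mental imagery often requires numerous trial-and-error cycles of editing a prompt, inspecting the resulting images, and then re-editing the prompt.

Can we enable finer-grained spatial control by letting users provide additional images that directly specify their desired image composition? In computer vision and machine learning, these additional images (e.g., edge maps, human pose skeletons, segmentation maps, depth, normals) are often treated as conditioning on the image generation process. Image-to-image translation models [34, 97] learn the mapping from conditioning images to target images. The research community has also taken steps to control text-to-image models with spatial masks [6, 20], image editing instructions [10], personalization via finetuning [21, 74], and so on. While a few problems (e.g., generating image variations, inpainting) can be resolved with training-free techniques such as constraining the denoising diffusion process or editing attention layer activations, a wider variety of problems such as depth-to-image and pose-to-image require end-to-end learning and data-driven solutions.

Learning conditional controls for large text-to-image diffusion models in an end-to-end way is challenging. The amount of training data for a specific condition may be significantly smaller than the data available for general text-to-image training. For instance, the largest datasets for various specific problems (e.g., object shape/normal, human pose extraction) are usually about 100K in size, which is 50,000 times smaller than the LAION-5B [78] dataset that was used to train Stable Diffusion [81]. Directly finetuning or continuing to train a large pretrained model with limited data may cause overfitting and catastrophic forgetting [31, 74]. Researchers have shown that such forgetting can be alleviated by restricting the number or rank of trainable parameters [14, 25, 31, 91]. For our problem, designing deeper or more customized neural architectures might be necessary for handling in-the-wild conditioning images with complex shapes and diverse high-level semantics.

This paper presents ControlNet, an end-to-end neural network architecture that learns conditional controls for large pretrained text-to-image diffusion models (Stable Diffusion in our implementation). ControlNet preserves the quality and capabilities of the large model by locking its parameters and making a trainable copy of its encoding layers. This architecture treats the large pretrained model as a strong backbone for learning diverse conditional controls. The trainable copy and the original, locked model are connected with zero convolution layers, whose weights are initialized to zero so that they progressively grow during training. This architecture ensures that harmful noise is not added to the deep features of the large diffusion model at the beginning of training, and protects the large-scale pretrained backbone in the trainable copy from being damaged by such noise.

Our experiments show that ControlNet can control Stable Diffusion with various conditioning inputs, including Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, and depths (Figure 1). We test our approach using a single conditioning image, with or without text prompts, and we demonstrate how our approach supports the composition of multiple conditions. Additionally, we report that the training of ControlNet is robust and scalable on datasets of different sizes, and that for some tasks such as depth-to-image conditioning, training ControlNets on a single NVIDIA RTX 3090Ti GPU can achieve results competitive with industrial models trained on large computation clusters. Finally, we conduct ablative studies to investigate the contribution of each component of our model, and compare our models to several strong conditional image generation baselines with user studies.

In summary, (1) we propose ControlNet, a neural network architecture that can add spatially localized input conditions to a pretrained text-to-image diffusion model via efficient finetuning; (2) we present pretrained ControlNets to control Stable Diffusion, conditioned on Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, and cartoon line drawings; and (3) we validate the method with ablative experiments comparing against several alternative architectures, and conduct user studies against several previous baselines across different tasks.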

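2. Related Work

2.1 Finetuning Neural Networks

One way to finetune a neural network is to directly continue training it with the additional training data. But this approach can lead to overfitting, mode collapse, and catastrophic forgetting. Extensive research has focused on developing finetuning strategies that avoid such issues.

HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community [25], with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs) [4, 18]. Heathen et al. [26] and Kurumuz [43] implement HyperNetworks for Stable Diffusion [71] to change the artistic style of its output images.

Adapter methods are widely used in NLP for customizing a pretrained transformer model to other tasks by embedding new module layers into it [30, 83]. In computer vision, adapters are used for incremental learning [73] and domain adaptation [69]. This technique is often used with CLIP [65] for transferring pretrained backbone models to different tasks [23, 65, 84, 93]. More recently, adapters have yielded successful results in vision transformers [49, 50] and ViT-Adapter [14]. In work concurrent with ours, T2I-Adapter [56] adapts Stable Diffusion to external conditions.

Additive Learning circumvents forgetting by freezing the original model weights and adding a small number of new parameters using learned weight masks [51, 73], pruning [52], or hard attention [79]. Side-Tuning [91] uses a side branch model to learn extra functionality by linearly blending the outputs of a frozen model and an added network, with a predefined blending weight schedule.

Low-Rank Adaptation (LoRA) prevents catastrophic forgetting [31] by learning the offset of parameters with low-rank matrices, based on the observation that many over-parameterized models reside in a low intrinsic dimension subspace [2, 47].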
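To make the low-rank offset concrete, the following is a minimal sketch (not code from the paper or from any LoRA library) that counts trainable parameters for a full weight update versus a rank-`r` offset `B @ A`. The layer size 768×768 and the rank 4 are hypothetical values chosen only for illustration.

```python
# Sketch of the LoRA idea: instead of learning a full d x k offset for a
# weight matrix W, learn two small matrices B (d x r) and A (r x k) and
# use W + B @ A. All sizes below are hypothetical.

def full_update_params(d, k):
    """Number of trainable values in a full d x k weight offset."""
    return d * k

def lora_update_params(d, k, r):
    """Number of trainable values in B (d x r) plus A (r x k)."""
    return r * (d + k)

def matmul(a, b):
    """Plain-Python matrix product, used to form the offset B @ A."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d, k, r = 768, 768, 4                      # hypothetical layer size and rank
full = full_update_params(d, k)            # 589824
lora = lora_update_params(d, k, r)         # 6144

# A tiny rank-1 offset for a 2 x 3 weight matrix.
B = [[1.0], [2.0]]                         # 2 x 1
A = [[0.5, 0.0, -1.0]]                     # 1 x 3
offset = matmul(B, A)                      # 2 x 3 matrix of rank 1
```

With these assumed sizes the low-rank offset trains 96 times fewer values than the full update, which is the sense in which restricting the rank limits how far finetuning can move the pretrained weights.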

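Zero-Initialized Layers are used by ControlNet for connecting network blocks. Research on neural networks has extensively discussed the initialization and manipulation of network weights [36, 37, 44, 45, 46, 75, 82, 94]. For example, Gaussian initialization of weights can be less risky than initializing with zeros [1]. More recently, Nichol et al. [58] discussed how to scale the initial weights of convolution layers in a diffusion model to improve training, and their implementation of the "zero module" is an extreme case in which the weights are scaled to zero. Stability's model cards [82] also mention the use of zero weights in neural layers. Manipulating the initial convolution weights is also discussed in ProGAN [36], StyleGAN [37], and Noise2Noise [46].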

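2.2 Image Diffusion

Image Diffusion Models were first introduced by Sohl-Dickstein et al. [80] and have recently been applied to image generation [17, 42]. Latent Diffusion Models (LDM) [71] perform the diffusion steps in a latent image space [19], which reduces the computation cost. Text-to-image diffusion models achieve state-of-the-art image generation results by encoding text inputs into latent vectors via pretrained language models such as CLIP [65]. Glide [57] is a text-guided diffusion model supporting image generation and editing. Disco Diffusion [5] processes text prompts with CLIP guidance. Stable Diffusion [81] is a large-scale implementation of latent diffusion [71]. Imagen [77] directly diffuses pixels using a pyramid structure, without using latent images. Commercial products include DALL-E2 [61] and Midjourney [54].

Controlling Image Diffusion Models facilitates personalization, customization, or task-specific image generation. The image diffusion process directly provides some control over color variation [53] and inpainting [66, 7]. Text-guided control methods focus on adjusting prompts, manipulating CLIP features, and modifying cross-attention [7, 10, 20, 27, 40, 41, 57, 63, 66]. MakeAScene [20] encodes segmentation masks into tokens to control image generation. SpaText [6] maps segmentation masks into localized token embeddings. GLIGEN [48] learns new parameters in the attention layers of diffusion models for grounded generation. Textual Inversion [21] and DreamBooth [74] can personalize content in the generated image by finetuning the image diffusion model with a small set of user-provided example images. Prompt-based image editing [10, 33, 85] provides practical tools for manipulating images with prompts. Voynov et al. [87] propose an optimization method that fits the diffusion process with sketches. Concurrent works [8, 9, 32, 56] examine a wide variety of ways to control diffusion models.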

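2.3 Image-to-Image Translation

Conditional GANs [15, 34, 62, 89, 92, 96, 97, 98] and transformers [13, 19, 67] can learn the mapping between different image domains. For example, Taming Transformer [19] is a vision transformer approach; Palette [76] is a conditional diffusion model trained from scratch; PITI [88] is a pretraining-based conditional diffusion model for image-to-image translation. Manipulating pretrained GANs can handle specific image-to-image tasks, e.g., StyleGANs can be controlled by extra encoders [70], with more applications studied in [3, 22, 38, 39, 55, 59, 64, 70].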

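3. Method

ControlNet is a neural network architecture that can enhance large pretrained text-to-image diffusion models with spatially localized, task-specific image conditions. We first introduce the basic structure of a ControlNet in Section 3.1 and then describe how we apply a ControlNet to the image diffusion model Stable Diffusion [71] in Section 3.2. We elaborate on our training in Section 3.3 and detail several extra considerations during inference, such as composing multiple ControlNets, in Section 3.4.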

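3.1 ControlNet

ControlNet injects additional conditions into the blocks of a neural network (Figure 2). Herein, we use the term network block to refer to a set of neural layers that are commonly put together to form a single unit of a neural network, e.g., a resnet block, conv-bn-relu block, multi-head attention block, or transformer block. Suppose $\mathcal{F} \left ( \cdot ;\ \Theta \right )$ is such a trained neural block, with parameters $\Theta$ , that transforms an input feature map $\boldsymbol{x}$ into another feature map $\boldsymbol{y}$ as

$$\boldsymbol{y}=\mathcal{F} \left ( \boldsymbol{x};\ \Theta \right ) \tag{1}$$

In our setting, $\boldsymbol{x}$ and $\boldsymbol{y}$ are usually 2D feature maps, i.e., $\boldsymbol{x}\in\mathbb{R}^{h\times w\times c}$ , with $\left\{h,\ w,\ c\right\}$ being the height, width, and number of channels of the map, respectively (Figure 2a).

To add a ControlNet to such a pretrained neural block, we lock (freeze) the parameters $\Theta$ of the original block and simultaneously clone the block into a trainable copy with parameters $\Theta_{\rm c}$ (Figure 2b). The trainable copy takes an external conditioning vector $\boldsymbol{c}$ as input. When this structure is applied to large models such as Stable Diffusion, the locked parameters preserve the production-ready model trained with billions of images, while the trainable copy reuses this large-scale pretrained model to establish a deep, robust, and strong backbone for handling diverse input conditions.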

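Figure 2

Figure 2: A neural block takes a feature map $x$ as input and outputs another feature map $y$, as shown in (a). To add a ControlNet to such a block, we lock the original block and create a trainable copy, then connect them with zero convolution layers, i.e., $1\times 1$ convolutions with both weight and bias initialized to zero. Here $c$ is the conditioning vector that we wish to add to the network, as shown in (b).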

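The trainable copy is connected to the locked model with zero convolution layers, denoted $\mathcal{Z} \left ( \cdot ;\ \cdot \right )$ . Specifically, $\mathcal{Z} \left ( \cdot ;\ \cdot \right )$ is a $1\times 1$ convolution layer with both weight and bias initialized to zero. To build a ControlNet, we use two instances of zero convolutions, with parameters $\Theta_{\rm z1}$ and $\Theta_{\rm z2}$ respectively. The complete ControlNet then computes

$$\boldsymbol{y}_{\rm c}=\mathcal{F} \left ( \boldsymbol{x};\ \Theta \right )+\mathcal{Z} \left ( \mathcal{F} \left ( \boldsymbol{x}+\mathcal{Z} \left ( \boldsymbol{c};\ \Theta_{\rm z1} \right );\ \Theta_{\rm c} \right );\ \Theta_{\rm z2} \right ) \tag{2}$$

where $\boldsymbol{y}_{\rm c}$ is the output of the ControlNet block. In the first training step, since both the weight and bias parameters of a zero convolution layer are initialized to zero, both $\mathcal{Z} \left ( \cdot ;\ \cdot \right )$ terms in Equation (2) evaluate to zero, and

$$\boldsymbol{y}_{\rm c}=\boldsymbol{y} \tag{3}$$

In this way, harmful noise cannot influence the hidden states of the neural network layers in the trainable copy when training starts. Moreover, since $\mathcal{Z} \left (\boldsymbol{c};\ \Theta_{\rm z1}\right)=\boldsymbol{0}$ and the trainable copy also receives the input image $\boldsymbol{x}$ , the trainable copy is fully functional and retains the capabilities of the large pretrained model, allowing it to serve as a strong backbone for further learning. Zero convolutions protect this backbone by eliminating random noise as gradients in the initial training steps. We detail the gradient calculation for zero convolutions in the supplementary materials.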
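The behaviour at the first training step can be checked numerically. The sketch below is an illustration only, not the authors' implementation: it evaluates the ControlNet block on a single pixel, where a $1\times 1$ convolution reduces to a matrix-vector product over channels. `toy_block` is a hypothetical stand-in for the pretrained block $\mathcal{F}$, and all input values are made up.

```python
# One-pixel sketch of a ControlNet block with zero convolutions.
# A 1x1 convolution treats every pixel independently, so on one pixel it is
# just out[i] = sum_j weight[i][j] * x[j] + bias[i].

def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to the channel vector of one pixel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def zero_conv_params(channels):
    """Zero convolution: weight and bias both start at zero."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels

def toy_block(x, theta):
    """Hypothetical stand-in for a pretrained block F(x; Theta): scale, then ReLU."""
    return [max(0.0, theta * v) for v in x]

def controlnet_block(x, c, theta, theta_c, z1, z2):
    """y_c = F(x; Theta) + Z(F(x + Z(c; z1); Theta_c); z2)."""
    y = toy_block(x, theta)                                  # locked block
    injected = [a + b for a, b in zip(x, conv1x1(c, *z1))]   # x + Z(c; z1)
    y_copy = toy_block(injected, theta_c)                    # trainable copy
    return [a + b for a, b in zip(y, conv1x1(y_copy, *z2))]  # add Z(.; z2)

x = [0.5, -1.0, 2.0]      # made-up input features
c = [3.0, 1.0, -2.0]      # made-up condition features
theta = 1.5               # the trainable copy starts as a clone, so theta_c = theta
z1, z2 = zero_conv_params(3), zero_conv_params(3)

y = toy_block(x, theta)
y_c = controlnet_block(x, c, theta, theta, z1, z2)
```

Here `y_c` equals `y` exactly, so the condition `c` has no effect on the output until the zero convolutions have been updated by training.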

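3.2 ControlNet for Text-to-Image Diffusion

We use Stable Diffusion [71] as an example to show how ControlNet can add conditional control to a large pretrained diffusion model. Stable Diffusion is essentially a U-Net [72] with an encoder, a middle block, and a skip-connected decoder. Both the encoder and the decoder contain 12 blocks, and the full model contains 25 blocks, including the middle block. Of the 25 blocks, 8 are down-sampling or up-sampling convolution layers, while the other 17 are main blocks that each contain 4 resnet layers and 2 Vision Transformers (ViTs). Each ViT contains several cross-attention and self-attention mechanisms. For example, in Figure 3a, "SD Encoder Block A" contains 4 resnet layers and 2 ViTs, while "×3" indicates that this block is repeated three times. Text prompts are encoded using the CLIP text encoder [65], and diffusion timesteps are encoded with a time encoder using positional encoding.

The ControlNet structure is applied to each encoder level of the U-net (Figure 3b). In particular, we use ControlNet to create a trainable copy of the 12 encoding blocks and 1 middle block of Stable Diffusion. The 12 encoding blocks are in 4 resolutions (64×64, 32×32, 16×16, 8×8), with each one replicated 3 times. The outputs are added to the 12 skip connections and 1 middle block of the U-net. Since Stable Diffusion is a typical U-net structure, this ControlNet architecture is likely to be applicable to other models.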
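The block counts quoted above can be tallied in a few lines. This is only bookkeeping for the numbers in the text; placing the middle block at the lowest resolution (8×8) is an assumption made for this sketch rather than something stated in the paragraph.

```python
# Tally of the U-net blocks and ControlNet copies described in the text.
RESOLUTIONS = [64, 32, 16, 8]        # encoder resolutions (square feature maps)
BLOCKS_PER_RESOLUTION = 3            # each resolution is replicated 3 times

encoder_blocks = len(RESOLUTIONS) * BLOCKS_PER_RESOLUTION   # 12
middle_blocks = 1
decoder_blocks = encoder_blocks                              # 12, mirrors the encoder
total_blocks = encoder_blocks + middle_blocks + decoder_blocks

# ControlNet copies the 12 encoder blocks and the middle block.
controlnet_copies = encoder_blocks + middle_blocks

# One ControlNet output per copied block; the middle block is assumed to sit
# at the lowest resolution for the purpose of this sketch.
controlnet_output_sizes = [
    r for r in RESOLUTIONS for _ in range(BLOCKS_PER_RESOLUTION)
] + [RESOLUTIONS[-1]]
```

The 13 entries of `controlnet_output_sizes` correspond to the 12 skip connections plus the middle block that receive ControlNet outputs.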

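Figure 3

Figure 3: Stable Diffusion's U-net architecture connected with a ControlNet on the encoder blocks and middle block. The locked, gray blocks show the structure of Stable Diffusion V1.5 (or V2.1, as they use the same U-net architecture). The trainable blue blocks and the white zero convolution layers are added to build a ControlNet.

The way we connect the ControlNet is computationally efficient: since the locked copy's parameters are frozen, no gradient computation is required in the originally locked encoder for the finetuning. This approach speeds up training and saves GPU memory. As tested on a single NVIDIA A100 PCIE 40GB, optimizing Stable Diffusion with ControlNet requires only about 23% more GPU memory and 34% more time in each training iteration, compared to optimizing Stable Diffusion without ControlNet.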

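Image diffusion models learn to progressively denoise images and generate samples from the training domain. The denoising process can occur in pixel space or in a latent space encoded from training data. Stable Diffusion uses latent images as the training domain, as working in this space has been shown to stabilize the training process [71]. Specifically, Stable Diffusion uses a pre-processing method similar to VQ-GAN [19] to convert 512×512 pixel-space images into smaller 64×64 latent images. To add ControlNet to Stable Diffusion, we first convert each input conditioning image (e.g., edge, pose, depth) from an input size of 512×512 into a 64×64 feature space vector that matches the size of Stable Diffusion. In particular, we use a tiny network $\mathcal{E} \left ( \cdot \right )$ of four convolution layers with 4×4 kernels and 2×2 strides (activated by ReLU, using 16, 32, 64, and 128 channels respectively, initialized with Gaussian weights and trained jointly with the full model) to encode an image-space condition $\boldsymbol{c}_{\rm i}$ into a feature space conditioning vector $\boldsymbol{c}_{\rm f}$ as

$$\boldsymbol{c}_{\rm f}=\mathcal{E} \left ( \boldsymbol{c}_{\rm i} \right ) \tag{4}$$

The conditioning vector $\boldsymbol{c}_{\rm f}$ is passed into the ControlNet.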

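3.3 Training

Given an input image $\boldsymbol{z}_0$ , image diffusion algorithms progressively add noise to the image and produce a noisy image $\boldsymbol{z}_t$ , where $t$ represents the number of times noise is added. Given a set of conditions including the time step $t$ , text prompts $\boldsymbol{c}_t$ , and a task-specific condition $\boldsymbol{c}_{\rm f}$ , image diffusion algorithms learn a network $\epsilon_\theta$ to predict the noise added to the noisy image $\boldsymbol{z}_t$ with

$$\mathcal{L}=\mathbb{E}_{\boldsymbol{z}_0,\ t,\ \boldsymbol{c}_t,\ \boldsymbol{c}_{\rm f},\ \epsilon\sim\mathcal{N}\left(0,\ 1\right)}\left[\left\|\epsilon-\epsilon_\theta\left(\boldsymbol{z}_t,\ t,\ \boldsymbol{c}_t,\ \boldsymbol{c}_{\rm f}\right)\right\|_2^2\right] \tag{5}$$

where $\mathcal{L}$ is the overall learning objective of the entire diffusion model. This learning objective is directly used in finetuning diffusion models with ControlNet.

In the training process, we randomly replace 50% of the text prompts $\boldsymbol{c}_{t}$ with empty strings. This approach increases ControlNet's ability to directly recognize semantics in the input conditioning images (e.g., edges, poses, depth) as a replacement for the prompt.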
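The two training ingredients described above, a noise-prediction loss and random prompt dropout, can be sketched in plain Python. This is a hedged illustration rather than the authors' training code: the function names, the seed, and the sample prompt are assumptions, and the loss is written for a single sample as the squared L2 distance between the true and predicted noise.

```python
import random

def drop_prompts(prompts, rng, drop_prob=0.5):
    """Replace each prompt with the empty string with probability drop_prob."""
    return ["" if rng.random() < drop_prob else p for p in prompts]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))

rng = random.Random(0)                          # fixed seed, for repeatability only
prompts = ["a chef in the kitchen"] * 10000     # made-up batch of identical prompts
dropped = drop_prompts(prompts, rng)
empty_fraction = sum(p == "" for p in dropped) / len(dropped)

# Toy loss value: only the second component differs, by 2.0, so the loss is 4.0.
loss = noise_prediction_loss([1.0, 2.0], [1.0, 0.0])
```

Over many samples `empty_fraction` lands close to 0.5, so roughly half of the training steps force the model to rely on the conditioning image alone.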

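In the training process, since zero convolutions do not add noise to the network, the model should always be able to predict high-quality images. We observe that the model does not gradually learn the control conditions but abruptly succeeds in following the input conditioning image, usually in fewer than 10K optimization steps. As shown in Figure 4, we call this the "sudden convergence phenomenon".

Figure 4

Figure 4: The sudden convergence phenomenon. Due to the zero convolutions, ControlNet always predicts high-quality images during the entire training. At a certain step in the training process (e.g., the 6133 steps marked in bold), the model suddenly learns to follow the input condition.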

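3.4 Inference

We can further control how the extra conditions of ControlNet affect the denoising diffusion process in several ways.

Classifier-free guidance resolution weighting. Stable Diffusion depends on a technique called Classifier-Free Guidance (CFG) [29] to generate high-quality images. CFG is formulated as $\epsilon_{\rm prd}=\epsilon_{\rm uc}+\beta_{\rm cfg}\left(\epsilon_{\rm c}-\epsilon_{\rm uc}\right)$ , where $\epsilon_{\rm prd},\ \epsilon_{\rm uc},\ \epsilon_{\rm c},\ \beta_{\rm cfg}$ are the model's final output, unconditional output, conditional output, and a user-specified weight, respectively. When a conditioning image is added via ControlNet, it can be added to both $\epsilon_{\rm uc}$ and $\epsilon_{\rm c}$ , or only to $\epsilon_{\rm c}$ . In challenging cases, e.g., when no prompt is given, adding it to both $\epsilon_{\rm uc}$ and $\epsilon_{\rm c}$ will completely remove CFG guidance (Figure 5b); using only $\epsilon_{\rm c}$ will make the guidance very strong (Figure 5c). Our solution is to first add the conditioning image to $\epsilon_{\rm c}$ and then multiply each connection between Stable Diffusion and ControlNet by a weight $w_i$ according to the resolution of each block, $w_{i}=64/h_{i}$ , where $h_i$ is the size of the $i$-th block, e.g., $h_{1}=8,\ h_{2}=16,\ \cdots,\ h_{13}=64$ . By reducing the CFG guidance strength, we can achieve the result shown in Figure 5d, and we call this CFG Resolution Weighting.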
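As a small worked example, the CFG formula and the per-block weight $w_i=64/h_i$ can be evaluated directly. The sketch below follows the two formulas exactly as they are written in the text; the guidance weight 7.5 and the two-component noise vectors are hypothetical values, and the code does not model how the weights are wired into the network.

```python
def cfg(eps_uc, eps_c, beta_cfg):
    """Classifier-free guidance: eps_uc + beta_cfg * (eps_c - eps_uc), element-wise."""
    return [uc + beta_cfg * (c - uc) for uc, c in zip(eps_uc, eps_c)]

def resolution_weight(h, base=64):
    """Per-connection weight w_i = 64 / h_i, as given in the text."""
    return base / h

# Weights for the four block sizes mentioned in the text.
weights = {h: resolution_weight(h) for h in (8, 16, 32, 64)}

# Toy guidance step with made-up unconditional and conditional outputs.
eps_prd = cfg([0.0, 1.0], [1.0, 3.0], 7.5)      # [7.5, 16.0]

# With beta_cfg = 1 the result is simply the conditional output.
eps_same = cfg([0.0, 1.0], [1.0, 3.0], 1.0)     # [1.0, 3.0]
```

The second call shows why the choice of where to add the condition matters: the guidance term only has an effect when the conditional and unconditional outputs differ.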

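Composing multiple ControlNets. To apply multiple conditioning images (e.g., Canny edges and pose) to a single instance of Stable Diffusion, we can directly add the outputs of the corresponding ControlNets to the Stable Diffusion model (Figure 6). No extra weighting or linear interpolation is necessary for such composition.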
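Direct addition of several ControlNet outputs can be illustrated on a single connection. This is a minimal sketch with made-up feature values; `compose_controlnets` and `apply_to_skip` are hypothetical helper names, not functions from any released code.

```python
def compose_controlnets(outputs_per_net):
    """Element-wise sum of the outputs of several ControlNets for one connection."""
    return [sum(vals) for vals in zip(*outputs_per_net)]

def apply_to_skip(skip_feature, outputs_per_net):
    """Add the combined ControlNet output to one skip-connection feature."""
    combined = compose_controlnets(outputs_per_net)
    return [s + r for s, r in zip(skip_feature, combined)]

canny_out = [0.5, -0.25, 0.0]    # made-up output of a Canny-edge ControlNet
pose_out = [0.25, 0.25, 1.0]     # made-up output of a pose ControlNet
skip = [1.0, 1.0, 1.0]           # made-up Stable Diffusion skip feature

result = apply_to_skip(skip, [canny_out, pose_out])   # [1.75, 1.0, 2.0]
```

Because the combination is a plain sum, adding a third condition only requires appending its output to the list.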

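Figure 5

Figure 5: Effect of Classifier-Free Guidance (CFG) and the proposed CFG Resolution Weighting (CFG-RW).

Figure 6

Figure 6: Composition of multiple conditions. We present the application of using depth and pose simultaneously.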

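4. Experiments

We implement ControlNets with Stable Diffusion to test various conditions, including Canny Edge [11], Depth Map [68], Normal Map [86], M-LSD lines [24], HED soft edge [90], ADE20K segmentation [95], Openpose [12], and user sketches. See also the supplementary material for examples of each condition along with detailed training and inference parameters.

4.1 Qualitative Results

Figure 1 shows the generated images in several prompt settings. Figure 7 shows our results with various conditions without prompts, where ControlNet robustly interprets the content semantics of diverse input conditioning images.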

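Figure 7

Figure 7: Controlling Stable Diffusion with various conditions without prompts. The top row shows the input conditions, while all other rows are outputs. We use the empty string as the input prompt. All models are trained with general-domain data. The model has to recognize semantic contents in the input conditioning images to generate images.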

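4.2 Ablative Study

We study alternative structures of ControlNets by (1) replacing the zero convolutions with standard convolution layers initialized with Gaussian weights, and (2) replacing each block's trainable copy with a single convolution layer, which we call ControlNet-lite. See also the supplementary material for the full details of these ablative structures.

We present 4 prompt settings to test the possible behaviors of real-world users: (1) no prompt; (2) insufficient prompts that do not fully cover the objects in the conditioning images, e.g., the default prompt of this paper, "a high-quality, detailed, and professional image"; (3) conflicting prompts that change the semantics of the conditioning images; (4) perfect prompts that describe the necessary content semantics, e.g., "a nice house". Figure 8a shows that ControlNet succeeds in all 4 settings. The lightweight ControlNet-lite (Figure 8c) is not strong enough to interpret the conditioning images and fails in the insufficient-prompt and no-prompt conditions. When the zero convolutions are replaced, the performance of ControlNet drops to about the same as ControlNet-lite, indicating that the pretrained backbone of the trainable copy is destroyed during finetuning (Figure 8b).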

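4.3 Quantitative Evaluation

User study. We sample 20 unseen hand-drawn sketches and then assign each sketch to 5 methods: PITI [88]'s sketch model, Sketch-Guided Diffusion (SGD) [87] with the default edge-guidance scale ( $\beta=1.6$ ), SGD [87] with a relatively high edge-guidance scale ( $\beta=3.2$ ), the aforementioned ControlNet-lite, and ControlNet. We invited 12 users to rank these 20 groups of 5 results individually in terms of "the quality of displayed images" and "the fidelity to the sketch". In this way, we obtain 100 rankings for result quality and 100 for condition fidelity. We use the Average Human Ranking (AHR) as a preference metric, where users rank each result on a scale of 1 to 5 (lower is worse). The average rankings are shown in Table 1.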
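An average ranking of this kind is a per-method mean over all collected rankings. The sketch below shows the arithmetic on two made-up rankings of five placeholder methods; the method names and scores are purely illustrative and are not results from the study.

```python
def average_human_ranking(rankings):
    """Mean rank score per method; each ranking maps method -> score (1 = worst, 5 = best)."""
    methods = rankings[0].keys()
    return {m: sum(r[m] for r in rankings) / len(rankings) for m in methods}

# Two made-up rankings over five placeholder methods (illustration only).
toy_rankings = [
    {"method_a": 1, "method_b": 2, "method_c": 3, "method_d": 4, "method_e": 5},
    {"method_a": 2, "method_b": 1, "method_c": 3, "method_d": 5, "method_e": 4},
]

ahr = average_human_ranking(toy_rankings)
# method_a: 1.5, method_b: 1.5, method_c: 3.0, method_d: 4.5, method_e: 4.5
```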

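Table 1

Table 1: Average User Ranking (AUR) of result quality and condition fidelity. We report the user preference ranking (1 to 5 indicates worst to best) of different methods.

Comparison to industrial models. Stable Diffusion V2 Depth-to-Image (SDv2-D2I) [82] is trained with a large-scale NVIDIA A100 cluster, thousands of GPU hours, and more than 12M training images. We train a ControlNet for SD V2 with the same depth conditioning but use only 200k training samples, a single NVIDIA RTX 3090Ti, and 5 days of training. We use 100 images generated by each of SDv2-D2I and ControlNet to teach 12 users to distinguish the two methods. Afterwards, we generate 200 images and ask the users to tell which model generated each image. The average precision of the users is 0.52±0.17, indicating that the two methods yield almost indistinguishable results.

Condition reconstruction and FID score. We use the test set of ADE20K [95] to evaluate the conditioning fidelity. The state-of-the-art segmentation method OneFormer [35] achieves an Intersection-over-Union (IoU) of 0.58 on the ground-truth set. We use different methods to generate images with ADE20K segmentations and then apply OneFormer to detect the segmentations again to compute the reconstructed IoUs (Table 2). Besides, we use Frechet Inception Distance (FID) [28] to measure the distribution distance over randomly generated 512×512 image sets using different segmentation-conditioned methods, as well as text-image CLIP scores [65] and CLIP aesthetic scores [78] in Table 3. See also the supplementary material for detailed settings.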
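For readers unfamiliar with the metric, the following sketch computes Intersection-over-Union between two flattened label maps and averages it over classes. The label values are made up, and averaging over the classes that appear is an assumption about the evaluation protocol, which the text does not spell out.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Made-up flattened label maps with three classes.
target = [0, 0, 1, 1, 2, 2]    # labels used as the condition
pred = [0, 1, 1, 1, 2, 0]      # labels re-detected on a generated image

score = mean_iou(pred, target, num_classes=3)
# per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
```

A perfect reconstruction would give a score of 1.0, so lower values indicate that the generated image departs from the segmentation it was conditioned on.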

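Table 2

Table 2: Evaluation of semantic segmentation label reconstruction (ADE20K) with Intersection-over-Union (IoU ↑).

Table 3

Table 3: Evaluation of image generation conditioned on semantic segmentation. We report FID, CLIP text-image score, and CLIP aesthetic score for our method and other baselines. We also report the performance of Stable Diffusion without segmentation conditions. Methods marked with "*" are trained from scratch.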

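4.4 Comparison to Previous Methods

Figure 9 presents a visual comparison of baselines and our method (Stable Diffusion + ControlNet). Specifically, we show the results of PITI [88], Sketch-Guided Diffusion [87], and Taming Transformers [19]. We observe that ControlNet can robustly handle diverse conditioning images and achieves sharp and clean results.

Figure 9

Figure 9: Comparison to previous methods. We present qualitative comparisons to PITI [88], Sketch-Guided Diffusion [87], and Taming Transformers [19].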

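4.5 Discussion

Influence of training dataset sizes. We demonstrate the robustness of ControlNet training in Figure 10. The training does not collapse with a limited 1k images, and it allows the model to generate a recognizable lion. The learning is scalable when more data is provided.

Figure 10

Figure 10: The influence of different training dataset sizes. See also the supplementary material for extended examples.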

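Capability to interpret contents. We showcase ControlNet's capability to capture the semantics of input conditioning images in Figure 11.

Figure 11

Figure 11: Interpreting contents. If the input is ambiguous and the user does not mention object contents in the prompt, the results look like the model tries to interpret the input shapes.

Transferring to community models. Since ControlNets do not change the network topology of pretrained SD models, they can be directly applied to various models in the Stable Diffusion community, such as Comic Diffusion [60] and Protogen 3.4 [16], as shown in Figure 12.

Figure 12

Figure 12: Transferring pretrained ControlNets to community models [16, 60] without training the neural networks again.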

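5. Conclusion

ControlNet is a neural network structure that learns conditional control for large pretrained text-to-image diffusion models. It reuses the large-scale pretrained layers of source models to build a deep and strong encoder for learning specific conditions. The original model and the trainable copy are connected via "zero convolution" layers that eliminate harmful noise during training. Extensive experiments verify that ControlNet can effectively control Stable Diffusion with single or multiple conditions, with or without prompts. Results on diverse conditioning datasets show that the ControlNet structure is likely to be applicable to a wider range of conditions and to facilitate relevant applications.

Acknowledgment

This work was partially supported by the Stanford Institute for Human-Centered AI and the Brown Institute for Media Innovation.

References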

[1] Sadia Afrin. Weight initialization in neural network, inspired by andrew ng, https://medium.com/@safrin1128/weight-initialization-in-neural-network-inspired-by-andrew-ng-e0066dc4a566, 2020.
[2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, Online, Aug. 2021. Association for Computational Linguistics.
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG), 40(4), 2021.
[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
[5] Alembics. Disco diffusion, https://github.com/alembics/disco-diffusion, 2022.
[6] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. arXiv preprint arXiv:2211.14305, 2022.
[7] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[8] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
[9] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. arXiv preprint arXiv:2302.05496, 2023.
[10] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
[11] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
[12] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[13] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
[14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. International Conference on Learning Representations, 2023.
[15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[16] darkstorm2150. Protogen x3.4 (photorealism) official release, https://civitai.com/models/3666/protogen-x34-photorealism-official-release, 2022.
[17] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[18] Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son Hua. Hyperinverter: Improving stylegan inversion via hypernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11389–11398, 2022.
[19] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.
[21] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[22] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
[23] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[24] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
[25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
[26] Heathen. Hypernetwork style training, a tiny guide, stable-diffusion-webui, https://github.com/automatic1111/stable-diffusion-webui/discussions/2670, 2022.
[27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[29] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
[30] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799, 2019.
[31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[32] Lianghua Huang, Di Chen, Yu Liu, Shen Yujun, Deli Zhao, and Zhou Jingren. Composer: Creative and controllable image synthesis with composable conditions. 2023.
[33] Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, and Changsheng Xu. Region-aware diffusion for zero-shot text-driven image editing. arXiv preprint arXiv:2302.11797, 2023.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[35] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One Transformer to Rule Universal Image Segmentation. 2023.
[36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
[37] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[38] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis, 2021.
[39] Oren Katzir, Vicky Perepelook, Dani Lischinski, and Daniel Cohen-Or. Multi-level latent space structuring for generative control. arXiv preprint arXiv:2202.05910, 2022.
[40] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
[41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
[42] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
[43] Kurumuz. Novelai improvements on stable diffusion, https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac, 2022.
[44] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[45] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[46] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. Proceedings of the 35th International Conference on Machine Learning, 2018.
[47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
[48] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023.
[49] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
[50] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429, 2021.
[51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67–82, 2018.
[52] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[53] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
[54] Midjourney. https://www.midjourney.com/, 2023.
[55] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022.
[56] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2022.
[58] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[59] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Daniel Cohen-Or. Mystyle: A personalized generative prior. arXiv preprint arXiv:2203.17272, 2022.
[60] ogkalu. Comic-diffusion v2, trained on 6 styles at once, https://huggingface.co/ogkalu/comic-diffusion, 2022.
[61] OpenAI. Dall-e-2, https://openai.com/product/dall-e-2, 2023.
[62] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[63] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[64] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2085–2094, October 2021.
[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[66] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[67] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[68] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.
[69] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8119–8127, 2018.
[70] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[71] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[72] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI International Conference, pages 234–241, 2015.
[73] Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2018.
[74] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[75] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, Oct. 1986.
[76] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH '22, New York, NY, USA, 2022. Association for Computing Machinery.
[77] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[78] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[79] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
[80] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[81] Stability. Stable diffusion v1.5 model card, https://huggingface.co/runwayml/stable-diffusion-v1-5, 2022.
[82] Stability. Stable diffusion v2 model card, stable-diffusion-2-depth, https://huggingface.co/stabilityai/stable-diffusion-2-depth, 2022.
[83] Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995, 2019.
[84] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. arXiv preprint arXiv:2112.06825, 2021.
[85] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
[86] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463, 2019.
[87] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. 2022.
[88] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. 2022.
[89] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[90] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1395–1403, 2015.
[91] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698–714. Springer, 2020.
[92] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5143–5153, 2020.
[93] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
[94] Jiawei Zhao, Florian Schäfer, and Anima Anandkumar. Zero initialization: Initializing residual networks with only zeros and ones. arXiv, 2021.
[95] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
[96] Xingran Zhou, Bo Zhang, Ting Zhang, Pan Zhang, Jianmin Bao, Dong Chen, Zhongfei Zhang, and Fang Wen. Cocosnet v2: Full-resolution correspondence learning for image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11465–11475, 2021.
[97] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
[98] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. Advances in Neural Information Processing Systems, 30, 2017.