
diffusion model (9): EmuEdit Technical Summary

article 2025/2/19 6:50:38

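Table of contents

- Background
- 1 Core idea
- 2 Method
  - 2.1 Problem formulation
  - 2.2 Data engineering
    - 2.2.1 Defining the image-editing task categories
    - 2.2.2 Generating the instruction set
    - 2.2.3 Generating the image pairs
- 3 Results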

Paper: https://emu-edit.metademolab.com/assets/emu_edit.pdf

Project web: https://emu-edit.metademolab.com/

Code: not open-sourced

Background

After releasing Emu, Meta recently published two more impressive works: EmuEdit and EmuVideo. This post summarizes the techniques behind EmuEdit.


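1 Core idea

The authors formulate instruction-based image editing as a generative task and solve it with a diffusion model. There are two core contributions:

- They define in detail the tasks that instruction-based image editing covers, and design an efficient method for building a high-quality dataset.
- To improve the model's understanding of instructions, they introduce a learnable task embedding. They also propose a training method called task inversion, which extends the model to a new task with only a small amount of data (similar in spirit to textual inversion).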

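2 Method

2.1 Problem formulation

As mentioned above, the authors formulate a family of instruction-based image editing tasks as a generative task and solve it with a diffusion model. Concretely, the task is: given a reference image and a text instruction, output an image that satisfies both conditions. It follows that each training example must be at least a triple, $\mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i}) \mid i = 1, \cdots, N\}$, where

$c_I$ : the reference image (condition of image)

$c_T$ : the reference text (condition of text)

$x$ : the target image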

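With this, the diffusion training objective can be written as:
$\min _ { \theta } \mathbb{E} _ { y , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } ) \Vert _ { 2 } ^ { 2 } ] \tag{1}$
Here $y = (c_T, c_I, x)$ is a training triple, $\epsilon$ is the noise added by the diffusion process, and $z_t$ is the noised latent at timestep $t$. Unlike the classic text-conditioned classifier-free setup, there is an additional reference-image condition $E ( c _ { I } )$ . The conditions are injected as follows: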

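- Following InstructPix2Pix, the image condition is injected at the input layer (concatenated along the channel dimension).
- Following the classifier-free approach, the text condition is injected through cross-attention.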
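
To make objective (1) and the channel-wise concatenation concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for $\epsilon_\theta$, the text condition only enters as a placeholder bias rather than through real cross-attention, and I assume $E(c_I)$ has the same shape as the latent $z_t$.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion step: z_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * x + b * e for x, e in zip(cx, ce)] for cx, ce in zip(x0, eps)]


def concat_channels(z_t, image_cond):
    """Stack E(c_I) onto z_t along the channel axis (the outer list is the channel axis)."""
    return z_t + image_cond


def toy_denoiser(unet_input, t, text_cond):
    """Hypothetical stand-in for eps_theta; returns one noise map per latent channel."""
    n_latent = len(unet_input) // 2            # first half is z_t, second half is E(c_I)
    z_part, img_part = unet_input[:n_latent], unet_input[n_latent:]
    bias = 0.001 * t + 0.1 * sum(text_cond)    # placeholder for timestep and cross-attention effects
    return [[0.5 * z + 0.1 * c + bias for z, c in zip(cz, cc)]
            for cz, cc in zip(z_part, img_part)]


def noise_prediction_loss(x0, image_cond, text_cond, t, alpha_bar, rng):
    """Single-sample estimate of eq. (1): squared L2 distance between true and predicted noise."""
    eps = [[rng.gauss(0.0, 1.0) for _ in ch] for ch in x0]
    z_t = add_noise(x0, eps, alpha_bar)
    pred = toy_denoiser(concat_channels(z_t, image_cond), t, text_cond)
    return sum((e - p) ** 2 for ce, cp in zip(eps, pred) for e, p in zip(ce, cp))


rng = random.Random(0)
x0 = [[0.1, 0.2, 0.3, 0.4] for _ in range(4)]          # toy target latent: 4 channels x 4 values
image_cond = [[0.0, 0.1, 0.0, 0.1] for _ in range(4)]  # toy E(c_I), same shape as the latent
text_cond = [0.2, -0.1, 0.3]                           # toy text embedding
loss = noise_prediction_loss(x0, image_cond, text_cond, t=500, alpha_bar=0.5, rng=rng)
print(f"toy loss: {loss:.4f}")
```

The point to notice is that the denoiser's input has twice as many channels as the latent, while its output keeps the latent's channel count.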

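The authors found experimentally that a model trained this way does not identify the intended task accurately enough. To address this, they introduce a learnable task embedding that strengthens the model's understanding of the task. The objective then becomes: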
$\min _ { \theta , v _ { 1 } , \dots , v _ { k } } \mathbb{E} _ { \hat { y } , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } , v _ { i } ) \Vert _{2}^{2} ] \tag{2}$
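To solve this objective, each element of the training set must be a quadruple, $\mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v^{i}_{j}) \mid i = 1, \cdots, N\}$, where $v_j$ is the task category that the sample belongs to. The noise-prediction network of the diffusion model now takes the task embedding $v_i$ as an extra condition. The authors inject it by adding it to the timestep embedding and by feeding it to the network through cross-attention. This design also keeps the model extensible: when a new task arrives, the objective can be reduced to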
$\min _ { v _ { \mathrm{new} } } \mathbb{E} _ { y , \epsilon , t } [ \Vert \epsilon - \epsilon _ { \theta } ( z _ { t } , t , E ( c _ { I } ) , c _ { T } , v _ { \mathrm{new} } ) \Vert _ { 2 } ^ { 2 } ] \tag{3}$
Here only the newly added task embedding is trained; all other parameters are frozen. The authors call this task inversion (analogous to textual inversion).
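
The idea behind objective (3) can be sketched with a toy model. Everything here is an assumption for illustration: `ToyDenoiser` is a hypothetical linear stand-in for $\epsilon_\theta$, the sinusoidal timestep embedding is a common choice rather than something the post specifies, and the toy targets are generated from a made-up embedding `v_true`. The task embedding is added to the timestep embedding, and gradient descent only ever touches `v_new`.

```python
import math
import random

DIM = 4  # toy embedding width


def timestep_embedding(t, dim=DIM):
    """Sinusoidal timestep embedding (a common choice, assumed here)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * k / half) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


class ToyDenoiser:
    """Hypothetical stand-in for eps_theta; w and u play the role of the frozen weights."""

    def __init__(self, rng):
        self.w = 0.3
        self.u = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    def predict(self, z_t, t, task_emb):
        # The task embedding is added to the timestep embedding before conditioning.
        cond = [a + b for a, b in zip(timestep_embedding(t), task_emb)]
        return self.w * z_t + sum(u * c for u, c in zip(self.u, cond))


def mean_loss(model, samples, task_emb):
    """Average squared noise-prediction error over the toy samples."""
    return sum((eps - model.predict(z, t, task_emb)) ** 2 for z, t, eps in samples) / len(samples)


def task_inversion(model, samples, steps=200, lr=0.05):
    """Gradient descent on v_new only; the model's own weights are never updated."""
    v_new = [0.0] * DIM
    for _ in range(steps):
        for z_t, t, eps in samples:
            err = eps - model.predict(z_t, t, v_new)
            # d/dv (eps - pred)^2 = -2 * err * u, so descend along +2 * err * u
            v_new = [v + lr * 2.0 * err * u for v, u in zip(v_new, model.u)]
    return v_new


rng = random.Random(0)
model = ToyDenoiser(rng)
frozen_before = (model.w, list(model.u))
v_true = [0.5, -0.2, 0.1, 0.3]  # made-up embedding that generated the toy targets
samples = []
for _ in range(8):
    z_t, t = rng.gauss(0.0, 1.0), rng.randrange(1000)
    samples.append((z_t, t, model.predict(z_t, t, v_true)))

loss_before = mean_loss(model, samples, [0.0] * DIM)
v_new = task_inversion(model, samples)
loss_after = mean_loss(model, samples, v_new)
print(f"loss before: {loss_before:.6f}, after: {loss_after:.2e}")
```

In this toy setting the loss drops to nearly zero while `model.w` and `model.u` stay untouched, which mirrors the "freeze everything except the new embedding" recipe.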

At inference time the user does not need to supply a task index: the authors trained a task-index predictor based on Flan-T5-XL that predicts the task index from the user's instruction.

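In principle, the method above is not hard to come up with. The paper's strong results depend on the training dataset. Next, let's look at how the authors build a high-quality dataset efficiently.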

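2.2 Data engineering

As noted earlier, each training example for an image-editing diffusion model must be at least a triple $\mathcal{D} = \{(c_T^{i}, c_I^{i}, x^{i}) \mid i = 1, \cdots, N\}$ (where $c_I$ is the reference image (condition of image), $c_T$ is the reference text (condition of text), and $x$ is the target image). Building such a dataset by hand is very expensive, open-source datasets are not large enough, and the large synthetic datasets that exist lack diversity and quality. A cheap way to build a high-quality, large-scale, highly diverse image-editing dataset is therefore needed. To support task inversion, each element of the new dataset should be a quadruple $\mathcal{D}_{new} = \{(c_T^{i}, c_I^{i}, x^{i}, v^{i}_{j}) \mid i = 1, \cdots, N\}$, where $v_j$ is the task category that the sample belongs to.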
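
As a rough illustration of the quadruple, a training record could be modelled as below. The field names, the file paths, and the `TASKS` list are my own placeholders (only the "Add" task is named in this post); the paper does not prescribe a storage format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Illustrative task vocabulary; only "Add" appears in this post, the rest are placeholders.
TASKS = ["Add", "Remove", "Global"]


@dataclass(frozen=True)
class EditSample:
    instruction: str   # c_T: the text condition
    input_image: str   # c_I: path of the reference image (placeholder path)
    target_image: str  # x: path of the target image (placeholder path)
    task_index: int    # j: selects the task embedding v_j

    def __post_init__(self):
        if not 0 <= self.task_index < len(TASKS):
            raise ValueError(f"unknown task index {self.task_index}")


def group_by_task(samples: List[EditSample]) -> Dict[str, List[EditSample]]:
    """Bucket samples by task name, e.g. to check how balanced the dataset is."""
    buckets: Dict[str, List[EditSample]] = defaultdict(list)
    for s in samples:
        buckets[TASKS[s.task_index]].append(s)
    return dict(buckets)


dataset = [
    EditSample("include a hat", "cat.png", "cat_hat.png", TASKS.index("Add")),
    EditSample("remove the mojito", "cat.png", "cat_no_mojito.png", TASKS.index("Remove")),
    EditSample("place a bird by the yellow flower", "rings.png", "rings_bird.png", TASKS.index("Add")),
]
print({name: len(items) for name, items in group_by_task(dataset).items()})
```

Storing the task as an index rather than a string keeps it directly usable as a lookup key into the table of task embeddings.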

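2.2.1 Defining the image-editing task categories

The authors divide image editing into three broad categories: Region-based Editing, Free-Form Editing, and Vision tasks, each containing several sub-tasks. A figure in the paper shows what each image-editing task does.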


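2.2.2 Generating the instruction set

Task definition: given an image caption and an editing task, output a new caption that satisfies the editing task.

Input: image caption + editing task
Output: the edit instruction, which should contain: 1) the edit command; 2) the edited object; 3) the new image caption; 4) the original object. (Section 7.2 of the paper mentions this last field, but the example in Section 7.1 does not include it. It really should be present, otherwise the later mask extraction cannot be carried out.)

For example (for the Add image-editing task):

Input: {"image_caption": "Beautiful cat with mojito sitting in a cafe on the street", "task": "Add"}

Output: {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", "original object": "cat"}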
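
Since the LLM is asked to reply in JSON, its output has to be checked before it is used downstream. The sketch below is my own assumption of what such a check might look like, not something described in the paper: `parse_edit_instruction` is a hypothetical helper, and the rule that the edited object must appear in the new caption is a heuristic that only makes sense for the Add task.

```python
import json

REQUIRED_FIELDS = ("edit", "edited object", "output")


def parse_edit_instruction(raw: str) -> dict:
    """Parse one LLM reply and reject it if a required field is missing or empty."""
    # Typeset text often turns straight quotes into curly ones, which json rejects.
    cleaned = raw.replace("\u201c", '"').replace("\u201d", '"')
    record = json.loads(cleaned)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS
               if not isinstance(record.get(f), str) or not record[f].strip()]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Heuristic for the Add task (an assumption): the added object should be named in the new caption.
    if record["edited object"].lower() not in record["output"].lower():
        raise ValueError("edited object does not appear in the output caption")
    return record


reply = ('{"edit": "include a hat", "edited object": "hat", '
         '"output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", '
         '"original object": "cat"}')
record = parse_edit_instruction(reply)
print(record["edit"], "->", record["edited object"])
```

A check like this also catches misspelled object names, which matter because the edited object is later used to locate the region to edit.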

The authors use in-context learning to achieve this. Their prompt construction is shown below. (The LLM they use is a fine-tuned 70B Llama 2; I tried ChatGPT and it achieves a similar effect.)

import random 
import torch 
from random import choice, shuffle

few_shot_examples = [
    """
    [INST]User: "Beautiful cat with mojito sitting in a cafe on the street"[/INST] 
    Assistant: {"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street"}""",
    """
    [INST]User: "robot playing chess at home."[/INST] 
    Assistant: {"edit": "add a cheerful smiling face.", "edited object": "robot", "output": "robot playing chess at home with a cheerful smiling face."} """,
    """
    [INST]User: "A cute creature sits at the beach."[/INST] 
    Assistant: {"edit": "set a dog besides the creature", "edited object": "dog", "output": "A cute creature and a dog sit at the beach."} """,
    """
    [INST]User: "Superhero on the street in sunny day working on his tablet."[/INST] 
    Assistant: {"edit": "put a vintage tie on the superhero.", "edited object": "tie", "output": "Superhero with a vintage tie on the street in sunny day working on his tablet."} """,
    """
    [INST]User: "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy"[/INST] 
    Assistant: {"edit": "together with a dog on the left", "edited object": "dog", "output": "Picture clouds, birds, the wind, foliage, rainbow, hill, art, pair, guy, dog on the left"} """,
    """
    [INST]User: "horse on a red Boat Near Mountains During Golden Hour"[/INST] 
    Assistant: {"edit": "give the horse sunglassess", "edited object": "sunglassess", "output": "horse with sunglassess on a red Boat Near Mountains During Golden Hour"} """,
    """
    [INST]User: "An animal family on studio background."[/INST] 
    Assistant: {"edit": "make them hold a teddy bear.", "edited object": "teddy bear", "output": "An animal family holding a teddy bear on studio background."} """,
    """
    [INST]User: "Baked Salmon With Bell Peppers"[/INST] 
    Assistant: {"edit": "insert kale pesto to the dish", "edited object": "kale pesto", "output": "Baked Salmon With Kale Pesto And Bell Peppers"} """,
    """
    [INST]User: "An airplaine is flying in the sky in rainy day."[/INST] 
    Assistant: {"edit": "add flowers in the windows", "edited object": "flowers", "output": "An airplaine with flowers in the windows is flying in the sky in rainy day."} """,
    """
    [INST]User: "photo of mountains and trees"[/INST] 
    Assistant: {"edit": "position a castle between the trees", "edited object": "castle", "output": "photo of mountains, trees and castle between the trees"} """,
    """
    [INST]User: "Little bunny in the park"[/INST] 
    Assistant: {"edit": "Make the bunny play with a kite.", "edited object": "kite", "output": "Little bunny playing with a kite in the park"} """,
    """
    [INST]User: "Attic Bedroom With Large Ceilings"[/INST] 
    Assistant: {"edit": "decorate the room with beautiful chandeliers", "edited object": "chandeliers", "output": "Attic Bedroom With Beautiful Chandeliers on Large Ceilings"} """,
    """
    [INST]User: "Wedding rings and yellow flower on a red background"[/INST]  
    Assistant: {"edit": "place a bird by the yellow flower", "edited object": "bird", "output": "Wedding rings, a bird, and yellow flower on a red background"} """,
    """
    [INST]User: "Tree Near the lake in the morning"[/INST] 
    Assistant: {"edit": "Give it autumn leaves on top", "edited object": "leaves", "output": "Tree with autumn leaves on top Near the lake in the morning"} """,
    """
    [INST]User: "robot and alien sitting on hanging bridge at daytime"[/INST]  
    Assistant: {"edit": "make them hold three books.", "edited object": "three books", "output": "robot and alien holding three books while sitting on hanging bridge at daytime"} """,
    """
    [INST]User: "Skogafoss waterfall in the south of Iceland"[/INST]  
    Assistant: {"edit": "Set a colorful rainbow in the background!", "edited object": "rainbow", "output": "Skogafoss waterfall with a colorful rainbow in the south of Iceland"} """,
    """
    [INST]User: "Polar Bear with rubber gloves pushing shopping carts"[/INST]  
    Assistant: {"edit": "Make it wear a coat", "edited object": "coat", "output": "Polar Bear with a coat pushing shopping carts"}
    """
    ]


def get_content_instruction(new_prompt):
    # Randomly pick the verb used to phrase the "Add" task, so the generated instructions vary.
    optional_verbs = choice(["include", "place", "position", "set", "incorporate", "alongside", "give", "put", "insert", "together with", "with", "make", "integrate", "have", "append", "make", "add", "include"])
    # System message: restrict the assistant to JSON output.
    system_message = """
        <<SYS>>You are an assistant that only speaks JSON. Do not write normal text. The assistant answer is 
            JSON with the following string fields: 'edit', 'edited object','output'. 
            Here is the latest conversation between Assistant and User.
        <</SYS>>
        """
    # Introduction message. Literal braces are doubled so the f-string does not parse them as placeholders.
    intro_message = f"""
    [INST]User: Hi, My job is to take a given caption ('input') and to output the following: an instruction for {optional_verbs} an object to the image ('edit'), the object to {optional_verbs} ('edited object'), and the caption with the object ('output'). Please help me do it.
        I will give you the 'input', and you will help. When you reply, use the following format: {{"edit": '<instruction>', 'edited object': '<object>', 'output': '<caption>'}}
    [/INST]
    Assistant: Sure, I'd be happy to help! Please provide the actual input caption you'd like me to read and I'll assist you with writing an instruction to {optional_verbs} an object to the image, writing the added object and writing the caption with the object."""
    random.seed(torch.randint(1 << 32, ()).item())
    # Shuffle a copy and keep 60% of the examples. Reassigning the global name inside the
    # function would make it local and raise UnboundLocalError on the shuffle call.
    examples = list(few_shot_examples)
    shuffle(examples)
    examples = examples[:int(len(examples) * 0.6)]
    prompt = system_message + intro_message + "".join(examples)
    # Append the test prompt.
    prompt = prompt + f"[INST]User: {new_prompt}[/INST]"
    return prompt

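2.2.3 Generating the image pairs

The steps above give us $c_T$, $c_I$ and $v_j$ from the quadruple $(c_T, c_I, x, v_{j})$, where $c_T$ also carries extra information such as the edited object and the new image caption, for example:

{"edit": "include a hat", "edited object": "hat", "output": "Beautiful cat wearing a hat with mojito sitting in a cafe on the street", "original object": "cat"}

What remains is to produce the paired image ( $x$ ) from these conditions.

Goal: from the input image and the instruction information, generate the paired image ( $x$ ) such that, outside the edited region, $x$ differs from $c_I$ as little as possible.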
$\begin{aligned} &\max \mathrm{SIM}(\mathrm{Instruction}, c_I^{\mathrm{Edit}}) \\ &\min \mathrm{Dist}(x, c_I^{\mathrm{not Edit}}) \end{aligned} \tag{4}$
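
The second line of objective (4) can be checked numerically once an edit mask is available. The sketch below is only an illustration: the post does not fix a concrete $\mathrm{Dist}$, so I assume a mean absolute difference over the pixels outside the mask, and `outside_mask_distance` is a hypothetical helper. The similarity term in the first line would need an image-text model and is not sketched here.

```python
def outside_mask_distance(x, c_i, edit_mask):
    """Mean absolute difference between x and c_I over pixels where edit_mask is 0."""
    total, count = 0.0, 0
    for row_x, row_c, row_m in zip(x, c_i, edit_mask):
        for px, pc, m in zip(row_x, row_c, row_m):
            if not m:                      # pixel lies outside the edited region
                total += abs(px - pc)
                count += 1
    if count == 0:
        raise ValueError("the mask covers the whole image")
    return total / count


# Toy 2x2 grayscale images: the top-right pixel is the edited region.
c_i = [[0.0, 0.0], [0.5, 0.25]]
x = [[0.0, 1.0], [0.5, 0.5]]
edit_mask = [[0, 1], [0, 0]]

leak = outside_mask_distance(x, c_i, edit_mask)
print(f"change outside the edited region: {leak:.4f}")
```

A value close to zero means the edit stayed inside the mask; a large value suggests the generated pair also changed content that should have been preserved.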

To achieve this goal, the authors propose a mask-based attention control method (roughly a combination of DiffEdit and P2P). It consists of the following steps.

Known inputs:

- $\mathrm{Cap_{ori}}$ : the image caption. Example: Beautiful cat with mojito sitting in a cafe on the street
- $\mathrm{Img_{ori}}$ : the image generated from the image caption with the DM
- $\mathrm{Cap_{edit}}$ : the edited image caption. Example: Beautiful cat wearing a hat with mojito sitting in a cafe on the street
- $\mathrm{Obj_{ori}}$ : the original object in the image caption. Example: cat
- $\mathrm{Obj_{edit}}$ : the edited object. Example: hat