
The Illustrated Stable Diffusion

  • 1. The Components of Stable Diffusion
    • 1.1. Image information creator
    • 1.2. Image Decoder
  • 2. What is Diffusion anyway?
    • 2.1. How does diffusion work?
    • 2.2. Painting images by removing noise

https://jalammar.github.io/illustrated-stable-diffusion/

This is a gentle introduction to how Stable Diffusion works.

Keywords: text-to-image, text2img, T2I

[Image]

paradise /ˈpærədaɪs/ n. heaven; a place or state of perfect happiness; an ideal spot for a particular activity or kind of person
cosmic /ˈkɑːzmɪk/ adj. relating to the universe or cosmos; vast in scale
beach /biːtʃ/ n. a sandy or pebbly shore by the sea or a lake; v. to run or haul (a boat) ashore

Stable Diffusion is versatile in that it can be used in a number of different ways. Let's focus first on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (the actual complete prompt is here). Aside from text to image, another main way of using it is by having it alter images (so the inputs are text + image).

versatile /ˈvɜːrsətl/ adj. having many uses or functions; able to do many different things
alter /ˈɔːltər/ v. to change; to modify (e.g., clothing to make it fit better)
pirate /ˈpaɪrət/ n. a sea robber; someone who copies or broadcasts work illegally; v. to plunder; to reproduce without permission; adj. pirated, unauthorized

[Image]
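To make the two modes concrete, here is a minimal sketch using the Hugging Face diffusers library and the publicly released v1.5 weights. The library, the model name, and the prompts are illustrative assumptions on my part, not part of the original article.

```python
# A sketch, not the article's code: assumes `diffusers` is installed and the
# runwayml/stable-diffusion-v1-5 weights are available.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# text2img: text in, image out
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
image = pipe("paradise cosmic beach").images[0]

# img2img: text + image in, altered image out
img2img = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
edited = img2img(prompt="pirate ship on a cosmic beach", image=image, strength=0.6).images[0]
```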

1. The Components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

monolithic /ˌmɑːnəˈlɪθɪk/ adj. formed of a single large block; massive and undifferentiated; n. a monolithic (single-chip) circuit

As we look under the hood, the first observation we can make is that there’s a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.

hood /hʊd/ n. a protective cover over equipment or machinery; a head covering attached to a garment; the folding roof of a car; vt. to cover with a hood

[Image]

We’re starting with a high-level view and we’ll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).
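As a rough sketch of what that encoder does, here is the CLIP text model that SD v1 builds on, loaded via the transformers library. The code itself is my illustration, not the article's.

```python
# Illustrative sketch: the CLIP text encoder used by Stable Diffusion v1.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Stable Diffusion pads/truncates every prompt to a fixed length of 77 tokens
tokens = tokenizer("paradise cosmic beach", padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]): one 768-d vector per token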

That information is then presented to the Image Generator, which is composed of a couple of components itself.

[Image]

The image generator goes through two stages:

1.1. Image information creator

This component is the secret sauce of Stable Diffusion. It’s where a lot of the performance gain over previous models is achieved.

sauce /sɔːs/ n. a liquid seasoning for food; impudence; vt. to season; to be impudent to

This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which often defaults to 50 or 100.

The image information creator works completely in the image information space (or latent space). We’ll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.

The word "diffusion" describes what happens in this component. It is the step-by-step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).
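In diffusers terms, that loop looks roughly like the sketch below. It is a condensed illustration assuming the v1.5 checkpoint; `text_embeddings` is the CLIP output from the earlier sketch, not defined here.

```python
# Condensed sketch of the UNet + scheduler loop (not the article's code).
import torch
from diffusers import UNet2DConditionModel, PNDMScheduler

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = PNDMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

scheduler.set_timesteps(50)                       # the `steps` parameter
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

for t in scheduler.timesteps:                     # runs for multiple steps
    with torch.no_grad():
        # predict the noise in the current latents, conditioned on the text
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```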

[Image]

1.2. Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.
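Continuing the sketch, the decoder is the decode half of a variational autoencoder, and it runs once on the final latents. The 0.18215 scaling constant is the one used by the v1 models; `latents` is assumed to come from the loop above.

```python
# Sketch of the one-time decoding step at the end of the process.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample  # (1, 4, 64, 64) -> (1, 3, 512, 512)
```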

[Image]

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

  • ClipText for text encoding.

Input: text.
Output: 77 token embedding vectors, each with 768 dimensions.

  • UNet + Scheduler to gradually process/diffuse information in the information (latent) space.

Input: text embeddings and a starting multi-dimensional array (structured lists of numbers, also called a tensor) made up of noise.
Output: A processed information array.

  • Autoencoder Decoder that paints the final image using the processed information array.

Input: The processed information array (dimensions: (4,64,64))
Output: The resulting image (dimensions: (3, 512, 512) which are (red/green/blue, width, height))

diffuse /dɪˈfjuːs, dɪˈfjuːz/ adj. spread out; wordy, lacking conciseness; v. (of gas, liquid, light, or ideas) to spread over a wide area; to scatter
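As a quick sanity check on those shapes, here they are as plain tensors. The batch dimension is my addition; this is an illustration, not the article's code.

```python
import torch

text_embeddings = torch.randn(1, 77, 768)  # ClipText output: 77 tokens x 768 dims
latents = torch.randn(1, 4, 64, 64)        # the information (latent) array
image = torch.randn(1, 3, 512, 512)        # decoder output: RGB, width, height
```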

[Image]

2. What is Diffusion anyway?

Diffusion is the process that takes place inside the pink “image information creator” component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

[Image]

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array and see that it translates to visual noise. Visual inspection in this case means passing it through the image decoder.

intuition /ˌɪntuˈɪʃn/ n. the ability to understand something instinctively; an instinctive feeling
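One way to do that inspection, sketched with the same assumed VAE as above: decode the untouched random latents, and the result is pure visual noise.

```python
# Sketch: decoding the random starting latents shows visual noise.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
latents = torch.randn(1, 4, 64, 64)                     # random starting latents
with torch.no_grad():
    noise_image = vae.decode(latents / 0.18215).sample  # looks like TV static
```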

[Image]

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text, as well as all the visual information the model picked up from the images it was trained on.

resemble /rɪˈzembl/ vt. to look like; to be similar to

[Image]

We can visualize a set of these latents to see what information gets added at each step.
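One way to produce such a visualization, sketched by extending the denoising loop above (`unet`, `scheduler`, `vae`, `text_embeddings`, and `latents` are the assumed objects from the earlier sketches):

```python
# Sketch: keep a copy of the latents after every step, then decode a few.
snapshots = []
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
    snapshots.append(latents.clone())

# decode every 10th snapshot to watch the image emerge from the noise
with torch.no_grad():
    frames = [vae.decode(s / 0.18215).sample for s in snapshots[::10]]
```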

[Image]

The process is quite breathtaking to look at.

[Images: the denoising process visualized step by step]

Something especially fascinating happens between steps 2 and 4 in this case. It’s as if the outline emerges from the noise.

emerge /iˈmɜːrdʒ/ v. to come out into view; to become known or apparent; to survive or come out of a difficult situation

[Image]

2.1. How does diffusion work?

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image. We generate some noise and add it to the image.

[Image]

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

[Image]

While this example shows a few noise-amount values, from none (amount 0, the original image) to total noise (amount 4), we can easily control how much noise to add to the image, and so we can spread it over tens of steps, creating tens of training examples per image for all the images in a training dataset.
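A sketch of that example-generation step with a diffusers noise schedule. The 1000-step DDPM schedule and the tensor shapes are my assumptions, not the article's.

```python
# Sketch: one clean example plus a noise schedule yields many training pairs.
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
clean = torch.randn(1, 4, 64, 64)  # stand-in for one training image (as latents)
noise = torch.randn_like(clean)

# the same image at several noise amounts -> several training examples
examples = [scheduler.add_noise(clean, noise, torch.tensor([t]))
            for t in (0, 250, 500, 750, 999)]
```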

[Image]

With this dataset, we can train the noise predictor and end up with a great noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you’ve had ML exposure:
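A minimal version of that training step, using an unconditional toy noise predictor built with diffusers' UNet2DModel. The names and sizes are illustrative; text conditioning is omitted here for brevity.

```python
# Sketch of one training step: predict the added noise, compare, backpropagate.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler(num_train_timesteps=1000)
model = UNet2DModel(sample_size=64, in_channels=4, out_channels=4)  # toy noise predictor

clean = torch.randn(1, 4, 64, 64)     # a training image (as latents)
noise = torch.randn_like(clean)
t = torch.randint(0, 1000, (1,))      # random noise amount (step number)

noisy = scheduler.add_noise(clean, noise, t)
noise_pred = model(noisy, t).sample   # the model predicts the noise slice
loss = F.mse_loss(noise_pred, noise)  # how far off was the prediction?
loss.backward()
```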

[Image]

2.2. Painting images by removing noise

Let’s now see how this can generate images.

The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a slice of noise.

[Image]

The sampled noise is predicted so that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but the distribution - the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, cats look a certain way - pointy ears and clearly unimpressed).

unimpressed /ˌʌnɪmˈprest/ adj. not impressed; showing no admiration or interest
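In its simplest form, the denoising step is just that subtraction. This is a deliberate simplification reusing `model`, `noisy`, and `t` from the training sketch above; real samplers such as `scheduler.step` also rescale the terms.

```python
# Sketch: subtract the predicted noise slice to get a less-noisy image.
import torch

with torch.no_grad():
    noise_pred = model(noisy, t).sample  # predicted slice of noise
less_noisy = noisy - noise_pred          # a step closer to the trained distribution
```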

[Image]

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics, https://laion.ai/blog/laion-aesthetics/, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.

aesthetical /iːsˈθetɪkəl/ adj. aesthetic; relating to beauty or its appreciation
please /pliːz/ int. used in polite requests; v. to make someone happy or satisfied

[Image]


