The diagrams were drawn with ProcessOn. Save the images below to view them at full resolution.

1 UNet

1.0 Introduction

The UNet is responsible for predicting the noise in the latent at each denoising step.

1.1 Detailed overall structure

https://img-blog.csdnimg.cn/eb191cc46d684be49c139166c4d9777c.png

1.2 Reduced version of the overall structure

https://img-blog.csdnimg.cn/19ef136c9a954779b2770dbee906d48f.png

1.3 Time step encoding

https://img-blog.csdnimg.cn/7303e812424643a0ad52bc552fd58cee.png

1.4 CrossAttnDownBlock2D

Each ResnetBlock2D takes two inputs:

  1. The latent output by the previous layer.

  2. time_embeds, the output of the time step encoding module (shape = [2, 1280]; below, "shape =" is omitted, and a bracketed list such as [2, 1280] denotes a tensor shape).
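As a rough illustration of how the two ResnetBlock2D inputs can be combined, here is a minimal NumPy sketch. It is not the diffusers implementation: the helper inject_time_embedding, the random weights, the channel count 320, and the 64x64 spatial size are all assumptions chosen for the example. Only the [2, 1280] shape of time_embeds comes from the text above. The idea shown is that time_embeds is projected to one value per channel and then added to the latent at every spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def inject_time_embedding(latent, time_embeds, weight, bias):
    """Hypothetical helper: project time_embeds to the latent's channel
    count, then add the result to every spatial position of the latent."""
    projected = silu(time_embeds) @ weight + bias   # [batch, channels]
    return latent + projected[:, :, None, None]     # broadcast over H and W

# Assumed example sizes; only the 1280 comes from the article.
batch, channels, height, width = 2, 320, 64, 64
latent = rng.standard_normal((batch, channels, height, width))
time_embeds = rng.standard_normal((batch, 1280))
weight = rng.standard_normal((1280, channels)) * 0.02  # stand-in for learned weights
bias = np.zeros(channels)

out = inject_time_embedding(latent, time_embeds, weight, bias)
print(out.shape)  # (2, 320, 64, 64)
```

The latent keeps its shape. The time step only shifts each channel by a constant, which is why the diagrams can show time_embeds as a side input to every ResnetBlock2D.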

Each Transformer2DModel also takes two inputs:

  1. The output of the previous layer.

  2. The text embedding (also called the prompt embedding) from the CLIP text_encoder, with shape = [2, 77, 768].
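The way a Transformer2DModel uses its two inputs can be sketched as single-head cross-attention, again in NumPy. This is a simplified illustration under assumptions, not the diffusers code: cross_attention, the random projection matrices, the channel count 320, and the 8x8 spatial size are hypothetical. The [2, 77, 768] prompt embedding shape is taken from the text above. Queries come from the previous layer's output, while keys and values come from the prompt embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, prompt_embeds, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention: image tokens attend to text tokens."""
    q = hidden @ w_q                  # [batch, tokens, dim]
    k = prompt_embeds @ w_k           # [batch, 77, dim]
    v = prompt_embeds @ w_v           # [batch, 77, dim]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [batch, tokens, 77]
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Assumed example sizes; only [2, 77, 768] comes from the article.
batch, channels, height, width = 2, 320, 8, 8
latent = rng.standard_normal((batch, channels, height, width))
prompt_embeds = rng.standard_normal((batch, 77, 768))

# Flatten the spatial grid into a sequence of tokens: [batch, H*W, channels].
hidden = latent.reshape(batch, channels, height * width).transpose(0, 2, 1)

w_q = rng.standard_normal((channels, channels)) * 0.02  # stand-ins for learned weights
w_k = rng.standard_normal((768, channels)) * 0.02
w_v = rng.standard_normal((768, channels)) * 0.02

attended, weights = cross_attention(hidden, prompt_embeds, w_q, w_k, w_v)
print(attended.shape)  # (2, 64, 320)
print(weights.shape)   # (2, 64, 77)
```

Each of the 64 image tokens ends up with a weighting over the 77 text tokens, and the output has the same shape as the flattened input, so it can be reshaped back to [batch, channels, height, width] for the next layer.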

Every module that contains a ResnetBlock2D or a Transformer2DModel takes its inputs in this same format. To keep the diagrams uncluttered, the time_embeds and prompt embedding inputs are not drawn for some modules, such as UNetMidBlock2DCrossAttn, UpBlock2D, and CrossAttnUpBlock2D.