Diagrams drawn with ProcessOn. Save the images below to view them at full resolution.
The UNet is responsible for predicting the noise.
Each ResnetBlock2D has two inputs:
One is the latent output by the previous layer.
The other is time_embeds, the output of the timestep encoding module (shape = [2, 1280]; where "shape =" is omitted below, a bracketed list such as [2, 1280] denotes a tensor shape).
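As a rough illustration of how these two inputs can be combined, here is a minimal NumPy sketch. It assumes the time embedding is passed through an activation and a linear projection to the channel count, then added to the latent with broadcasting over height and width. The channel count (320), spatial size (8 x 8), and the names w_time and b_time are assumptions for illustration only, and the real block also contains normalization and convolution layers that are left out here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: only the batch size (2) and time_dim (1280) come from the text.
batch, channels, height, width = 2, 320, 8, 8
time_dim = 1280

latent = rng.standard_normal((batch, channels, height, width))  # output of the previous layer
time_embeds = rng.standard_normal((batch, time_dim))            # timestep encoding, [2, 1280]

# Hypothetical linear projection from the time-embedding width to the channel count.
w_time = rng.standard_normal((time_dim, channels)) * 0.01
b_time = np.zeros(channels)


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


# Project the time embedding to one value per channel: [2, 1280] -> [2, 320].
time_bias = silu(time_embeds) @ w_time + b_time

# Add it to the latent, broadcasting over the height and width axes.
hidden = latent + time_bias[:, :, None, None]

print(hidden.shape)  # (2, 320, 8, 8)
```

The point of the sketch is only the shape handling: the [2, 1280] tensor never needs a spatial dimension of its own, because it is reduced to one offset per channel and broadcast across the feature map.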
Each Transformer2DModel also has two inputs:
One is the output of the previous layer.
The other is the text embedding (prompt embedding) from the CLIP text_encoder, with shape = [2, 77, 768].
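The way these two inputs meet can be sketched as a single-head cross-attention in NumPy: the queries come from the previous layer's feature map, and the keys and values come from the prompt embedding. This is a simplified sketch under stated assumptions, not the module's actual implementation. The channel count (320), spatial size (8 x 8), and the weight names w_q, w_k, and w_v are hypothetical, and multi-head splitting, normalization, self-attention, and the feed-forward layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: only the prompt embedding shape [2, 77, 768] comes from the text.
batch, channels, height, width = 2, 320, 8, 8
seq_len, text_dim = 77, 768

hidden = rng.standard_normal((batch, channels, height, width))   # output of the previous layer
prompt_embeds = rng.standard_normal((batch, seq_len, text_dim))  # CLIP text embedding

# Flatten the feature map into a sequence of spatial tokens: [2, 320, 8, 8] -> [2, 64, 320].
tokens = hidden.reshape(batch, channels, height * width).transpose(0, 2, 1)

# Hypothetical projection weights (single head, no bias).
w_q = rng.standard_normal((channels, channels)) * 0.02
w_k = rng.standard_normal((text_dim, channels)) * 0.02
w_v = rng.standard_normal((text_dim, channels)) * 0.02

q = tokens @ w_q          # queries from the image tokens, [2, 64, 320]
k = prompt_embeds @ w_k   # keys from the prompt, [2, 77, 320]
v = prompt_embeds @ w_v   # values from the prompt, [2, 77, 320]

# Scaled dot-product attention with a numerically stable softmax over the 77 text tokens.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(channels)  # [2, 64, 77]
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

attended = weights @ v    # [2, 64, 320]

# Restore the spatial layout so the next layer sees the same shape it would otherwise get.
out = attended.transpose(0, 2, 1).reshape(batch, channels, height, width)

print(out.shape)  # (2, 320, 8, 8)
```

Note that the output has the same shape as the feature map that went in; the prompt embedding only influences the values, not the shape.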
All modules that contain ResnetBlock2D and Transformer2DModel take inputs in the same format. For convenience, the time_embeds and prompt embedding inputs of some modules, such as UnetMidBlock2DCrossAttn, UpBlock2D, and CrossAttnUpBlock2D, are not drawn in the diagrams.