FluxTransformer2DModel: 12B → 23.81GB (bfloat16)
AutoencoderKL (16ch): → 168MB
# [0] NVIDIA H100 80GB HBM3 | 35°C, 0 % | 695 / 81559 MB | taehoon(686M)
openai/clip-vit-large-patch14 (CLIPTextModel): → 246MB (bfloat16)
google/t5-v1_1-xxl (T5EncoderModel): → 9.52GB (bfloat16)
# [0] NVIDIA H100 80GB HBM3 | 35°C, 0 % | 9084 / 81559 MB | taehoon(10056M) -> bfloat16
# [0] NVIDIA H100 80GB HBM3 | 34°C, 0 % | 18693 / 81559 MB | root(18684M) -> float32
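These footprints are easy to reproduce by loading each component and summing parameter bytes. A minimal sketch, assuming the diffusers layout of the black-forest-labs/FLUX.1-schnell repository (the repo id and subfolder names are my assumption, not part of the measurements above):

import torch
from diffusers import AutoencoderKL, FluxTransformer2DModel

def param_size_gb(model: torch.nn.Module) -> float:
    # Parameter footprint in decimal GB: element count x bytes per element.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9

repo = "black-forest-labs/FLUX.1-schnell"  # assumed repo id
transformer = FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer", torch_dtype=torch.bfloat16)
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=torch.bfloat16)
print(f"transformer: {param_size_gb(transformer):.2f} GB")  # ~23.8 GB in bf16
print(f"vae: {param_size_gb(vae) * 1000:.0f} MB")           # ~168 MB in bf16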
At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It is capable of generating images ranging between 0.25 and 2 megapixel resolution.
For the Medium model specifically, we made several adjustments to the architecture and training protocols to enhance quality, coherence, and multi-resolution generation abilities.
SD3Transformer2DModel: 2.5B → 4.94GB
AutoencoderKL (16ch): → 168MB
FluxTransformer2DModel: 2B → 4.17GB
AutoencoderKL (16ch): → 168MB
google/t5-v1_1-xxl (T5EncoderModel):
  t5xxl_fp16.safetensors
  t5xxl_fp8_e4m3fn.safetensors
CLIPTextModel: clip-vit-large-patch14 variant (428M)
CLIPTextModelWithProjection: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant
name = flux-schnell
batch_size = 1
width = 1360
height = 768
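These defaults also explain the L = 4336 in the shape comments further down: with the assumed 8x VAE downsampling and 2x2 patchification, every 16x16 pixel block becomes one image token, and flux-schnell pads the T5 prompt to 256 tokens.

# Sequence length for a 1360x768 flux-schnell sample (8x VAE downsample and
# 2x2 patchify assumed, i.e. one token per 16x16 pixel block).
width, height = 1360, 768
image_tokens = (width // 16) * (height // 16)  # 85 * 48 = 4080
text_tokens = 256                              # T5 max_length used by schnell
print(image_tokens + text_tokens)              # 4336 -> the L in apply_rope below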
flux/src/flux/math.py at main · black-forest-labs/flux
import torch
from einops import rearrange
from torch import Tensor


def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor) -> Tensor:
    # Rotate Q and K by the positional embedding, then run standard SDPA.
    q, k = apply_rope(q, k, pe)
    x = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    # Merge heads back into the channel dim: [B, H, L, D] -> [B, L, H*D].
    x = rearrange(x, "B H L D -> B L (H D)")
    return x


def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor) -> tuple[Tensor, Tensor]:
    # xq: [1, 24, 4336, 128], bfloat16
    # xk: [1, 24, 4336, 128], bfloat16
    # freqs_cis: [1, 1, 4336, 64, 2, 2], float32
    xq_ = xq.float().reshape(*xq.shape[:-1], -1, 1, 2)
    xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
    # xq_: [1, 24, 4336, 64, 1, 2], float32
    # xk_: [1, 24, 4336, 64, 1, 2], float32
    # 2x2 rotation per channel pair: out_i = R[i,0]*x_0 + R[i,1]*x_1, where
    # R = [[cos, -sin], [sin, cos]] lives in the last two dims of freqs_cis.
    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
    # xq_out: [1, 24, 4336, 64, 2], float32
    # xk_out: [1, 24, 4336, 64, 2], float32
    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
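To sanity-check both functions, you can fabricate tensors with exactly the commented shapes. The rope() builder below mirrors the helper in the same math.py file, but collapsed to a single position axis for simplicity (the real model concatenates rotations over three id axes), so treat it as an illustrative sketch rather than the pipeline's exact positional embedding. It continues from the imports and functions above and runs on CPU:

def rope(pos: Tensor, dim: int, theta: int = 10_000) -> Tensor:
    # Per-position 2x2 rotation matrices, shape [B, L, dim/2, 2, 2].
    assert dim % 2 == 0
    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
    omega = 1.0 / (theta**scale)
    out = torch.einsum("...n,d->...nd", pos, omega)
    out = torch.stack([torch.cos(out), -torch.sin(out), torch.sin(out), torch.cos(out)], dim=-1)
    return rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2).float()

B, H, L, D = 1, 24, 4336, 128  # the shapes from the comments above
q, k, v = (torch.randn(B, H, L, D, dtype=torch.bfloat16) for _ in range(3))
pe = rope(torch.arange(L, dtype=torch.float64)[None], D).unsqueeze(1)  # [1, 1, 4336, 64, 2, 2]
print(attention(q, k, v, pe).shape)  # torch.Size([1, 4336, 3072]) = [B, L, H*D]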
flux/src/flux/modules/layers.py at main · black-forest-labs/flux
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Scaling Vision Transformers to 22 Billion Parameters
Scalable Diffusion Models with Transformers
https://github.com/brayevalerien/Flux.1-Architecture-Diagram/blob/main/flux_architecture_diagram.png
https://www.reddit.com/r/StableDiffusion/comments/1fds59s/a_detailled_flux1_architecture_diagram/
https://blog.csdn.net/qq_62075214/article/details/142494784
Stable Diffusion 1.5 network structure, super detailed (CSDN, in Chinese)
On the Scalability of Diffusion-based Text-to-Image Generation