Model Size

Llama 7B

FLUX.1-dev (same 12B-parameter architecture as FLUX.1-schnell; schnell is the timestep-distilled variant)

# [0] NVIDIA H100 80GB HBM3 | 35°C,   0 % |   695 / 81559 MB | taehoon(686M)
# [0] NVIDIA H100 80GB HBM3 | 35°C,   0 % |  9084 / 81559 MB | taehoon(10056M)  -> bfloat16
# [0] NVIDIA H100 80GB HBM3 | 34°C,   0 % | 18693 / 81559 MB | root(18684M)     -> float32
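
A quick sanity check on those numbers: resident weight memory scales with parameter count × bytes per element, so casting float32 → bfloat16 roughly halves the footprint. A minimal sketch (the parameter count below is a placeholder, not a measured value; real usage adds CUDA context and activation overhead on top):

def weight_memory_mb(n_params: float, bytes_per_param: int) -> float:
    # Rough lower bound: parameters only, no activations or optimizer state.
    return n_params * bytes_per_param / 2**20

n = 4.5e9  # placeholder parameter count for illustration
print(f"bfloat16: {weight_memory_mb(n, 2):,.0f} MB")  # 2 bytes/param
print(f"float32:  {weight_memory_mb(n, 4):,.0f} MB")  # 4 bytes/param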

SD3.5-medium

At 2.5 billion parameters, with an improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It can generate images at resolutions between 0.25 and 2 megapixels.

For the Medium model specifically, we made several adjustments to the architecture and training protocols to enhance quality, coherence, and multi-resolution generation abilities.

SD3

SDXL

SD 2.1

SD 1.5

SD3.5

FLUX


name = flux-schnell
batch_size = 1
width = 1360
height = 768
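
These defaults also explain the sequence length that shows up in the attention shapes below: the VAE downsamples by 8×, the transformer patchifies latents into 2×2 patches, and schnell's default T5 context adds 256 text tokens. A back-of-the-envelope check:

width, height = 1360, 768
lat_w, lat_h = width // 8, height // 8      # VAE downsamples 8x -> 170 x 96
img_tokens = (lat_w // 2) * (lat_h // 2)    # 2x2 patchify -> 85 * 48 = 4080
txt_tokens = 256                            # schnell's default T5 context length
print(img_tokens + txt_tokens)              # 4336 -> the L dim in apply_rope below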

flux/src/flux/modules/layers.py at main · black-forest-labs/flux

Double Stream Block

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

flux_double_stream_block.svg
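
A minimal sketch of the double-stream idea: each modality keeps its own projection weights, but attention runs jointly over the concatenated text + image sequence. Modulation, QK-norm, RoPE, and the MLPs are omitted for brevity, and the module names here are illustrative, not the repo's:

import torch
from torch import nn, Tensor

class DoubleStreamSketch(nn.Module):
    # Separate per-modality QKV/output weights, one joint attention.
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.img_qkv, self.img_proj = nn.Linear(dim, 3 * dim), nn.Linear(dim, dim)
        self.txt_qkv, self.txt_proj = nn.Linear(dim, 3 * dim), nn.Linear(dim, dim)

    def forward(self, img: Tensor, txt: Tensor) -> tuple[Tensor, Tensor]:
        def heads(x: Tensor):  # [B, L, 3*dim] -> q, k, v each [B, H, L, D]
            B, L, _ = x.shape
            return x.view(B, L, 3, self.n_heads, -1).permute(2, 0, 3, 1, 4)

        iq, ik, iv = heads(self.img_qkv(img))
        tq, tk, tv = heads(self.txt_qkv(txt))
        # Joint attention over the concatenated sequence: text tokens first.
        q = torch.cat([tq, iq], dim=2)
        k = torch.cat([tk, ik], dim=2)
        v = torch.cat([tv, iv], dim=2)
        x = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).flatten(2)  # [B, L_txt + L_img, dim]
        t, i = x[:, : txt.shape[1]], x[:, txt.shape[1] :]
        # Each stream gets its own output projection and residual.
        return img + self.img_proj(i), txt + self.txt_proj(t)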

Single Stream Block

Scaling Vision Transformers to 22 Billion Parameters

flux_single_stream_block.svg
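
The single-stream block follows the ViT-22B parallel layout: attention and MLP branch from the same input via one fused input projection and are merged by one fused output projection. Again a sketch under the same caveats (modulation and norms omitted, names illustrative):

import torch
from torch import nn, Tensor

class SingleStreamSketch(nn.Module):
    # Parallel attention + MLP over the already-concatenated txt + img tokens.
    def __init__(self, dim: int, n_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.n_heads = n_heads
        self.mlp_dim = int(dim * mlp_ratio)
        self.linear1 = nn.Linear(dim, 3 * dim + self.mlp_dim)  # qkv + mlp_in
        self.linear2 = nn.Linear(dim + self.mlp_dim, dim)      # attn_out + mlp_out
        self.act = nn.GELU(approximate="tanh")

    def forward(self, x: Tensor) -> Tensor:
        B, L, dim = x.shape
        qkv, mlp = torch.split(self.linear1(x), [3 * dim, self.mlp_dim], dim=-1)
        q, k, v = qkv.view(B, L, 3, self.n_heads, -1).permute(2, 0, 3, 1, 4)
        attn = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).flatten(2)  # [B, L, dim]
        # Fuse the two parallel branches with a single output projection.
        return x + self.linear2(torch.cat([attn, self.act(mlp)], dim=-1))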

RoPE

flux/src/flux/math.py at main · black-forest-labs/flux

import torch
from einops import rearrange
from torch import Tensor

def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor) -> Tensor:
    # pe holds the precomputed rotary embeddings (cos/sin rotation matrices).
    q, k = apply_rope(q, k, pe)

    x = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    x = rearrange(x, "B H L D -> B L (H D)")  # merge heads back into the channel dim

    return x

def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor) -> tuple[Tensor, Tensor]:

    # xq: [1, 24, 4336, 128], bfloat16
    # xk: [1, 24, 4336, 128], bfloat16
    # freqs_cis: [1, 1, 4336, 64, 2, 2], float32

    # View head_dim as 64 channel pairs; the rotation is applied per pair in float32.
    xq_ = xq.float().reshape(*xq.shape[:-1], -1, 1, 2)
    xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)

    # xq_: [1, 24, 4336, 64, 1, 2], float32
    # xk_: [1, 24, 4336, 64, 1, 2], float32

    # 2x2 rotation per pair: out = R @ (x0, x1), with R = [[cos, -sin], [sin, cos]].
    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]

    # xq_out: [1, 24, 4336, 64, 2], float32
    # xk_out: [1, 24, 4336, 64, 2], float32

    # Flatten the pairs back to head_dim and restore the original dtype.
    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
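
For context, freqs_cis is a per-position stack of 2×2 rotation matrices, one per channel-pair frequency. The sketch below (modeled on the rope helper in the same math.py) produces exactly the [..., 64, 2, 2] layout consumed by apply_rope:

import torch
from einops import rearrange
from torch import Tensor

def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
    # pos: [B, n] positions; returns [B, n, dim//2, 2, 2] rotation matrices.
    assert dim % 2 == 0
    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
    omega = 1.0 / (theta**scale)                     # one frequency per channel pair
    out = torch.einsum("...n,d->...nd", pos, omega)  # angle = position * frequency
    # Stack [cos, -sin, sin, cos] and fold into 2x2 matrices R = [[cos, -sin], [sin, cos]].
    out = torch.stack([torch.cos(out), -torch.sin(out), torch.sin(out), torch.cos(out)], dim=-1)
    out = rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2)
    return out.float()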

Others

References

flux/src/flux/modules/layers.py at main · black-forest-labs/flux

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Scaling Vision Transformers to 22 Billion Parameters

Scalable Diffusion Models with Transformers

https://github.com/brayevalerien/Flux.1-Architecture-Diagram/blob/main/flux_architecture_diagram.png

https://www.reddit.com/r/StableDiffusion/comments/1fds59s/a_detailled_flux1_architecture_diagram/

https://blog.csdn.net/qq_62075214/article/details/142494784

SD

1.5 UNet


Stable Diffusion 1.5 network structure (super-detailed original article)

SDXL parameters (Detailed)

On the Scalability of Diffusion-based Text-to-Image Generation