FluxTransformer2DModel: 12B → 23.81GB (bfloat16)
AutoencoderKL (16ch): → 168MB
# [0] NVIDIA H100 80GB HBM3 | 35°C, 0 % | 695 / 81559 MB | taehoon(686M)
openai/clip-vit-large-patch14 (CLIPTextModel): → 246MB (bfloat16)
google/t5-v1_1-xxl (T5EncoderModel): → 9.52GB (bfloat16)
# [0] NVIDIA H100 80GB HBM3 | 35°C, 0 % | 9084 / 81559 MB | taehoon(10056M) -> bfloat16
# [0] NVIDIA H100 80GB HBM3 | 34°C, 0 % | 18693 / 81559 MB | root(18684M) -> float32
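These footprints are easy to reproduce by loading each component and summing parameter bytes. A minimal sketch, assuming the diffusers layout of the black-forest-labs/FLUX.1-schnell repository (the repo id and subfolder names are my assumption, not part of the measurements above):

import torch
from diffusers import AutoencoderKL, FluxTransformer2DModel

def param_size_gb(model: torch.nn.Module) -> float:
    # Parameter footprint in decimal GB: element count x bytes per element.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9

repo = "black-forest-labs/FLUX.1-schnell"  # assumed repo id
transformer = FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer", torch_dtype=torch.bfloat16)
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=torch.bfloat16)
print(f"transformer: {param_size_gb(transformer):.2f} GB")  # ~23.8 GB in bf16
print(f"vae: {param_size_gb(vae) * 1000:.0f} MB")           # ~168 MB in bf16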
At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It is capable of generating images ranging between 0.25 and 2 megapixel resolution.
For the Medium model specifically, we made several adjustments to the architecture and training protocols to enhance quality, coherence, and multi-resolution generation abilities.
SD3Transformer2DModel: 2.5B → 4.94GB
AutoencoderKL (16ch): → 168MB
FluxTransformer2DModel: 2B → 4.17GB
AutoencoderKL (16ch): → 168MB
google/t5-v1_1-xxl (T5EncoderModel):
  t5xxl_fp16.safetensors
  t5xxl_fp8_e4m3fn.safetensors
CLIPTextModel: clip-vit-large-patch14 variant (428M)
CLIPTextModelWithProjection: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant
name = flux-schnell
batch_size = 1
width = 1360
height = 768
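These defaults also explain the L = 4336 in the shape comments further down: with the assumed 8x VAE downsampling and 2x2 patchification, every 16x16 pixel block becomes one image token, and flux-schnell pads the T5 prompt to 256 tokens.

# Sequence length for a 1360x768 flux-schnell sample (8x VAE downsample and
# 2x2 patchify assumed, i.e. one token per 16x16 pixel block).
width, height = 1360, 768
image_tokens = (width // 16) * (height // 16)  # 85 * 48 = 4080
text_tokens = 256                              # T5 max_length used by schnell
print(image_tokens + text_tokens)              # 4336 -> the L in apply_rope below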
flux/src/flux/math.py at main · black-forest-labs/flux
import torch
from einops import rearrange
from torch import Tensor


def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor) -> Tensor:
    # Rotate Q and K by the positional embedding, then run standard SDPA.
    q, k = apply_rope(q, k, pe)
    x = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    # Merge heads back into the channel dim: [B, H, L, D] -> [B, L, H*D].
    x = rearrange(x, "B H L D -> B L (H D)")
    return x


def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor) -> tuple[Tensor, Tensor]:
    # xq: [1, 24, 4336, 128], bfloat16
    # xk: [1, 24, 4336, 128], bfloat16
    # freqs_cis: [1, 1, 4336, 64, 2, 2], float32
    xq_ = xq.float().reshape(*xq.shape[:-1], -1, 1, 2)
    xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
    # xq_: [1, 24, 4336, 64, 1, 2], float32
    # xk_: [1, 24, 4336, 64, 1, 2], float32
    # 2x2 rotation per channel pair: out_i = R[i,0]*x_0 + R[i,1]*x_1, where
    # R = [[cos, -sin], [sin, cos]] lives in the last two dims of freqs_cis.
    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
    # xq_out: [1, 24, 4336, 64, 2], float32
    # xk_out: [1, 24, 4336, 64, 2], float32
    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
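To sanity-check both functions, you can fabricate tensors with exactly the commented shapes. The rope() builder below mirrors the helper in the same math.py file, but collapsed to a single position axis for simplicity (the real model concatenates rotations over three id axes), so treat it as an illustrative sketch rather than the pipeline's exact positional embedding. It continues from the imports and functions above and runs on CPU:

def rope(pos: Tensor, dim: int, theta: int = 10_000) -> Tensor:
    # Per-position 2x2 rotation matrices, shape [B, L, dim/2, 2, 2].
    assert dim % 2 == 0
    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
    omega = 1.0 / (theta**scale)
    out = torch.einsum("...n,d->...nd", pos, omega)
    out = torch.stack([torch.cos(out), -torch.sin(out), torch.sin(out), torch.cos(out)], dim=-1)
    return rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2).float()

B, H, L, D = 1, 24, 4336, 128  # the shapes from the comments above
q, k, v = (torch.randn(B, H, L, D, dtype=torch.bfloat16) for _ in range(3))
pe = rope(torch.arange(L, dtype=torch.float64)[None], D).unsqueeze(1)  # [1, 1, 4336, 64, 2, 2]
print(attention(q, k, v, pe).shape)  # torch.Size([1, 4336, 3072]) = [B, L, H*D]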
flux/src/flux/modules/layers.py at main · black-forest-labs/flux
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Scaling Vision Transformers to 22 Billion Parameters
Scalable Diffusion Models with Transformers
https://github.com/brayevalerien/Flux.1-Architecture-Diagram/blob/main/flux_architecture_diagram.png
https://www.reddit.com/r/StableDiffusion/comments/1fds59s/a_detailled_flux1_architecture_diagram/
https://blog.csdn.net/qq_62075214/article/details/142494784
Stable Diffusion 1.5 network structure, super detailed (CSDN, in Chinese)
On the Scalability of Diffusion-based Text-to-Image Generation