Diffusion Models
This page explains how the diffusion model system is designed: the Template Method pattern, mixins, factory dispatch, warm caching, and device optimization.
Design Goals
The model system was designed around three constraints:
- One model at a time — GPU memory is limited, so only one diffusion model can be loaded at once. You don't need a mountain of H200s to run this framework; two of my workstations are a stock NVIDIA RTX 3080 and a Mac M4.
- Minimal per-model code — Adding a new model should take 10-20 lines, not 200.
- Configuration-driven behavior — Model quirks (turbo guidance overrides, token limits, 8-bit quantization) are controlled by flags in the data model (seeded with `presets.json`), not by branching code paths.
Template Method Pattern
All diffusion models inherit from BaseModel, which provides two template methods — concrete methods that define the overall algorithm and call abstract or optional hooks for customization:
```mermaid
classDiagram
    class BaseModel {
        <<abstract>>
        +load_pipeline() : Template Method
        +generate() : Template Method
        #_create_pipeline()* : Abstract Hook
        #_build_prompts() : Optional Hook
        #_build_pipeline_kwargs() : Optional Hook
        #_apply_device_optimizations() : Optional Hook
        #_handle_special_prompt_requirements() : Optional Hook
    }
    class CompelPromptMixin {
        #_build_prompts() override
        -_get_compel()
    }
    class CLIPTokenLimitMixin {
        #_build_prompts() override
        -_fit_prompt_to_token_limit()
    }
    class DebugLoggingMixin {
        #_debug_print()
    }
    class FluxModel {
        #_create_pipeline()
    }
    class SDXLModel {
        #_create_pipeline()
    }
    class SD15Model {
        #_create_pipeline()
    }
    class QwenImageModel {
        #_create_pipeline()
        #_build_pipeline_kwargs()
        #_handle_special_prompt_requirements()
    }
    BaseModel <|-- FluxModel
    BaseModel <|-- QwenImageModel
    CompelPromptMixin <|-- SDXLModel
    BaseModel <|-- SDXLModel
    CompelPromptMixin <|-- SD15Model
    BaseModel <|-- SD15Model
```
load_pipeline()
The loading template method runs three steps:

1. Device setup — detect MPS (Apple Silicon), CUDA, or CPU
2. Pipeline creation — calls `_create_pipeline()` (the one abstract method each model must implement)
3. Device optimizations — calls `_apply_device_optimizations()` for CPU offloading, attention slicing, and VAE fixes
generate()
The generation template method runs six steps:

1. Scheduler override — apply a per-job scheduler if specified
2. Parameter preparation — resolve defaults, apply LoRA overrides, handle the `force_default_guidance` flag
3. Prompt building — calls `_build_prompts()` to append LoRA trigger words and handle token limits
4. Pipeline kwargs — calls `_build_pipeline_kwargs()` for model-specific parameters
5. Image generation — calls the pipeline and returns the first image
6. Cleanup — clears the device memory cache and builds metadata
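The two template methods above can be sketched as a single base class with hooks. This is a simplified stand-in for illustration — `DummyPipeline`, the hook defaults, and the step bodies are invented, not the framework's actual code:

```python
# Simplified sketch of the Template Method pattern described above:
# BaseModel fixes the algorithm skeleton; subclasses fill in hooks.
# DummyPipeline and the step bodies are illustrative stand-ins.
from abc import ABC, abstractmethod


class BaseModel(ABC):
    def load_pipeline(self):
        # Template method: fixed loading sequence.
        self.device = self._detect_device()          # 1. device setup
        self.pipeline = self._create_pipeline()      # 2. abstract hook
        self._apply_device_optimizations()           # 3. optional hook
        return self.pipeline

    def generate(self, prompt, **overrides):
        # Template method: fixed generation sequence.
        params = self._prepare_params(overrides)
        prompt = self._build_prompts(prompt)         # optional hook
        kwargs = self._build_pipeline_kwargs(prompt, params)
        image = self.pipeline(**kwargs)
        self._cleanup()
        return image

    @abstractmethod
    def _create_pipeline(self):
        """The one hook every concrete model must implement."""

    # Optional hooks with default behavior:
    def _detect_device(self): return "cpu"
    def _apply_device_optimizations(self): pass
    def _prepare_params(self, overrides): return overrides
    def _build_prompts(self, prompt): return prompt
    def _build_pipeline_kwargs(self, prompt, params):
        return {"prompt": prompt, **params}
    def _cleanup(self): pass


class DummyPipeline:
    def __call__(self, **kwargs):
        return f"image for {kwargs['prompt']}"


class DummyModel(BaseModel):
    def _create_pipeline(self):
        return DummyPipeline()
```

The point of the pattern is that `DummyModel` inherits the entire loading and generation flow and only supplies `_create_pipeline()`.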
Hook Methods
Concrete models customize behavior by overriding these hooks:
| Hook | Purpose |
|---|---|
| `_create_pipeline()` | Required. Return the loaded pipeline instance. |
| `_build_prompts()` | Customize prompt processing. Default appends the LoRA suffix. Overridden by `CompelPromptMixin` and `CLIPTokenLimitMixin`. |
| `_build_pipeline_kwargs()` | Add model-specific kwargs. Default builds standard kwargs. Overridden by `QwenImageModel`. |
| `_apply_device_optimizations()` | Custom device setup. Default handles the MPS VAE float32 fix, CPU offloading, and attention slicing. |
| `_handle_special_prompt_requirements()` | Model-specific prompt quirks. Overridden by `QwenImageModel`. |
Mixins
Shared behaviors are composed via multiple inheritance. Mixins must be listed before BaseModel in the class definition to properly override hooks via Python’s MRO (Method Resolution Order).
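A minimal illustration of why mixin order matters — the class names here are stand-ins, not the framework's actual classes:

```python
# Why mixin order matters: Python's MRO searches bases left to right,
# so a mixin listed before the base class overrides its hooks.
class Base:
    def _build_prompts(self, prompt):
        return prompt + " <lora-trigger>"

class PromptMixin:
    def _build_prompts(self, prompt):
        return f"[weighted] {prompt}"

class Correct(PromptMixin, Base):   # mixin first: its override wins
    pass

class Wrong(Base, PromptMixin):     # base first: mixin is shadowed
    pass

print(Correct.__mro__)  # Correct -> PromptMixin -> Base -> object
print(Correct()._build_prompts("a castle"))  # "[weighted] a castle"
print(Wrong()._build_prompts("a castle"))    # "a castle <lora-trigger>"
```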
CompelPromptMixin
Used by SDXL and SD15 models for advanced prompt handling via the Compel library:
- Long prompts — breaks prompts longer than 77 tokens into chunks and concatenates the embeddings, removing the CLIP token limit
- Prompt weighting — `(dramatic lighting:1.5)` syntax to emphasize or de-emphasize concepts
- LoRA integration — trigger words appended without truncation concerns
- Dual text encoder support — automatically detects and handles SDXL's two text encoders
The mixin overrides _build_prompts() to convert text into pre-computed embeddings (prompt_embeds), which are passed directly to the pipeline instead of raw text.
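The chunk-and-concatenate idea can be illustrated with plain lists standing in for token IDs and embeddings. Compel's real implementation operates on encoder tensors; everything below is a toy sketch of the strategy, not the library's API:

```python
# Toy sketch of the chunk-and-concatenate strategy: split the token
# sequence into 77-token windows, "encode" each window, and concatenate
# the results so that no tokens are dropped.
CHUNK = 77

def encode_chunk(tokens):
    # Stand-in for a CLIP text encoder call; returns one "embedding"
    # per token (here just the token itself, tagged).
    return [("emb", t) for t in tokens]

def encode_long_prompt(tokens):
    embeddings = []
    for i in range(0, len(tokens), CHUNK):
        embeddings.extend(encode_chunk(tokens[i:i + CHUNK]))
    return embeddings

tokens = list(range(200))            # a 200-token prompt
embeds = encode_long_prompt(tokens)
assert len(embeds) == 200            # nothing truncated at 77
```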
CLIPTokenLimitMixin (Legacy)
The original approach to CLIP’s 77-token limit. Truncates prompts to fit within the token budget while prioritizing LoRA trigger words. Superseded by CompelPromptMixin for new models, but still available.
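The legacy truncation strategy can be sketched as follows. Word-level splitting stands in for real CLIP tokenization, and the function name is invented for the example:

```python
# Sketch of the legacy strategy: truncate the prompt to the token budget
# while guaranteeing LoRA trigger words survive. Words stand in for CLIP
# tokens here; real tokenization differs.
def fit_prompt_to_token_limit(prompt, trigger_words, limit=77):
    triggers = trigger_words.split()
    budget = limit - len(triggers)     # reserve room for the triggers
    kept = prompt.split()[:budget]     # truncate the user prompt
    return " ".join(kept + triggers)

prompt = " ".join(f"word{i}" for i in range(100))
result = fit_prompt_to_token_limit(prompt, "myLoraStyle")
assert len(result.split()) == 77
assert result.endswith("myLoraStyle")
```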
DebugLoggingMixin
Provides a _debug_print() method gated behind the enable_debug_logging configuration flag. Used by turbo models for diagnostic output during development.
Concrete Models
Each concrete model is a thin subclass — typically 20–70 lines — that only overrides what differs from the base:
| Model | Pipeline | Parents | Lines | Key Overrides |
|---|---|---|---|---|
| ZImageTurboModel | | DebugLogging + Base | ~108 | Custom device optimizations, step callback, LoRA VAE fixes |
| FluxModel | `FluxPipeline` | Base only | ~22 | `_create_pipeline()` only |
| QwenImageModel | | Base only | ~87 | 8-bit quantization, runtime parameter detection |
| SDXLModel | | CLIPTokenLimit + Base | ~52 | Token-limited prompts, negative prompt support, LoRA VAE fixes |
| SDXLTurboModel | | CLIPTokenLimit + DebugLogging + Base | ~99 | Token limit with debug logging, custom MPS optimizations |
| SD15Model | | CLIPTokenLimit + Base | ~51 | Token-limited prompts, negative prompt support, LoRA VAE fixes |
A minimal model implementation looks like this:
```python
from diffusers import FluxPipeline

from .base import BaseModel


class FluxModel(BaseModel):
    def _create_pipeline(self):
        return FluxPipeline.from_pretrained(
            self.model_path,
            torch_dtype=self.dtype,
        )
```
Everything else — device setup, generation loop, LoRA management, metadata building, cache clearing — is inherited from BaseModel.
ModelFactory
ModelFactory.create_model() dispatches on the pipeline field from presets.json:
| Pipeline Name | Model Class |
|---|---|
| | `ZImageTurboModel` |
| | `FluxModel` |
| | `QwenImageModel` |
| | `SDXLModel` |
| | `SDXLTurboModel` |
| | `SD15Model` |
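A factory of this shape typically reduces to a registry keyed by pipeline name. In the sketch below the model classes are empty stand-ins and the registry keys are hypothetical — the real mapping lives in `ModelFactory.create_model()` and is driven by `presets.json`:

```python
# Sketch of name-keyed factory dispatch: look up the model class for a
# `pipeline` value from presets.json. The dummy classes and registry
# keys below are illustrative, not the framework's actual values.
class FluxModel: ...
class SDXLModel: ...

class ModelFactory:
    _registry = {
        "flux": FluxModel,   # hypothetical key
        "sdxl": SDXLModel,   # hypothetical key
    }

    @classmethod
    def create_model(cls, pipeline_name, **kwargs):
        try:
            model_cls = cls._registry[pipeline_name]
        except KeyError:
            raise ValueError(f"Unknown pipeline: {pipeline_name!r}")
        return model_cls(**kwargs)

model = ModelFactory.create_model("sdxl")
assert isinstance(model, SDXLModel)
```

A dictionary registry keeps dispatch data-driven: adding a model means adding one entry, not another `if/elif` branch.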
Configuration Flags
Model behavior is driven by flags in data/presets.json under each model’s settings block:
| Flag | Effect |
|---|---|
| `force_default_guidance` | Always use the model's default guidance scale. |
| | Maximum text encoder context length. |
| | Enable 8-bit quantization for reduced VRAM usage. Supported by QwenImageModel. |
| | Use `enable_sequential_cpu_offload()` instead of the default model CPU offload. |
| | Process VAE decode in slices for lower peak memory usage. |
| `enable_debug_logging` | Enable verbose debug output via `_debug_print()`. |
| | Whether the model accepts negative prompts. Juggernaut XL, Realistic Vision, and Qwen support this. |
| | Default scheduler class name. |
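For illustration, a model entry using the two flag names documented on this page might look like the following sketch. The surrounding structure (the `slug` key and nesting) is assumed for the example, not taken from the actual `presets.json`:

```json
{
  "slug": "example-model",
  "settings": {
    "force_default_guidance": true,
    "enable_debug_logging": false
  }
}
```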
Warm Cache
The Celery worker maintains a module-level model cache — a dictionary that keeps the most recently used model loaded in memory:
```python
_model_cache = {}  # {slug: model_instance}
```
This is safe because Celery is configured with the solo pool (single-threaded worker). The cache works as follows:
1. A generation task calls `_load_model_instance(diffusion_model)`
2. If the requested model slug matches the cached model and its pipeline is loaded, the cached instance is returned immediately (warm hit)
3. If a different model is cached, it is evicted — its pipeline is deleted and the device cache is cleared (`torch.mps.empty_cache()` or `torch.cuda.empty_cache()`)
4. The new model is instantiated via `ModelFactory`, its pipeline is loaded, and it replaces the evicted model in the cache
This ensures:
- No redundant loading — consecutive jobs using the same model skip pipeline loading entirely
- One model at a time — only one diffusion model occupies GPU memory at any given moment
- Clean transitions — eviction explicitly frees device memory before loading the next model
Before loading any diffusion model, the worker also calls _evict_enhancer() to free VRAM occupied by the prompt enhancement LLM or the adaptation pipeline’s LLM, preventing out-of-memory errors.
Device Optimization
BaseModel._apply_device_optimizations() handles three device targets:
- MPS (Apple Silicon)
  - VAE converted to `float32` (prevents NaN values in bfloat16 VAE on MPS)
  - Entire pipeline moved to the MPS device
  - Attention slicing enabled
  - VAE slicing and tiling enabled for memory efficiency
- CUDA
  - `enable_model_cpu_offload()` by default (or `enable_sequential_cpu_offload()` if the flag is set)
  - Attention slicing enabled
- CPU
  - Pipeline moved to CPU (fallback, no optimizations)
Individual models can override _apply_device_optimizations() for model-specific needs. For example, ZImageTurboModel adds sequential CPU offload and VAE slicing on CUDA, while SDXLTurboModel has a custom MPS optimization path.
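The per-device branching can be sketched as follows. `RecordingPipeline` is a test double, and `vae_to_float32` is an invented helper standing in for converting the VAE's dtype; a real diffusers pipeline exposes the `enable_*` helper methods by these names:

```python
# Sketch of the per-device optimization branches described above.
# RecordingPipeline is a test double; vae_to_float32 is an invented
# stand-in for casting the VAE to float32 on MPS.
def apply_device_optimizations(pipeline, device, sequential_offload=False):
    if device == "mps":
        pipeline.vae_to_float32()          # avoid NaNs in bf16 VAE on MPS
        pipeline.to("mps")
        pipeline.enable_attention_slicing()
        pipeline.enable_vae_slicing()
        pipeline.enable_vae_tiling()
    elif device == "cuda":
        if sequential_offload:
            pipeline.enable_sequential_cpu_offload()
        else:
            pipeline.enable_model_cpu_offload()
        pipeline.enable_attention_slicing()
    else:
        pipeline.to("cpu")                 # fallback, no optimizations

class RecordingPipeline:
    """Test double that records which optimization methods were called."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        return lambda *a, **k: self.calls.append(name)

p = RecordingPipeline()
apply_device_optimizations(p, "cuda")
assert p.calls == ["enable_model_cpu_offload", "enable_attention_slicing"]
```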
Model Comparison
| Model | Steps | CFG | Resolution | Architecture | Best For |
|---|---|---|---|---|---|
| Z-Image Turbo | 9 | 0.0 | 1024 | zimage | Fast iteration, drafts |
| Flux.1-dev | 28 | 3.5 | 1024 | flux1 | High quality, presentation |
| Qwen-Image | 50 | 4.5 | 1328 | qwen | Photorealism, fine detail |
| SDXL Turbo | 4 | 0.0 | 512 | sdxl | Fastest, quick tests |
| Juggernaut XL v9 | 30 | 7.0 | 1024 | sdxl | Photorealism, negative prompts |
| DreamShaper XL | 4 | 2.0 | 1024 | sdxl | Fast creative, artistic |
| Realistic Vision v5.1 | 30 | 5.0 | 512x768 | sd15 | Portraits, low VRAM |
Adding a New Model
For most models, this takes four steps:
1. Create the model file — inherit from `BaseModel` and override `_create_pipeline()`
2. Register in the factory — add it to `ModelFactory.create_model()` in `src/cw/lib/models/__init__.py`
3. Add to presets — add a model entry in `data/presets.json` with the appropriate settings flags
4. Sync to database — run `uv run manage.py import_presets`
If the model uses CLIP-based text encoding (SDXL/SD15 family), inherit from CompelPromptMixin for long prompt support and prompt weighting. If the model has unusual requirements (runtime parameter detection, special prompt handling), override the relevant hook methods.