Diffusion Models
This page explains how the diffusion model system is designed: the Template Method pattern, mixins, factory dispatch, warm caching, and device optimization.
Design Goals
The model system was designed around three constraints:
- One model at a time — GPU memory is limited, so only one diffusion model can be loaded at once. You don't need a mountain of H200s to run this framework; two of my workstations are a stock NVIDIA RTX 3080 and a Mac M4.
- Minimal per-model code — Adding a new model should take 10-20 lines, not 200.
- Configuration-driven behavior — Model quirks (turbo guidance overrides, token limits, 8-bit quantization) are controlled by flags in the data model (seeded with `presets.json`), not by branching code paths.
Template Method Pattern
All diffusion models inherit from BaseModel, which provides two template methods — concrete methods that define the overall algorithm and call abstract or optional hooks for customization:
```mermaid
classDiagram
    class BaseModel {
        <<abstract>>
        +load_pipeline() : Template Method
        +generate() : Template Method
        #_create_pipeline()* : Abstract Hook
        #_build_prompts() : Optional Hook
        #_build_pipeline_kwargs() : Optional Hook
        #_apply_device_optimizations() : Optional Hook
        #_handle_special_prompt_requirements() : Optional Hook
    }
    class CompelPromptMixin {
        #_build_prompts() override
        -_get_compel()
    }
    class CLIPTokenLimitMixin {
        #_build_prompts() override
        -_fit_prompt_to_token_limit()
    }
    class DebugLoggingMixin {
        #_debug_print()
    }
    class FluxModel {
        #_create_pipeline()
    }
    class SDXLModel {
        #_create_pipeline()
    }
    class SD15Model {
        #_create_pipeline()
    }
    class QwenImageModel {
        #_create_pipeline()
        #_build_pipeline_kwargs()
        #_handle_special_prompt_requirements()
    }
    BaseModel <|-- FluxModel
    BaseModel <|-- QwenImageModel
    CompelPromptMixin <|-- SDXLModel
    BaseModel <|-- SDXLModel
    CompelPromptMixin <|-- SD15Model
    BaseModel <|-- SD15Model
```
load_pipeline()
The loading template method runs three steps:

1. Device setup — detect MPS (Apple Silicon), CUDA, or CPU
2. Pipeline creation — calls `_create_pipeline()` (the one abstract method each model must implement)
3. Device optimizations — calls `_apply_device_optimizations()` for CPU offloading, attention slicing, and VAE fixes
generate()
The generation template method runs six steps:

1. Scheduler override — apply a per-job scheduler if specified
2. Parameter preparation — resolve defaults, apply LoRA overrides, handle the `force_default_guidance` flag
3. Prompt building — calls `_build_prompts()` to append LoRA trigger words and handle token limits
4. Pipeline kwargs — calls `_build_pipeline_kwargs()` for model-specific parameters
5. Image generation — calls the pipeline and returns the first image
6. Cleanup — clears the device memory cache and builds metadata
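The two template methods above can be sketched as a single base class with hooks. This is a simplified stand-in for illustration — `DummyPipeline`, the hook defaults, and the step bodies are invented, not the framework's actual code:

```python
# Simplified sketch of the Template Method pattern described above:
# BaseModel fixes the algorithm skeleton; subclasses fill in hooks.
# DummyPipeline and the step bodies are illustrative stand-ins.
from abc import ABC, abstractmethod


class BaseModel(ABC):
    def load_pipeline(self):
        # Template method: fixed loading sequence.
        self.device = self._detect_device()          # 1. device setup
        self.pipeline = self._create_pipeline()      # 2. abstract hook
        self._apply_device_optimizations()           # 3. optional hook
        return self.pipeline

    def generate(self, prompt, **overrides):
        # Template method: fixed generation sequence.
        params = self._prepare_params(overrides)
        prompt = self._build_prompts(prompt)         # optional hook
        kwargs = self._build_pipeline_kwargs(prompt, params)
        image = self.pipeline(**kwargs)
        self._cleanup()
        return image

    @abstractmethod
    def _create_pipeline(self):
        """The one hook every concrete model must implement."""

    # Optional hooks with default behavior:
    def _detect_device(self): return "cpu"
    def _apply_device_optimizations(self): pass
    def _prepare_params(self, overrides): return overrides
    def _build_prompts(self, prompt): return prompt
    def _build_pipeline_kwargs(self, prompt, params):
        return {"prompt": prompt, **params}
    def _cleanup(self): pass


class DummyPipeline:
    def __call__(self, **kwargs):
        return f"image for {kwargs['prompt']}"


class DummyModel(BaseModel):
    def _create_pipeline(self):
        return DummyPipeline()
```

The point of the pattern is that `DummyModel` inherits the entire loading and generation flow and only supplies `_create_pipeline()`.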
Hook Methods
Concrete models customize behavior by overriding these hooks:
| Hook | Purpose |
|---|---|
| `_create_pipeline()` | Required. Return the loaded pipeline instance. |
| `_build_prompts()` | Customize prompt processing. Default appends the LoRA suffix. Overridden by `CompelPromptMixin` and `CLIPTokenLimitMixin`. |
| `_build_pipeline_kwargs()` | Add model-specific kwargs. Default builds standard kwargs. Overridden by `QwenImageModel`. |
| `_apply_device_optimizations()` | Custom device setup. Default handles the MPS VAE float32 fix, CPU offloading, and attention slicing. |
| `_handle_special_prompt_requirements()` | Model-specific prompt quirks. Overridden by `QwenImageModel`. |
Mixins
Shared behaviors are composed via multiple inheritance. Mixins must be listed before BaseModel in the class definition to properly override hooks via Python’s MRO (Method Resolution Order).
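A minimal illustration of why mixin order matters — the class names here are stand-ins, not the framework's actual classes:

```python
# Why mixin order matters: Python's MRO searches bases left to right,
# so a mixin listed before the base class overrides its hooks.
class Base:
    def _build_prompts(self, prompt):
        return prompt + " <lora-trigger>"

class PromptMixin:
    def _build_prompts(self, prompt):
        return f"[weighted] {prompt}"

class Correct(PromptMixin, Base):   # mixin first: its override wins
    pass

class Wrong(Base, PromptMixin):     # base first: mixin is shadowed
    pass

print(Correct.__mro__)  # Correct -> PromptMixin -> Base -> object
print(Correct()._build_prompts("a castle"))  # "[weighted] a castle"
print(Wrong()._build_prompts("a castle"))    # "a castle <lora-trigger>"
```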
CompelPromptMixin
Used by SDXL and SD15 models for advanced prompt handling via the Compel library:
- Long prompts — breaks prompts longer than 77 tokens into chunks and concatenates the embeddings, removing the CLIP token limit
- Prompt weighting — `(dramatic lighting:1.5)` syntax to emphasize or de-emphasize concepts
- LoRA integration — trigger words appended without truncation concerns
- Dual text encoder support — automatically detects and handles SDXL's two text encoders
The mixin overrides _build_prompts() to convert text into pre-computed embeddings (prompt_embeds), which are passed directly to the pipeline instead of raw text.
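The chunk-and-concatenate idea can be illustrated with plain lists standing in for token IDs and embeddings. Compel's real implementation operates on encoder tensors; everything below is a toy sketch of the strategy, not the library's API:

```python
# Toy sketch of the chunk-and-concatenate strategy: split the token
# sequence into 77-token windows, "encode" each window, and concatenate
# the results so that no tokens are dropped.
CHUNK = 77

def encode_chunk(tokens):
    # Stand-in for a CLIP text encoder call; returns one "embedding"
    # per token (here just the token itself, tagged).
    return [("emb", t) for t in tokens]

def encode_long_prompt(tokens):
    embeddings = []
    for i in range(0, len(tokens), CHUNK):
        embeddings.extend(encode_chunk(tokens[i:i + CHUNK]))
    return embeddings

tokens = list(range(200))            # a 200-token prompt
embeds = encode_long_prompt(tokens)
assert len(embeds) == 200            # nothing truncated at 77
```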
CLIPTokenLimitMixin (Legacy)
The original approach to CLIP’s 77-token limit. Truncates prompts to fit within the token budget while prioritizing LoRA trigger words. Superseded by CompelPromptMixin for new models, but still available.
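The legacy truncation strategy can be sketched as follows. Word-level splitting stands in for real CLIP tokenization, and the function name is invented for the example:

```python
# Sketch of the legacy strategy: truncate the prompt to the token budget
# while guaranteeing LoRA trigger words survive. Words stand in for CLIP
# tokens here; real tokenization differs.
def fit_prompt_to_token_limit(prompt, trigger_words, limit=77):
    triggers = trigger_words.split()
    budget = limit - len(triggers)     # reserve room for the triggers
    kept = prompt.split()[:budget]     # truncate the user prompt
    return " ".join(kept + triggers)

prompt = " ".join(f"word{i}" for i in range(100))
result = fit_prompt_to_token_limit(prompt, "myLoraStyle")
assert len(result.split()) == 77
assert result.endswith("myLoraStyle")
```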
DebugLoggingMixin
Provides a _debug_print() method gated behind the enable_debug_logging configuration flag. Used by turbo models for diagnostic output during development.
Concrete Models
Each concrete model is a thin subclass — typically 20–70 lines — that only overrides what differs from the base:
| Model | Pipeline | Parents | Lines | Key Overrides |
|---|---|---|---|---|
| ZImageTurboModel | | DebugLogging + Base | ~108 | Custom device optimizations, step callback, LoRA VAE fixes |
| FluxModel | `FluxPipeline` | Base only | ~22 | `_create_pipeline()` only |
| QwenImageModel | | Base only | ~87 | 8-bit quantization, runtime parameter detection |
| SDXLModel | | CLIPTokenLimit + Base | ~52 | Token-limited prompts, negative prompt support, LoRA VAE fixes |
| SDXLTurboModel | | CLIPTokenLimit + DebugLogging + Base | ~99 | Token limit with debug logging, custom MPS optimizations |
| SD15Model | | CLIPTokenLimit + Base | ~51 | Token-limited prompts, negative prompt support, LoRA VAE fixes |
A minimal model implementation looks like this:
```python
from diffusers import FluxPipeline

from .base import BaseModel


class FluxModel(BaseModel):
    def _create_pipeline(self):
        return FluxPipeline.from_pretrained(
            self.model_path,
            torch_dtype=self.dtype,
        )
```
Everything else — device setup, generation loop, LoRA management, metadata building, cache clearing — is inherited from BaseModel.
ModelFactory
ModelFactory.create_model() dispatches on the pipeline field from presets.json:
| Pipeline Name | Model Class |
|---|---|
| | `ZImageTurboModel` |
| | `FluxModel` |
| | `QwenImageModel` |
| | `SDXLModel` |
| | `SDXLTurboModel` |
| | `SD15Model` |
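A factory of this shape typically reduces to a registry keyed by pipeline name. In the sketch below the model classes are empty stand-ins and the registry keys are hypothetical — the real mapping lives in `ModelFactory.create_model()` and is driven by `presets.json`:

```python
# Sketch of name-keyed factory dispatch: look up the model class for a
# `pipeline` value from presets.json. The dummy classes and registry
# keys below are illustrative, not the framework's actual values.
class FluxModel: ...
class SDXLModel: ...

class ModelFactory:
    _registry = {
        "flux": FluxModel,   # hypothetical key
        "sdxl": SDXLModel,   # hypothetical key
    }

    @classmethod
    def create_model(cls, pipeline_name, **kwargs):
        try:
            model_cls = cls._registry[pipeline_name]
        except KeyError:
            raise ValueError(f"Unknown pipeline: {pipeline_name!r}")
        return model_cls(**kwargs)

model = ModelFactory.create_model("sdxl")
assert isinstance(model, SDXLModel)
```

A dictionary registry keeps dispatch data-driven: adding a model means adding one entry, not another `if/elif` branch.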
Configuration Flags
Model behavior is driven by flags in data/presets.json under each model’s settings block:
| Flag | Effect |
|---|---|
| `force_default_guidance` | Always use the model's default guidance scale. |
| | Maximum text encoder context length. |
| | Enable 8-bit quantization for reduced VRAM usage. Supported by QwenImageModel. |
| | Use `enable_sequential_cpu_offload()` instead of the default model CPU offload. |
| | Process VAE decode in slices for lower peak memory usage. |
| `enable_debug_logging` | Enable verbose debug output via `_debug_print()`. |
| | Whether the model accepts negative prompts. Juggernaut XL, Realistic Vision, and Qwen support this. |
| | Default scheduler class name. |
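For illustration, a model entry using the two flag names documented on this page might look like the following sketch. The surrounding structure (the `slug` key and nesting) is assumed for the example, not taken from the actual `presets.json`:

```json
{
  "slug": "example-model",
  "settings": {
    "force_default_guidance": true,
    "enable_debug_logging": false
  }
}
```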
Warm Cache
The Celery worker maintains a module-level model cache — a dictionary that keeps the most recently used model loaded in memory:
```python
_model_cache = {}  # {slug: model_instance}
```
This is safe because Celery is configured with the solo pool (single-threaded worker). The cache works as follows:
1. A generation task calls `_load_model_instance(diffusion_model)`
2. If the requested model slug matches the cached model and its pipeline is loaded, the cached instance is returned immediately (warm hit)
3. If a different model is cached, it is evicted — its pipeline is deleted and the device cache is cleared (`torch.mps.empty_cache()` or `torch.cuda.empty_cache()`)
4. The new model is instantiated via `ModelFactory`, its pipeline is loaded, and it replaces the evicted model in the cache
This ensures:
- No redundant loading — consecutive jobs using the same model skip pipeline loading entirely
- One model at a time — only one diffusion model occupies GPU memory at any given moment
- Clean transitions — eviction explicitly frees device memory before loading the next model
Before loading any diffusion model, the worker also calls _evict_enhancer() to free VRAM occupied by the prompt enhancement LLM or the adaptation pipeline’s LLM, preventing out-of-memory errors.
Device Optimization
BaseModel._apply_device_optimizations() handles three device targets:
- MPS (Apple Silicon)
  - VAE converted to `float32` (prevents NaN values in bfloat16 VAE on MPS)
  - Entire pipeline moved to the MPS device
  - Attention slicing enabled
  - VAE slicing and tiling enabled for memory efficiency
- CUDA
  - `enable_model_cpu_offload()` by default (or `enable_sequential_cpu_offload()` if the flag is set)
  - Attention slicing enabled
- CPU
  - Pipeline moved to CPU (fallback, no optimizations)
Individual models can override _apply_device_optimizations() for model-specific needs. For example, ZImageTurboModel adds sequential CPU offload and VAE slicing on CUDA, while SDXLTurboModel has a custom MPS optimization path.
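The per-device branching can be sketched as follows. `RecordingPipeline` is a test double, and `vae_to_float32` is an invented helper standing in for converting the VAE's dtype; a real diffusers pipeline exposes the `enable_*` helper methods by these names:

```python
# Sketch of the per-device optimization branches described above.
# RecordingPipeline is a test double; vae_to_float32 is an invented
# stand-in for casting the VAE to float32 on MPS.
def apply_device_optimizations(pipeline, device, sequential_offload=False):
    if device == "mps":
        pipeline.vae_to_float32()          # avoid NaNs in bf16 VAE on MPS
        pipeline.to("mps")
        pipeline.enable_attention_slicing()
        pipeline.enable_vae_slicing()
        pipeline.enable_vae_tiling()
    elif device == "cuda":
        if sequential_offload:
            pipeline.enable_sequential_cpu_offload()
        else:
            pipeline.enable_model_cpu_offload()
        pipeline.enable_attention_slicing()
    else:
        pipeline.to("cpu")                 # fallback, no optimizations

class RecordingPipeline:
    """Test double that records which optimization methods were called."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        return lambda *a, **k: self.calls.append(name)

p = RecordingPipeline()
apply_device_optimizations(p, "cuda")
assert p.calls == ["enable_model_cpu_offload", "enable_attention_slicing"]
```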
Model Comparison
| Model | Steps | CFG | Resolution | Architecture | Best For |
|---|---|---|---|---|---|
| Z-Image Turbo | 9 | 0.0 | 1024 | zimage | Fast iteration, drafts |
| Flux.1-dev | 28 | 3.5 | 1024 | flux1 | High quality, presentation |
| Qwen-Image | 50 | 4.5 | 1328 | qwen | Photorealism, fine detail |
| SDXL Turbo | 4 | 0.0 | 512 | sdxl | Fastest, quick tests |
| Juggernaut XL v9 | 30 | 7.0 | 1024 | sdxl | Photorealism, negative prompts |
| DreamShaper XL | 4 | 2.0 | 1024 | sdxl | Fast creative, artistic |
| Realistic Vision v5.1 | 30 | 5.0 | 512x768 | sd15 | Portraits, low VRAM |
Adding a New Model
For most models, this takes four steps:
1. Create the model file — inherit from `BaseModel` and override `_create_pipeline()`
2. Register in the factory — add it to `ModelFactory.create_model()` in `src/cw/lib/models/__init__.py`
3. Add to presets — add a model entry in `data/presets.json` with the appropriate settings flags
4. Sync to database — run `uv run manage.py import_presets`
If the model uses CLIP-based text encoding (SDXL/SD15 family), inherit from CompelPromptMixin for long prompt support and prompt weighting. If the model has unusual requirements (runtime parameter detection, special prompt handling), override the relevant hook methods.