Research Notes
Background research and reference materials that informed design decisions.
Multilingual LLM Model Selection
The adaptation pipeline requires LLMs that can generate culturally appropriate text in the target language. Model selection is driven by language coverage, structured generation support, and memory footprint.
Evaluated Models
| Model | Parameters | Notes |
|---|---|---|
| Qwen 2.5 7B Instruct | 7B | Strong multilingual (29+ languages), best for CJK. Default primary model. |
| Qwen 2.5 3B Instruct | 3B | Lighter variant for prompt enhancement. Used by HFPromptEnhancer. |
| CohereForAI Aya Expanse 8B | 8B | 23 languages, strong for African and South Asian languages. |
| Mistral Nemo Instruct 12B | 12B | European language specialist, good for Romance/Germanic. |
| Google Gemma-2 9B | 9B | Broad coverage, good at instruction following. |
| Meta Llama 3.1 8B | 8B | Strong English, adequate multilingual. Alternative model. |
Selection Criteria
Language coverage — must handle the target language well, especially for non-Latin scripts
Structured generation — must work with Outlines library for Pydantic schema constraints
4-bit quantization — must run in ~6 GB VRAM when quantized
Chat template — must have a proper chat template for system/user message formatting
The primary_model field on each Language record maps to the best LLM for that language. Alternative models are listed for fallback.
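The mapping can be sketched as a small lookup that returns the primary model first and fallbacks after. The record layout and Hugging Face repo IDs below are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of the per-language model mapping. Language codes,
# repo IDs, and the dict layout are assumptions for illustration only.
LANGUAGE_MODELS = {
    # language code: (primary_model, alternative models)
    "ja": ("Qwen/Qwen2.5-7B-Instruct", ["meta-llama/Llama-3.1-8B-Instruct"]),
    "sw": ("CohereForAI/aya-expanse-8b", ["Qwen/Qwen2.5-7B-Instruct"]),
    "fr": ("mistralai/Mistral-Nemo-Instruct-2407", ["google/gemma-2-9b-it"]),
}

def models_for(language_code: str) -> list[str]:
    """Return the primary model first, then the fallback alternatives."""
    primary, alternatives = LANGUAGE_MODELS[language_code]
    return [primary, *alternatives]
```

Keeping the fallback list per language (rather than a single global fallback) lets a weaker-but-specialized model take over when the primary fails for a particular script.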
Adaptation Pipeline Design
The 7-node pipeline was designed to separate concerns and enable independent evaluation of each quality dimension.
Why Separate Evaluation Gates
Early prototypes used a single LLM call for adaptation + evaluation. This failed because:
A single prompt could not reliably check format compliance, cultural sensitivity, concept fidelity, and brand consistency simultaneously
Failures were hard to diagnose — which dimension caused the rejection?
Retry logic could not target specific issues
The current design uses four specialized evaluation nodes, each with its own prompt template and Pydantic schema. When an evaluation fails, the specific failure reason is recorded and the writer node receives targeted feedback for revision.
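The per-gate result and feedback collection can be sketched as follows, using stdlib dataclasses as a stand-in for the project's actual Pydantic models (field names are illustrative):

```python
# Sketch of one evaluation gate's result record and feedback dispatch.
# A dataclass stands in for the real Pydantic schema; names are assumed.
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    dimension: str            # e.g. "format", "cultural", "fidelity", "brand"
    failure_reason: str = ""  # recorded, then fed back to the writer node

def collect_feedback(results: list[EvalResult]) -> list[str]:
    """Gather targeted feedback from every failed gate for the revision prompt."""
    return [f"{r.dimension}: {r.failure_reason}" for r in results if not r.passed]
```

Because each gate reports its own dimension and reason, a rejection is immediately attributable, which is exactly what the single-prompt prototype could not provide.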
Retry Budget
Each evaluation gate allows up to 3 retries. This budget was chosen empirically:
1 retry catches most formatting issues (wrong language in descriptions, missing translations)
2 retries handle most cultural adjustments
3 retries is a reasonable maximum before the pipeline should stop and surface the issue for human review
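The budget above can be sketched as a simple loop: one initial attempt plus up to three feedback-driven retries, then surfacing the failure. The function signatures are illustrative, not the pipeline's actual interfaces:

```python
# Sketch of a retry-budgeted evaluation gate. `generate` takes optional
# feedback and returns a draft; `evaluate` returns (passed, reason).
MAX_RETRIES = 3

def run_gate(generate, evaluate, max_retries=MAX_RETRIES):
    """Retry generation with targeted feedback until the gate passes,
    then stop and surface the issue for human review."""
    feedback = None
    for _ in range(max_retries + 1):  # initial attempt + retries
        draft = generate(feedback)
        passed, reason = evaluate(draft)
        if passed:
            return draft
        feedback = reason  # targeted feedback for the next revision
    raise RuntimeError(f"gate failed after {max_retries} retries: {feedback}")
```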
Structured Generation
The pipeline uses the Outlines library to constrain LLM output to Pydantic schemas. This eliminates JSON parsing errors and ensures each node produces well-formed output.
Trade-off: structured generation is slower than free-form generation and requires models that handle constrained decoding well. Larger models (7B+) are significantly more reliable than 3B models for this purpose.
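The guarantee can be sketched with a hand-written JSON Schema standing in for the Pydantic model; the Outlines call itself is shown only in a comment (its exact API varies by version, and it needs a loaded model), so treat it as an assumption:

```python
# Sketch of the schema-constrained generation step. The real pipeline
# uses Outlines with a Pydantic model; a hand-written JSON Schema stands
# in here, and the Outlines call is commented out as an assumption.
import json

EVAL_SCHEMA = {
    "type": "object",
    "properties": {
        "passed": {"type": "boolean"},
        "failure_reason": {"type": "string"},
    },
    "required": ["passed", "failure_reason"],
}

# With Outlines (assumed pre-1.0 API), this schema constrains decoding:
#   model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")
#   generator = outlines.generate.json(model, json.dumps(EVAL_SCHEMA))
#   result = generator(prompt)  # output conforms to EVAL_SCHEMA by construction

def parse_gate_output(raw: str) -> dict:
    """Constrained decoding makes raw valid JSON for EVAL_SCHEMA,
    so parsing never raises JSONDecodeError."""
    out = json.loads(raw)
    assert set(EVAL_SCHEMA["required"]) <= out.keys()
    return out
```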
Video Upload Security
Video file uploads are validated through 5 layers of security before processing.
Threat Model
Video files can carry:
Oversized payloads — denial of service via memory/disk exhaustion
Extension spoofing — .mp4 extension with non-video content
MIME type mismatches — content that doesn't match declared type
Malformed headers — files with invalid or missing magic bytes
Path traversal — filenames with ../ or special characters
Validation Chain
| Validator | What It Checks |
|---|---|
| File size | File does not exceed the configured maximum size |
| Extension | Extension is in the configured allowlist |
| MIME type | Content MIME type, detected from the file bytes, matches an allowed video type |
| Magic bytes | First bytes match known video container magic numbers |
| Filename | Strips path components, normalizes Unicode, removes special characters |
All validators run before the file is passed to the video analysis pipeline. Validation is configured via Django settings and can be adjusted per deployment.
Design Decision: the composite VideoFileValidator runs all 5 checks and reports all failures at once (rather than fail-fast), so operators see the complete list of issues.
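The report-all-failures behavior can be sketched as below. The limit, allowlist, and check messages are illustrative assumptions, not the project's Django settings, and only three of the five checks are shown:

```python
# Simplified sketch of a report-all-failures composite validator.
# MAX_BYTES and ALLOWED_EXTENSIONS are assumed values, not real settings.
import os
import unicodedata

MAX_BYTES = 500 * 1024 * 1024                   # assumed size limit
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm"}  # assumed allowlist

def validate_upload(filename: str, size: int) -> list[str]:
    """Run every check and return ALL failures, never stopping early."""
    errors = []
    if size > MAX_BYTES:
        errors.append("file too large")
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"extension {ext!r} not allowed")
    # Sanitization check: normalized basename must equal the input name.
    if os.path.basename(unicodedata.normalize("NFKC", filename)) != filename:
        errors.append("filename needs sanitization")
    return errors
```

Returning a list instead of raising on the first failure is what lets operators fix an oversized, mis-named upload in one round trip instead of three.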
Model Memory Management
GPU VRAM is the primary constraint. The system manages memory through:
Sequential Execution
The Celery solo pool ensures only one task runs at a time. Combined with module-level model caching, this means:
Only one diffusion model is loaded at a time
Switching models evicts the previous one and frees VRAM
Prompt enhancement and pipeline LLMs are evicted before image generation
Eviction Strategy
Before image generation:
_evict_enhancer() — clears cached HFPromptEnhancer (Qwen 3B)
_evict_pipeline_model() — clears PipelineModelLoader singleton (Qwen 7B)
gc.collect() + torch.cuda.empty_cache() / torch.mps.empty_cache()
This ensures the full VRAM budget is available for the diffusion model.
Quantization
LLM models support load_in_4bit for ~4x memory reduction (via bitsandbytes)
Diffusion models support load_in_8bit (Qwen-Image only)
All diffusion models use bfloat16 or float16 precision
VAE is kept in float32 on Apple Silicon (MPS) to avoid precision issues
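As a configuration sketch, 4-bit loading via transformers and bitsandbytes typically looks like the following; the model ID and quantization options are assumptions, and the actual load is left commented out because it downloads multi-GB weights:

```python
# Configuration sketch only. The quant type and model ID are assumptions;
# the from_pretrained call is commented out to avoid a large download.
QUANT_KWARGS = dict(
    load_in_4bit=True,          # ~4x memory reduction vs. 16-bit weights
    bnb_4bit_quant_type="nf4",  # assumed quantization type
)

# import torch
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen2.5-7B-Instruct",  # assumed model ID
#     quantization_config=BitsAndBytesConfig(
#         bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS),
#     device_map="auto",
# )
```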