# Video Upload & Analysis
This guide covers uploading video files, the security validation pipeline, and the 10-phase video analysis process.
## Overview
The video analysis pipeline extracts structured data from uploaded video files (scenes, transcription, objects, visual style, sentiment, and audience insights) and produces a `VideoProcessingResult` that can be used to create origin ad units.
```mermaid
graph TD
    Upload[Upload Video] --> Validate[Security Validation]
    Validate --> Metadata[1. Extract Metadata]
    Metadata --> Scenes[2. Detect Scenes]
    Scenes --> Transcribe[3. Transcribe Audio]
    Transcribe --> Keyframes[4. Extract Keyframes]
    Keyframes --> Objects[5. Detect Objects]
    Objects --> Style[6. Analyze Visual Style]
    Style --> Sentiment[7. Analyze Sentiment]
    Sentiment --> Categories[8. Categorize Scenes]
    Categories --> Script[9. Generate Script]
    Script --> Insights[10. Audience Insights]
    Insights --> Result[VideoProcessingResult]
```
## Uploading a Video
Videos are uploaded via the Ad Unit Media model, linked to a Campaign:
1. Navigate to TV Spots > Ad Unit Media > Add Ad Unit Media
2. Select the Campaign
3. Upload an MP4 video file
4. Save. The `analyze_video_task` is queued automatically.
The upload process runs security validation before saving (see below).
## Security Validation
The `VideoFileValidator` runs five validators in sequence on every upload:

| Validator | What It Checks |
|---|---|
| `FileSizeValidator` | File size between 1 byte and 500 MB (configurable) |
| `FileExtensionValidator` | File extension against the configured whitelist |
| `MimeTypeValidator` | Detected MIME type matches an allowed video type |
| `FileHeaderValidator` | File magic bytes match known video signatures (MP4 `ftyp` box, AVI RIFF, MKV/WebM EBML header) |
| `FilenameSanitizer` | Removes path traversal patterns from the filename |
All validation failures are logged to the `cw.lib.security` logger for security auditing. If any validator fails, the upload is rejected with a combined error message.
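The magic-byte check can be sketched as a standalone predicate. The signatures below come from the table above (MP4 `ftyp` box, AVI RIFF, MKV/WebM EBML header); the function name and read offsets are illustrative, not the project's actual `FileHeaderValidator` code.

```python
def looks_like_video(header: bytes) -> bool:
    """Heuristic magic-byte check over the first bytes of an upload.

    Covers the three families named in the validation table: MP4 (a
    4-byte box size followed by the literal 'ftyp'), AVI (RIFF
    container with an 'AVI ' form type), and MKV/WebM (EBML header).
    Illustrative sketch only.
    """
    if len(header) >= 12:
        # MP4/MOV: box size at offset 0, 'ftyp' at offset 4
        if header[4:8] == b"ftyp":
            return True
        # AVI: 'RIFF' at offset 0, form type 'AVI ' at offset 8
        if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
            return True
    # MKV/WebM: EBML magic at offset 0
    return header[:4] == b"\x1a\x45\xdf\xa3"
```

A validator would read roughly the first 16 bytes of the upload and reject the file when this returns `False`, logging the failure as described above.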
## Analysis Phases
The `analyze_video_task` orchestrates 10 analysis phases with progress tracking:
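Progress tracking amounts to mapping each phase's internal completion onto its slice of the overall 0-100 scale (for example, scene detection owns 10-30%). The helper below is a hypothetical sketch of that mapping using only the ranges documented in the phase descriptions; it is not the task's actual code.

```python
# Documented (start, end) progress slices for a subset of phases;
# phase keys are illustrative names, not the pipeline's identifiers.
PHASE_RANGES = {
    "metadata": (0, 10),
    "scenes": (10, 30),
    "transcription": (30, 50),
    "keyframes_objects": (50, 70),
    "audience_insights": (90, 100),
}


def overall_progress(phase: str, fraction: float) -> int:
    """Map a phase-local completion fraction (0.0-1.0) onto the
    pipeline's overall 0-100 progress scale."""
    start, end = PHASE_RANGES[phase]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range input
    return round(start + (end - start) * fraction)
```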
### Phase 1: Extract Metadata

Extracts video metadata using PyAV:

- Duration, resolution (width/height), frame rate
- Audio channels and sample rate
- File size

Progress: 0–10%
### Phase 2: Scene Detection

Detects scene boundaries using PySceneDetect:

- Identifies scene transitions (cuts, fades, dissolves)
- Returns the scene list with start/end times and duration

Progress: 10–30%
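The core idea behind content-based scene detection can be shown with a toy stand-in: flag a cut wherever the change between consecutive frames exceeds a threshold, then turn cut indices into timed scene records. The real phase uses PySceneDetect, which compares full frame content rather than the single per-frame value used here; the threshold value is illustrative.

```python
def detect_cuts(frame_brightness: list[float], threshold: float = 30.0) -> list[int]:
    """Toy cut detector in the spirit of PySceneDetect's ContentDetector:
    a cut is declared where the difference between consecutive frame
    values exceeds the threshold. Simplified stand-in, not the real API."""
    return [
        i for i in range(1, len(frame_brightness))
        if abs(frame_brightness[i] - frame_brightness[i - 1]) > threshold
    ]


def cuts_to_scenes(cuts: list[int], total_frames: int, fps: float) -> list[dict]:
    """Turn cut frame indices into scene records with start/end times
    and duration in seconds, matching the shape this phase returns."""
    bounds = [0, *cuts, total_frames]
    return [
        {
            "start": bounds[i] / fps,
            "end": bounds[i + 1] / fps,
            "duration": (bounds[i + 1] - bounds[i]) / fps,
        }
        for i in range(len(bounds) - 1)
    ]
```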
### Phase 3: Audio Transcription

Transcribes audio using Whisper (Large v3):

- Full transcription with timestamps per segment
- Language detection and confidence score
- Speaker identification

Progress: 30–50%
### Phase 4: Keyframe Extraction & Object Detection

Extracts one keyframe per scene (the middle frame) and runs YOLOv8x object detection:

- Keyframes saved as JPEG images
- Object detection with bounding boxes and confidence scores
- Objects summarized across all frames (counts, most common classes)

Progress: 50–70%
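Summarizing detections across keyframes reduces to counting labels. A sketch of that aggregation step, assuming each detection is a dict with at least `label` and `confidence` keys (the exact shape of the pipeline's detection records is not documented here):

```python
from collections import Counter


def summarize_detections(frames: list[list[dict]]) -> dict:
    """Aggregate per-frame detections into overall counts, the most
    common classes, and a mean confidence per class, as the summary
    step described above."""
    counts: Counter = Counter()
    conf_sums: Counter = Counter()
    for detections in frames:
        for det in detections:
            counts[det["label"]] += 1
            conf_sums[det["label"]] += det["confidence"]
    return {
        "total_objects": sum(counts.values()),
        "class_counts": dict(counts),
        "most_common": [label for label, _ in counts.most_common(3)],
        "mean_confidence": {
            label: conf_sums[label] / n for label, n in counts.items()
        },
    }
```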
### Phase 5: Visual Style Analysis

Analyzes visual style from keyframes using OpenCV with k-means clustering:

- Dominant color palette (hex codes)
- Average brightness and contrast
- Lighting distribution (soft, harsh, dramatic)
- Exposure distribution (normal, overexposed, underexposed)
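The brightness and contrast metrics are the usual simple definitions: mean and standard deviation of grayscale pixel values. A minimal pure-Python sketch (the real phase computes this per keyframe with OpenCV and adds k-means clustering for the dominant palette):

```python
def brightness_contrast(pixels: list[int]) -> tuple[float, float]:
    """Brightness as the mean grayscale value (0-255) and contrast as
    the population standard deviation. Simplified stand-in for the
    OpenCV-based computation described above."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return mean, variance ** 0.5
```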
### Phase 6: Camera Work Analysis

Derived from scene data:

- Average scene duration and pacing (fast/medium/slow)
- Total scene count and number of transitions
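Pacing classification is a bucketing of the average scene duration. The cutoffs below (2 s and 5 s) are assumptions for illustration; the pipeline's real thresholds are not documented here.

```python
def classify_pacing(scene_durations: list[float]) -> str:
    """Bucket the average scene duration into fast/medium/slow,
    mirroring the pacing label this phase derives from scene data.
    Threshold values are illustrative."""
    avg = sum(scene_durations) / len(scene_durations)
    if avg < 2.0:
        return "fast"
    if avg < 5.0:
        return "medium"
    return "slow"
```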
### Phase 7: Sentiment Analysis

Combines text and visual signals:

- Text sentiment: keyword-based analysis of the transcription
- Visual sentiment: derived from brightness, color warmth, and detected objects
- Overall sentiment score and confidence
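The text half of this phase can be sketched as keyword counting: score the transcript by the balance of positive and negative hits and attach a hit-count confidence. The word lists, scoring formula, and confidence heuristic below are all illustrative assumptions.

```python
POSITIVE = {"great", "love", "best", "amazing", "new"}
NEGATIVE = {"worst", "bad", "hate", "boring"}


def text_sentiment(transcript: str) -> dict:
    """Keyword-based scoring in the spirit of the text signal above:
    score = (positive hits - negative hits) / total hits, with a
    simple hit-count confidence. Word lists are illustrative."""
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    if hits == 0:
        return {"label": "neutral", "score": 0.0, "confidence": 0.0}
    score = (pos - neg) / hits
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score, "confidence": min(hits / 5, 1.0)}
```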
### Phase 8: Scene Categorization

Categorizes each scene into one of 10 categories based on detected objects, transcription content, and visual features:

- People, product, lifestyle, action, nature, urban, food, technology, abstract, other
- Primary categories identified from frequency across scenes
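One way the object signal can feed categorization is a label-overlap vote: pick the category whose vocabulary matches the most detected labels. The mapping below is a guess at the approach, using COCO class names as YOLO would emit them; the project's real rules also weigh transcription content and visual features.

```python
# Hypothetical category vocabularies keyed by a subset of the 10
# categories listed above; labels are standard COCO class names.
CATEGORY_OBJECTS = {
    "people": {"person"},
    "food": {"pizza", "cup", "bowl", "sandwich"},
    "urban": {"car", "bus", "truck", "traffic light"},
    "technology": {"laptop", "cell phone", "tv", "keyboard"},
    "nature": {"bird", "dog", "horse", "potted plant"},
}


def categorize_scene(object_labels: list[str]) -> str:
    """Pick the category whose vocabulary overlaps most with the
    scene's detected labels, falling back to 'other'. Illustrative
    sketch of the object-based signal only."""
    best, best_hits = "other", 0
    for category, vocab in CATEGORY_OBJECTS.items():
        hits = sum(label in vocab for label in object_labels)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```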
### Phase 9: Script Generation

Generates a structured script by mapping transcription segments to detected scenes:

- Each scene gets a visual description and associated voiceover text
- Output format matches the TV spot JSON schema (scene number, duration, visual, audio)
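The mapping step can be sketched as time-overlap matching: attach each transcription segment to the scene whose span contains the segment's midpoint, then emit one script row per scene. The row keys follow the schema named above (scene number, duration, visual, audio); the input dict shapes are assumptions.

```python
def build_script(scenes: list[dict], segments: list[dict]) -> list[dict]:
    """Assemble script rows by assigning each transcription segment to
    the scene whose time span contains the segment's midpoint. Input
    shapes (scene 'start'/'end'/'description', segment
    'start'/'end'/'text') are assumed for illustration."""
    rows = []
    for number, scene in enumerate(scenes, start=1):
        voiceover = " ".join(
            seg["text"] for seg in segments
            if scene["start"] <= (seg["start"] + seg["end"]) / 2 < scene["end"]
        )
        rows.append({
            "scene": number,
            "duration": round(scene["end"] - scene["start"], 2),
            "visual": scene.get("description", ""),
            "audio": voiceover,
        })
    return rows
```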
### Phase 10: Audience Insights

Generates audience targeting insights using an LLM (Qwen 2.5):

- Primary audience demographics and psychographics
- Secondary audience segments
- Market potential analysis (high-fit markets, considerations)
- Schema-validated output via structured generation

Progress: 90–100%
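Schema validation of structured LLM output can be as simple as checking required keys and types before accepting the payload. The key names below are hypothetical; the pipeline's real schema is not reproduced here.

```python
# Hypothetical top-level keys for the audience insights payload.
REQUIRED_KEYS = {"primary_audience", "secondary_audiences", "market_potential"}


def validate_insights(payload: dict) -> list[str]:
    """Minimal structural check on the LLM's JSON output, standing in
    for the schema-validated structured generation described above.
    Returns a list of error strings; empty means the payload passes."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - payload.keys())]
    if "secondary_audiences" in payload and not isinstance(
        payload["secondary_audiences"], list
    ):
        errors.append("secondary_audiences must be a list")
    return errors
```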
## VideoProcessingResult

All analysis output is stored in a single `VideoProcessingResult` record whose JSON fields hold:

- Detected scenes with timestamps, visual descriptions, objects, colors, sentiment
- The generated script in TV spot JSON format
- The full transcription with timestamps, language, confidence, segments
- Color palette, brightness, contrast, lighting, camera work
- Aggregated object detection: total count, per-class counts and confidence
- Overall sentiment, text sentiment, visual sentiment, scores
- Scene categorization summary with primary categories
- LLM-generated demographics, psychographics, market potential
- Total processing time in seconds
- Which model/library was used for each analysis phase
## KeyFrame Records

Each extracted keyframe is saved as a `KeyFrame` record linked to the result:

- Scene number and timestamp
- Saved image file
- Per-frame object detection results (label, confidence, bounding box)
- Dominant colors for the specific frame
- Embedding field (reserved for future similarity search)
## AdUnitMedia Lifecycle

An `AdUnitMedia` record moves through the following statuses:

- Upload not yet started
- File uploaded, awaiting processing
- Analysis task running
- Analysis complete, result attached
- Analysis failed
- Results reviewed and approved
The task includes retry logic with exponential backoff (up to 3 retries, starting at 60 seconds).
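The retry schedule described above ("up to 3 retries, starting at 60 seconds") can be sketched as a doubling delay series; the doubling factor is an assumption, as is the helper name.

```python
def backoff_schedule(base: int = 60, retries: int = 3) -> list[int]:
    """Delays in seconds before each retry attempt, assuming the
    exponential backoff doubles from the documented 60-second start."""
    return [base * 2 ** attempt for attempt in range(retries)]
```

With the defaults this yields delays of 60, 120, and 240 seconds before the first, second, and third retry.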
## Viewing Results
After processing completes:
1. View the Ad Unit Media record; metadata fields (duration, resolution, etc.) are populated
2. Click through to the linked Video Processing Result for the full analysis data
3. View Key Frames as inline images with object detection overlay data
4. Use the result to inform the creation of an origin ad unit with accurate script rows