video-assembly

by randysalars

Assemble final video from audio, images, and subtitles


When & Why to Use This Skill

The Video Assembly skill automates the creation of professional Full HD videos by seamlessly integrating mastered audio, scene images, and VTT subtitles. It streamlines the post-production process using FFmpeg and Python scripts to handle cinematic transitions, precise subtitle synchronization, and section-based image mapping, delivering YouTube-ready MP4 files with minimal manual effort.

Use Cases

  • Automated YouTube Channel Management: Efficiently transform audio-heavy content, such as guided meditations or podcasts, into visually engaging videos with synchronized captions.
  • AI-Driven Content Creation: Combine AI-generated images (from Stable Diffusion or Midjourney) with synthesized voiceovers to produce high-quality video content at scale.
  • Educational and Instructional Media: Rapidly assemble lecture slides and audio recordings into accessible video formats with embedded subtitles for better learner engagement.
  • Social Media Asset Production: Generate polished video clips from static assets and VTT files to maintain a consistent and professional visual presence across digital platforms.
name: Video Assembly
tier: 3
load_policy: task-specific
description: Assemble final video from audio, images, and subtitles
version: 1.0.0
parent_skill: production-operations

Video Assembly Skill

The Visual Wrapper for the Audio Journey

This skill handles assembling the final video from mastered audio, scene images, and VTT subtitles.


Purpose

Create YouTube-ready video files that complement the hypnotic audio experience.


Video Standards

| Parameter | Standard |
| --- | --- |
| Resolution | 1920x1080 (Full HD) |
| Aspect Ratio | 16:9 |
| Codec | H.264 |
| Frame Rate | 24 fps (cinematic) |
| Audio Codec | AAC 320kbps |
| Container | MP4 |

Input Requirements

| Input | File | Required |
| --- | --- | --- |
| Master Audio | {session}_MASTER.mp3 | Yes |
| Scene Images | images/uploaded/*.png | Yes |
| Subtitles | output/subtitles.vtt | Yes |
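Before invoking the assembly script, a minimal pre-flight check can confirm these inputs exist. The sketch below is hypothetical (not one of the shipped scripts) and assumes the master audio sits under output/ inside the session directory, as in the FFmpeg example further down.

# check_inputs.py - hypothetical pre-flight check for the inputs listed above
import sys
from pathlib import Path

def check_inputs(session_dir: str) -> bool:
    session = Path(session_dir)
    required = {
        "master audio": session / "output" / f"{session.name}_MASTER.mp3",  # assumed location
        "subtitles": session / "output" / "subtitles.vtt",
    }
    ok = True
    for label, path in required.items():
        if not path.exists():
            print(f"MISSING {label}: {path}")
            ok = False
    if not sorted((session / "images" / "uploaded").glob("*.png")):
        print("MISSING scene images in images/uploaded/")
        ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_inputs(sys.argv[1]) else 1)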

Scene Image Generation

Primary Method: Stable Diffusion (Default)

python3 scripts/core/generate_scene_images.py sessions/{session}/

Alternative: Midjourney Prompts

python3 scripts/core/generate_scene_images.py sessions/{session}/ --midjourney-only

Alternative: Stock Images

python3 scripts/core/generate_scene_images.py sessions/{session}/ --method stock

Image Specifications

| Property | Requirement |
| --- | --- |
| Resolution | 1920x1080 minimum |
| Format | PNG or JPEG |
| Aspect Ratio | 16:9 |
| Naming | scene_01.png, scene_02.png, etc. |
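These specifications can be checked mechanically before assembly. The sketch below uses Pillow, which is not listed as a project dependency, so treat the whole helper as an assumption rather than part of the toolkit.

# validate_images.py - hypothetical helper; assumes Pillow is installed
import re
from pathlib import Path
from PIL import Image

NAME_RE = re.compile(r"scene_\d{2}\.(png|jpe?g)$")

def validate_images(image_dir: str) -> list[str]:
    problems = []
    for path in sorted(Path(image_dir).iterdir()):
        if not NAME_RE.match(path.name):
            problems.append(f"{path.name}: does not follow scene_NN.png naming")
            continue
        with Image.open(path) as img:
            w, h = img.size
        if w < 1920 or h < 1080:
            problems.append(f"{path.name}: {w}x{h} is below the 1920x1080 minimum")
        if abs(w / h - 16 / 9) > 0.01:
            problems.append(f"{path.name}: {w}x{h} is not 16:9")
    return problems

Run it against images/uploaded/ and resolve anything it reports before assembling.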

VTT Subtitle Generation

python3 scripts/ai/vtt_generator.py sessions/{session}

VTT Format

WEBVTT
Kind: captions
Language: en

1
00:00:00.000 --> 00:00:05.500
Welcome to this healing journey.

2
00:00:06.000 --> 00:00:12.000
Find a comfortable position and
allow your eyes to close.

Subtitle Guidelines

| Guideline | Value |
| --- | --- |
| Max lines per caption | 2 |
| Max characters per line | ~80 |
| Min duration | 1.5 seconds |
| Max duration | 7 seconds |
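These limits can be enforced with a quick pass over the generated file. The sketch below parses cue timings with a regex rather than a dedicated VTT library, so it assumes well-formed output like the example above.

# check_vtt.py - sketch; assumes cue blocks look like the WEBVTT example above
import re
from pathlib import Path

CUE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def check_vtt(path: str) -> None:
    for block in Path(path).read_text(encoding="utf-8").split("\n\n"):
        lines = [l for l in block.strip().splitlines() if l.strip()]
        cue_idx = next((i for i, l in enumerate(lines) if CUE_RE.search(l)), None)
        if cue_idx is None:
            continue  # WEBVTT header or metadata block
        m = CUE_RE.search(lines[cue_idx])
        start, end = to_seconds(*m.groups()[:4]), to_seconds(*m.groups()[4:])
        text = lines[cue_idx + 1:]
        label = lines[cue_idx]
        if not 1.5 <= end - start <= 7.0:
            print(f"{label}: duration {end - start:.1f}s outside 1.5-7s")
        if len(text) > 2:
            print(f"{label}: {len(text)} caption lines (max 2)")
        if any(len(l) > 80 for l in text):
            print(f"{label}: line longer than ~80 characters")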

Video Assembly Command

python3 scripts/core/assemble_session_video.py sessions/{session}/

This automatically:

  • Sequences images based on script sections
  • Adds cross-fade transitions
  • Syncs subtitles to audio
  • Outputs to output/video/session_final.mp4
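When several sessions are produced in a batch, the three documented scripts can be chained with a thin wrapper. This is a sketch built around the commands shown in this document, not an official entry point, and it assumes each script exits nonzero on failure.

# run_pipeline.py - hypothetical wrapper chaining the documented scripts
import subprocess
import sys

STEPS = [
    ["python3", "scripts/core/generate_scene_images.py"],
    ["python3", "scripts/ai/vtt_generator.py"],
    ["python3", "scripts/core/assemble_session_video.py"],
]

def run_pipeline(session_dir: str) -> None:
    for step in STEPS:
        cmd = step + [session_dir]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop at the first failing step

if __name__ == "__main__":
    run_pipeline(sys.argv[1])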

Manual FFmpeg Assembly

For custom control:

# Create video from images with audio
ffmpeg -y \
  -framerate 1/10 \
  -pattern_type glob -i 'images/uploaded/*.png' \
  -i output/{session}_MASTER.mp3 \
  -c:v libx264 -r 24 -pix_fmt yuv420p \
  -c:a aac -b:a 320k \
  -shortest \
  output/video/session_final.mp4
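One caveat with the fixed -framerate 1/10: combined with -shortest, the output ends when the image sequence runs out (8 images x 10 s = 80 s), not when the audio does. A sketch that derives the per-image duration from the audio length via ffprobe keeps the slideshow spanning the full session; ffmpeg and ffprobe on PATH are assumed.

# slideshow_framerate.py - sketch; spreads the images evenly across the audio
import glob
import subprocess
import sys

def audio_duration(path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def build_command(session: str) -> list[str]:
    images = glob.glob("images/uploaded/*.png")
    seconds_per_image = audio_duration(f"output/{session}_MASTER.mp3") / len(images)
    return [
        "ffmpeg", "-y",
        "-framerate", f"{1 / seconds_per_image:.6f}",  # e.g. ~0.004444 for 8 images over 30 min
        "-pattern_type", "glob", "-i", "images/uploaded/*.png",
        "-i", f"output/{session}_MASTER.mp3",
        "-c:v", "libx264", "-r", "24", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-b:a", "320k",
        "-shortest",
        "output/video/session_final.mp4",
    ]

if __name__ == "__main__":
    subprocess.run(build_command(sys.argv[1]), check=True)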

Image-to-Section Mapping

Images should correspond to script sections:

| Image | Section | Timing |
| --- | --- | --- |
| scene_01.png | Pre-Talk | 0:00-3:00 |
| scene_02.png | Induction | 3:00-8:00 |
| scene_03.png | Deepening | 8:00-12:00 |
| scene_04.png | Journey Start | 12:00-17:00 |
| scene_05.png | Journey Core | 17:00-22:00 |
| scene_06.png | Helm/Deepest | 22:00-25:00 |
| scene_07.png | Integration | 25:00-28:00 |
| scene_08.png | Emergence | 28:00-30:00 |
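The assembly script performs this mapping automatically, but the same idea can be expressed with FFmpeg's concat demuxer: list each image with its section duration, then encode once against the master audio. The durations below are hard-coded from the table above; treat the whole thing as an illustrative sketch.

# build_concat_list.py - sketch of section-based image mapping via the concat demuxer
from pathlib import Path

# (image, duration in seconds) taken from the mapping table above
SECTIONS = [
    ("scene_01.png", 180), ("scene_02.png", 300), ("scene_03.png", 240),
    ("scene_04.png", 300), ("scene_05.png", 300), ("scene_06.png", 180),
    ("scene_07.png", 180), ("scene_08.png", 120),
]

def write_concat_list(image_dir: str, out_path: str = "concat_list.txt") -> None:
    lines = []
    for name, duration in SECTIONS:
        lines.append(f"file '{Path(image_dir) / name}'")
        lines.append(f"duration {duration}")
    # concat demuxer quirk: repeat the last file so its duration is honored
    lines.append(f"file '{Path(image_dir) / SECTIONS[-1][0]}'")
    Path(out_path).write_text("\n".join(lines) + "\n")

# Then, roughly:
#   ffmpeg -y -f concat -safe 0 -i concat_list.txt -i output/{session}_MASTER.mp3 \
#     -c:v libx264 -r 24 -pix_fmt yuv420p -c:a aac -b:a 320k -shortest \
#     output/video/session_final.mp4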

Transition Effects

| Transition | Duration | Use For |
| --- | --- | --- |
| Cross-dissolve | 2-3 seconds | Section transitions |
| Fade from black | 3 seconds | Opening |
| Fade to black | 3 seconds | Closing |
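The assembly script adds these transitions itself; for a hand-built cross-dissolve between two stills, FFmpeg's xfade filter produces the same effect. The sketch below loops each image into a short clip and fades between them; the hold and fade durations are illustrative.

# crossfade_pair.py - sketch of a 2 s cross-dissolve between two stills using xfade
import subprocess

def crossfade(img_a: str, img_b: str, out: str, hold: float = 10.0, fade: float = 2.0) -> None:
    # The fade starts `hold - fade` seconds into the first clip.
    subprocess.run([
        "ffmpeg", "-y",
        "-loop", "1", "-framerate", "24", "-t", str(hold), "-i", img_a,
        "-loop", "1", "-framerate", "24", "-t", str(hold), "-i", img_b,
        "-filter_complex",
        f"[0:v][1:v]xfade=transition=fade:duration={fade}:offset={hold - fade},format=yuv420p",
        "-c:v", "libx264", "-r", "24",
        out,
    ], check=True)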

Output Files

| File | Location | Purpose |
| --- | --- | --- |
| Final video | output/video/session_final.mp4 | Direct use |
| YouTube copy | output/youtube_package/final_video.mp4 | Upload ready |

Video Overlay Generation

Generate supporting graphics:

python3 scripts/core/generate_video_images.py sessions/{session}/ --all

Creates:

  • title_card.png - Video intro screen
  • sections/section_*.png - Chapter transitions
  • outro.png - End screen
  • social_preview.png - Social sharing

Quality Verification

After assembly:

# Check video properties
ffprobe -v error -show_format -show_streams output/video/session_final.mp4

# Play with VLC to verify sync
vlc output/video/session_final.mp4

Quality Checklist

  • Resolution is 1920x1080
  • Frame rate is 24 fps
  • Audio syncs with subtitles
  • Transitions are smooth
  • No visible artifacts
  • Duration matches audio
  • File size reasonable (<2GB for 30 min)
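Most of this checklist can be automated by parsing ffprobe's JSON output. The sketch below asserts the standards from the table at the top of this document; field names follow ffprobe's standard JSON schema.

# verify_video.py - sketch; checks the assembled file against the video standards
import json
import subprocess
from pathlib import Path

def probe(path: str) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def verify(path: str) -> None:
    info = probe(path)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    audio = next(s for s in info["streams"] if s["codec_type"] == "audio")
    assert (video["width"], video["height"]) == (1920, 1080), "resolution is not 1920x1080"
    assert video["r_frame_rate"] == "24/1", "frame rate is not 24 fps"
    assert video["codec_name"] == "h264" and audio["codec_name"] == "aac", "unexpected codecs"
    assert Path(path).stat().st_size < 2 * 1024**3, "file larger than 2 GB"
    print(f"OK: {float(info['format']['duration']):.1f} s, all checks passed")

Audio/subtitle sync and visual artifacts still need the manual VLC pass; this only covers the measurable items.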

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Audio/video desync | Different stream durations | Use the -shortest flag |
| Pixelated video | Wrong pixel format | Use -pix_fmt yuv420p |
| Green frames | Image format issue | Convert images to PNG |
| Subtitle timing off | VTT not scaled | Regenerate the VTT against the actual audio duration |
| File too large | Bitrate too high | Use -crf 23 for a smaller file |

Integration with Pipeline

Before (dependencies):

  • Audio mastered ({session}_MASTER.mp3)
  • Scene images ready (images/uploaded/)
  • VTT subtitles generated (output/subtitles.vtt)

After (next steps):

  • YouTube packaging

Related Resources

  • Skill: tier3-production/audio-mixing/ (input)
  • Skill: tier3-production/youtube-packaging/ (next step)
  • Doc: docs/STOCK_IMAGE_SOP.md
  • Script: scripts/core/assemble_session_video.py