YouTube-to-S3 Media Ingestion Pipeline (Modal / Python)
Overview
A lightweight serverless pipeline that downloads YouTube media and stores it in Amazon S3, deployed as a Modal web endpoint. It deduplicates content by SHA-256 hash before uploading to the bucket that feeds downstream transcription and processing.
Why It Exists
Media-processing and transcription workflows need a reliable way to pull source video/audio from YouTube and land it in object storage on demand, without standing up dedicated infrastructure. This tool wraps that ingestion step into a single callable serverless function.
What We Built
A focused Python script (ytdownloader.py) that runs on Modal. It defines a serverless image with pytube, boto3, and requests; exposes a web endpoint; downloads YouTube content; computes a SHA-256 hash of each file's contents for deduplication; and uploads to an S3 bucket (the okapi-transcribe ingestion bucket). It is intentionally small: a single-purpose ingestion utility rather than a large system.
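The hashing step can be illustrated with a minimal sketch. This is not the code from ytdownloader.py: the function names, the chunk size, and the `<prefix>/<hash><ext>` key layout are assumptions made for illustration. It streams the file through SHA-256 so a large video never has to fit in memory, then derives an object key from the digest.

```python
import hashlib
from pathlib import Path

# Hypothetical chunk size: read 1 MiB at a time instead of loading the whole file.
CHUNK_SIZE = 1024 * 1024


def sha256_of_file(path):
    """Return the hex SHA-256 digest of the file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_key_for(path, prefix="media"):
    """Build a content-addressed key so identical bytes always map to the same key.

    The prefix and key layout are illustrative assumptions, not the original scheme.
    """
    suffix = Path(path).suffix.lower()
    return f"{prefix}/{sha256_of_file(path)}{suffix}"
```

Because the key depends only on the file's bytes, two downloads of the same media produce the same key, which is what makes a deduplication check possible before upload.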
Technologies & Approach
Python with pytube for YouTube retrieval, boto3 for S3 uploads, and Modal for serverless deployment and web endpoints. Content hashing avoids re-uploading duplicates. (Note: an early version hard-coded credentials inline; any production hardening would move them to managed secrets.)
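As a hedged sketch of what moving credentials out of the source could look like: managed secret stores typically surface values to the running function as environment variables, so the script can read them at call time instead of embedding them. The helper name `load_aws_credentials` is hypothetical; the two variable names are the standard ones boto3 itself recognizes.

```python
import os

# Standard environment variable names that boto3 also reads on its own.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_credentials(environ=None):
    """Read AWS credentials from the environment rather than from source code.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Fail loudly at startup instead of failing later with an opaque auth error.
        raise RuntimeError("Missing credential variables: " + ", ".join(missing))
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The returned dict could be passed as `boto3.client("s3", **load_aws_credentials())`. Since boto3 picks up these variables automatically, the explicit helper mainly serves to give a clear error when the secret is not attached.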
Outcome / Impact
The utility proved out a clean serverless ingestion step for a media/transcription pipeline: invoke an endpoint and the content lands in S3, deduplicated, with no servers to manage. It is a small but reusable building block for larger media-data workflows.
Capabilities Demonstrated
- Serverless function design and deployment with Modal
- Media ingestion (YouTube) into object storage (S3) via boto3
- Content-hash deduplication for idempotent uploads
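One way the idempotent-upload idea could be expressed is sketched below, assuming the object key is derived from the file's content hash. The function names are hypothetical and this is not the original implementation. The S3 client is passed in as a parameter; with boto3 it would be `boto3.client("s3")`, whose `head_object` and `upload_file` methods are the two calls used here.

```python
# Error codes that indicate "object not found" on an existence check.
NOT_FOUND_CODES = ("404", "NoSuchKey", "NotFound")


def object_exists(s3_client, bucket, key):
    """Return True if the key is already present in the bucket."""
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # botocore's ClientError carries a `response` dict with the error code;
        # it is inspected by attribute so this sketch does not import botocore.
        error = getattr(exc, "response", None) or {}
        code = str(error.get("Error", {}).get("Code", ""))
        if code in NOT_FOUND_CODES:
            return False
        raise  # permission or network errors should not be mistaken for "missing"


def upload_if_absent(s3_client, local_path, bucket, key):
    """Upload the file only when its key is not already stored.

    Returns True if an upload happened, False if it was skipped as a duplicate.
    """
    if object_exists(s3_client, bucket, key):
        return False
    s3_client.upload_file(local_path, bucket, key)
    return True
```

The check and the upload are not atomic, so two concurrent calls could both upload. With a content-derived key that is harmless: both writes carry identical bytes to the same key, and the result is the same either way.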