Tooling · 2023

YouTube-to-S3 Media Ingestion Pipeline (Modal / Python)

Overview

A lightweight serverless pipeline, deployed as a Modal web endpoint, that downloads YouTube media and stores it in Amazon S3. Before uploading, it hashes each file with SHA-256 so that duplicate content is not stored twice. The bucket feeds a downstream transcription and processing workflow.

Why It Exists

Media-processing and transcription workflows need a reliable way to pull source video/audio from YouTube and land it in object storage on demand, without standing up dedicated infrastructure. This tool wraps that ingestion step into a single callable serverless function.

What We Built

A focused Python script (ytdownloader.py) that runs on Modal. It defines a serverless image with pytube, boto3, and requests; exposes a web endpoint; downloads the requested YouTube content; computes a SHA-256 hash of each file for deduplication; and uploads the result to an S3 bucket (the okapi-transcribe ingestion bucket). It’s intentionally small: a single-purpose ingestion utility rather than a large system.
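The hash-then-upload step can be sketched roughly as follows. This is a minimal illustration, not the code in ytdownloader.py: the function names, the media/ key prefix, and the choice to use the digest as the object key are all assumptions. The S3 client is passed in, so any object with boto3's list_objects_v2 and upload_file methods will work.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def object_exists(s3_client, bucket, key):
    """Return True if an object with exactly this key is already in the bucket."""
    # An exact key sorts before any longer key sharing its prefix,
    # so one listed result is enough to decide.
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))


def upload_if_new(s3_client, bucket, path, prefix="media/"):
    """Upload the file under a content-derived key unless that key already exists.

    Returns (key, uploaded). The "media/" prefix is a hypothetical layout.
    """
    key = f"{prefix}{sha256_of_file(path)}{Path(path).suffix}"
    if object_exists(s3_client, bucket, key):
        return key, False  # identical content is already stored
    s3_client.upload_file(str(path), bucket, key)
    return key, True
```

With a real client this would be called as upload_if_new(boto3.client("s3"), bucket_name, "clip.mp4"). Because the key is derived from the file's bytes, repeating the call with the same content is a no-op, which is what makes the upload idempotent.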

Technologies & Approach

Python, with pytube for YouTube retrieval, boto3 for S3 uploads, and Modal for serverless deployment and web endpoints. Content hashing avoids re-uploading duplicates. (Note: an early version hard-coded credentials inline; any production hardening would move them to managed secrets.)
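One way to move away from inline credentials is to read them from environment variables, which is how managed secret stores typically expose values to a running function. The sketch below assumes the standard AWS variable names; the helper name aws_credentials_from_env is hypothetical, and this is not how the original script was written.

```python
import os

# Standard AWS variable names; assumed to be injected by the secret store.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def aws_credentials_from_env(env=None):
    """Collect AWS credentials from the environment, failing fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    # Keys match the keyword arguments accepted by boto3.client().
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The result can be passed as boto3.client("s3", **aws_credentials_from_env()). boto3 also reads these two variables on its own, so the helper mainly serves to surface a missing secret at startup rather than at the first upload.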

Outcome / Impact

The utility proved out a clean, serverless ingestion step for a media/transcription pipeline: invoke an endpoint and the content lands in S3, deduplicated, with no servers to manage. It’s a small but reusable building block for larger media-data workflows.

Capabilities Demonstrated

  • Serverless function design and deployment with Modal
  • Media ingestion (YouTube) into object storage (S3) via boto3
  • Content-hash deduplication for idempotent uploads