Vertex Transcribe Service
Case overview
Vertex Transcribe Service is an AI-powered media processing platform designed to handle millions of minutes of audio and video content. It takes recordings of educational lectures in Aramaic and English and turns them into clean, formatted text with proper script, diacritics, verified references, and timed subtitles. When a video file arrives, the system automatically detects it, extracts the audio track, and routes it through the same transcription pipeline as audio-only files.
Goal: Build a media pipeline capable of processing millions of minutes of content and delivering publication-ready text, subtitles, and HLS streams with as little manual work as possible. The pipeline had to process both audio and video through a single entry point, transcribe multi-language content with high accuracy, and scale dynamically on Kubernetes to handle batches of 300+ concurrent recordings.
Key project info
Industries
Educational Content Platforms, Religious Institutions, Media Publishing, E-Learning Companies, Lecture Archives, Academic Content Libraries.
Services
AI Transcription, Video Processing, Audio Extraction, HLS Multi-Bitrate Encoding, Subtitle Generation, Batch Orchestration, Source Reference Verification, Cloud Storage Delivery, Thumbnail & Preview Generation.
Solutions
Unified Audio/Video Pipeline, Automatic Format Detection, Multi-Language Transcription, Script Conversion with Diacritics, Silence-Based Chunking, Timestamp Stitching, Religious Reference Verification, Dynamic AI Model Selection.
Technologies
Python, FastAPI, Google Vertex AI, Gemini Pro, Gemini Flash, Gemini Flash-Lite, FFmpeg, FFprobe, AWS S3, Google Cloud Storage, Kubernetes, Helm, Docker, ARM Instances, HLS (m3u8), Async Python, Connection Pooling, CI/CD Pipeline.
The challenges
The process
Every file, whether a raw audio lecture or a full video recording, moves through a single automated pipeline. Eight stages take it from raw input to publication-ready output. The transcription stages run in sequence, while video processing runs in parallel so that neither path waits on the other.
Media Detection & Preparation
FFprobe identifies whether the file is audio or video. For video, the audio track is extracted automatically. Duration and format analysis then determine the processing strategy.
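A minimal sketch of the detection step, assuming FFprobe's JSON report is used as the input. The function names `classify_media` and `probe_file` and the returned dictionary layout are hypothetical, chosen for illustration; only the 20-minute threshold comes from the pipeline description.

```python
import json
import subprocess


def classify_media(probe: dict) -> dict:
    """Classify a parsed ffprobe JSON report as audio or video."""
    streams = probe.get("streams", [])
    # Cover art embedded in audio files is reported as a video stream with
    # disposition.attached_pic == 1, so it is not counted as real video.
    has_video = any(
        s.get("codec_type") == "video"
        and not s.get("disposition", {}).get("attached_pic")
        for s in streams
    )
    has_audio = any(s.get("codec_type") == "audio" for s in streams)
    if not has_audio:
        raise ValueError("no audio stream to transcribe")
    # ffprobe reports the container duration as a string of seconds.
    duration = float(probe.get("format", {}).get("duration", 0.0))
    return {
        "kind": "video" if has_video else "audio",
        "duration_s": duration,
        "needs_split": duration > 20 * 60,  # 20-minute rule from the text
    }


def probe_file(path: str) -> dict:
    """Run ffprobe and return its JSON report (requires ffprobe on PATH)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

Keeping the classification separate from the subprocess call means the decision logic can be exercised with canned reports, without FFprobe installed.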
Silence-Based Audio Splitting
Files longer than 20 minutes are split into chunks at natural silence points so that no phrase is cut mid-sentence. The chunks can then be transcribed in parallel.
AI Transcription
Each chunk is sent to Gemini Pro or Flash — selected by content length — with a structured schema that forces the model to return timestamped text with speaker labels.
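The selection rule could look roughly like the sketch below. The thresholds, the `mixed_language` flag, and the tier labels are assumptions for illustration: the labels are not real Vertex AI model identifiers, and the project's actual cut-offs are not given in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    tier: str    # hypothetical label, not a real model ID
    reason: str


def select_model(duration_s: float, mixed_language: bool = False,
                 short_s: float = 300.0, long_s: float = 1200.0) -> ModelChoice:
    """Pick a model tier from content length and type.

    `short_s` and `long_s` are placeholder thresholds.
    """
    if mixed_language:
        # Assumption: mixed-language audio always gets the strongest tier.
        return ModelChoice("pro", "mixed-language content")
    if duration_s <= short_s:
        return ModelChoice("flash-lite", "short content")
    if duration_s <= long_s:
        return ModelChoice("flash", "medium-length content")
    return ModelChoice("pro", "long content")
```

Returning the reason alongside the tier makes it easy to log why a given recording was routed to a more expensive model.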
Timeline Merging
All transcribed pieces are stitched back with correct time offsets into one seamless document, with 99% accurate timestamp alignment across the full recording.
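The offset arithmetic behind the merge can be sketched as follows, assuming each chunk's transcript reports times relative to the start of that chunk. The segment layout (dictionaries with `start`, `end`, and `text` keys) and the name `stitch` are hypothetical.

```python
def stitch(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is a list of (chunk_start_s, segments) pairs, where each
    segment is a dict with chunk-relative "start" and "end" in seconds.
    """
    merged = []
    # Sort by chunk start so parallel results can arrive in any order.
    for offset, segments in sorted(chunks, key=lambda c: c[0]):
        for seg in segments:
            merged.append({
                **seg,
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
            })
    return merged
```

For example, a segment at 2.0 to 5.0 seconds inside a chunk that starts at 597.0 seconds lands at 599.0 to 602.0 seconds in the merged document.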
Text Post-Processing
Raw transcription goes through script conversion, diacritics application, formatting cleanup, and religious source citation verification against an external database.
Video Processing (Parallel)
While transcription runs, the video module handles HLS multi-bitrate encoding, thumbnail generation, preview clip creation, and multi-audio-stream handling via FFmpeg.
Subtitles & Summary Generation
From the final verified text, timed subtitle files (.vtt / .srt) are generated alongside an automatic metadata summary for the content library.
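As an illustration of the subtitle step, the sketch below renders merged segments as WebVTT. It assumes segments are dictionaries with absolute `start` and `end` times in seconds plus a `text` field; that layout and the function names are hypothetical.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt(segments) -> str:
    """Render segments as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

An .srt writer would follow the same pattern, with a numeric index before each cue and a comma instead of a period before the milliseconds.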
Cloud Delivery
Everything (transcription, subtitles, summary, and HLS streams) is uploaded to AWS S3, with links delivered to the content team. Data retention is 100%, even through connection drops.
Solutions
Key features of the solution
Unified Audio & Video Pipeline — FFprobe auto-detects formats. A single entry point handles MP4, MKV, WebM, MOV, and audio with no manual conversion.
Multi-Language AI Transcription — Handles English, Aramaic, and mixed-language recordings with prompting that preserves language boundaries and applies correct script conventions.
Dynamic AI Model Selection — Pro, Flash, and Flash-Lite tiers are chosen automatically by file length and content type — maximizing accuracy while minimizing API spend.
HLS Multi-Bitrate Streaming — Parallel video processing produces adaptive bitrate streams, thumbnails, and preview clips ready for any modern video player.
300+ Concurrent Batch Jobs — Kubernetes-native async architecture handles large batches without blocking. Helm charts manage deployment and scaling on ARM instances.
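The non-blocking batch behaviour can be approximated with a semaphore-bounded asyncio pattern, sketched below. The retry policy shown (exponential backoff with jitter) is one common choice and is an assumption, as are the function names and default values; the source does not describe the project's queueing or backoff scheme in detail.

```python
import asyncio
import random


async def run_with_backoff(job, *, retries=4, base_delay=0.5, max_delay=30.0):
    """Await job(), retrying failures with exponential backoff and jitter."""
    for attempt in range(retries + 1):
        try:
            return await job()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so failed jobs do not retry in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))


async def run_batch(jobs, limit=300, **backoff):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:
            return await run_with_backoff(job, **backoff)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each entry in `jobs` is a callable that returns a fresh coroutine, so a failed attempt can be retried; passing an already-created coroutine object would not allow that.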
Results in numbers
99%
Transcription precision for English and Aramaic audio content, with correct script and diacritics applied automatically.
300+
Transcription jobs processed simultaneously with smart queue management and adaptive backoff.
60%
API cost savings through dynamic model selection, with lighter models handling shorter content automatically.
100%
Data retention: zero data loss even during connection drops, with automatic sync when the connection is restored.