Vertex Transcribe Service

Case overview

An AI-powered media processing platform designed to handle millions of minutes of audio and video content. It takes recordings of educational lectures — in Aramaic and English — and turns them into clean, formatted text with proper script, diacritics, verified references, and timed subtitles. When a video file arrives, the system automatically detects it, extracts the audio track, and routes it through the same transcription pipeline.

Goal: Build a media pipeline capable of processing millions of minutes of content and delivering publication-ready text, subtitles, and HLS streams with as little manual work as possible. The pipeline had to process both audio and video through a single entry point, transcribe multi-language content with high accuracy, and scale dynamically on Kubernetes to handle batches of 300+ concurrent recordings.

Key project info

Industries

Educational Content Platforms, Religious Institutions, Media Publishing, E-Learning Companies, Lecture Archives, Academic Content Libraries.

Services

AI Transcription, Video Processing, Audio Extraction, HLS Multi-Bitrate Encoding, Subtitle Generation, Batch Orchestration, Source Reference Verification, Cloud Storage Delivery, Thumbnail & Preview Generation.

Solutions

Unified Audio/Video Pipeline, Automatic Format Detection, Multi-Language Transcription, Script Conversion with Diacritics, Silence-Based Chunking, Timestamp Stitching, Religious Reference Verification, Dynamic AI Model Selection.

Technologies

Python, FastAPI, Google Vertex AI, Gemini Pro, Gemini Flash, Gemini Flash-Lite, FFmpeg, FFprobe, AWS S3, Google Cloud Storage, Kubernetes, Helm, Docker, ARM Instances, HLS (m3u8), Async Python, Connection Pooling, CI/CD Pipeline.

The challenges

Mixed Language Complexity

Audio switches between Aramaic, English, and other languages mid-recording. Special AI prompting and multi-step text processing were needed to apply correct diacritics and formatting throughout.

Video & Audio in One Pipeline

The system had to handle both pure audio and video containers. FFprobe-based auto-detection extracts the audio stream from any video format before processing — no user intervention required.

Scale for Millions of Minutes

Designed from the ground up for massive volume: fully async, parallelized, and Kubernetes-native with proper resource management to absorb spikes in batch load.
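
One way to picture the async, parallel design is a bounded worker pool. The sketch below is illustrative only: `process_recording` is a hypothetical stand-in for the real per-file work, and the default limit of 300 simply mirrors the batch size quoted in this case study.

```python
import asyncio


async def process_recording(name: str) -> str:
    """Hypothetical stand-in for the real per-file pipeline."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"{name}: done"


async def run_batch(names: list[str], limit: int = 300) -> list[str]:
    """Process all recordings concurrently, with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(name: str) -> str:
        async with gate:
            return await process_recording(name)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(n) for n in names))
```

Calling `asyncio.run(run_batch(["lecture-01", "lecture-02"]))` runs both jobs concurrently. The semaphore caps how many run at once, so a spike in batch size queues work instead of exhausting resources.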

Smart Chunking for Long Lecture Audio

Lectures often exceed one hour. Silence-detection chunking splits files naturally, while timestamp stitching reconstructs a seamless continuous timeline without gaps or overlaps.

Smart Retry System for AI Workloads

Hundreds of concurrent AI jobs push provider limits hard. Smart retry logic, adaptive exponential backoff, and queue management keep the pipeline moving without dropped jobs.
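
A minimal sketch of exponential backoff with jitter follows. It assumes the provider signals throttling through an exception, named `RateLimitError` here purely for illustration. The attempt count and delays are also assumptions, not values from the project.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical error raised when the AI provider throttles a request."""


async def call_with_backoff(op, *, max_attempts=5, base_delay=1.0,
                            max_delay=60.0, sleep=asyncio.sleep):
    """Await `op()` and retry on RateLimitError with full-jitter backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await op()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles each attempt; the random draw spreads retries out
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(random.uniform(0, ceiling))
```

Randomizing the delay helps keep hundreds of throttled jobs from retrying at the same instant. The `sleep` parameter is injectable only so the function can be exercised without real waiting.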

AI Model Selection for Cost Efficiency

Three AI model tiers — powerful, fast, and lightweight — are selected dynamically based on content length and complexity, delivering up to 60% lower API costs on shorter content.
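
The selection rule could look roughly like the sketch below. The thresholds and the `mixed_language` flag are assumptions made for illustration, since the case study does not publish the actual cut-offs. The tier names are labels, not real model identifiers.

```python
def pick_model_tier(duration_s: float, mixed_language: bool = False) -> str:
    """Pick a model tier from content length and complexity.

    The 5-minute and 30-minute thresholds are illustrative assumptions.
    """
    if mixed_language or duration_s > 30 * 60:
        return "pro"         # long or linguistically complex content
    if duration_s > 5 * 60:
        return "flash"       # mid-length content
    return "flash-lite"      # short clips go to the cheapest tier
```

For example, `pick_model_tier(120)` selects the lightweight tier, while a 45-minute lecture or any mixed-language recording is routed to the most capable one.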

The process

Every file, whether a raw audio lecture or a full video recording, moves through a single automated pipeline. Eight stages take it from raw input to publication-ready output. Video processing runs in parallel with transcription, so neither waits on the other.

Media Detection & Preparation

FFprobe identifies whether the file is audio or video. For video, the audio track is extracted automatically. Duration and format analysis then determine the processing strategy.
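
The detection step can be sketched as follows, assuming `ffprobe` and `ffmpeg` are on the PATH. The function names and the choice of mono 16 kHz output are illustrative assumptions, not details taken from the project.

```python
import json
import subprocess


def probe_streams(path: str) -> dict:
    """Run ffprobe and return its JSON description of the file's streams."""
    cmd = ["ffprobe", "-v", "error",
           "-show_entries", "stream=codec_type",
           "-of", "json", path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def classify_media(probe: dict) -> str:
    """Return "video" or "audio" based on the stream types ffprobe reported."""
    kinds = {s.get("codec_type") for s in probe.get("streams", [])}
    if "audio" not in kinds:
        raise ValueError("no audio stream found")
    # Caveat: embedded cover art is also reported as a video stream.
    return "video" if "video" in kinds else "audio"


def extract_audio_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that drops video (-vn) and writes mono 16 kHz audio."""
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]
```

A caller would run `classify_media(probe_streams(path))` and, for video input, execute the command from `extract_audio_cmd` before handing the audio to the transcription stages.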

Silence-Based Audio Splitting

Files longer than 20 minutes are split into chunks at natural silence points so no phrase is cut mid-sentence, enabling parallel transcription across all pieces simultaneously.
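
A minimal sketch of choosing cut points is shown below. It assumes silence intervals have already been detected (for example with FFmpeg's `silencedetect` filter) and are passed in as `(start, end)` pairs in seconds. Only the 20-minute threshold comes from the text above. The 10-minute target and 2-minute tolerance are assumptions.

```python
def plan_chunks(duration, silences, split_threshold=1200.0,
                target=600.0, tolerance=120.0):
    """Return (start, end) chunk boundaries, cutting at silence midpoints.

    Requires target > tolerance so every cut moves forward.
    """
    if duration <= split_threshold:
        return [(0.0, duration)]  # short files are processed whole

    midpoints = sorted((s + e) / 2 for s, e in silences)
    cuts, pos = [], 0.0
    while duration - pos > target + tolerance:
        ideal = pos + target
        nearby = [m for m in midpoints if abs(m - ideal) <= tolerance]
        # Prefer the silence closest to the ideal cut; otherwise cut at the ideal
        cut = min(nearby, key=lambda m: abs(m - ideal)) if nearby else ideal
        cuts.append(cut)
        pos = cut

    bounds = [0.0, *cuts, duration]
    return list(zip(bounds[:-1], bounds[1:]))
```

Because each chunk starts exactly where the previous one ends, the boundaries can later serve as time offsets when the transcribed pieces are merged.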

AI Transcription

Each chunk is sent to Gemini Pro or Flash — selected by content length — with a structured schema that forces the model to return timestamped text with speaker labels.
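
The exact schema is not published, so the shape below is an assumption: a JSON array of segments with `start` and `end` in seconds, a `speaker` label, and the `text`. The sketch shows the schema and a small validator for the model's reply. How the schema is passed to the model is deliberately left out.

```python
import json

# Hypothetical response schema: field names are illustrative assumptions.
TRANSCRIPT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "start": {"type": "number"},
            "end": {"type": "number"},
            "speaker": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["start", "end", "speaker", "text"],
    },
}


def parse_segments(raw: str) -> list[dict]:
    """Parse the model's JSON reply and check required fields and time order."""
    segments = json.loads(raw)
    if not isinstance(segments, list):
        raise ValueError("expected a JSON array of segments")
    required = TRANSCRIPT_SCHEMA["items"]["required"]
    for seg in segments:
        if not isinstance(seg, dict):
            raise ValueError("each segment must be a JSON object")
        missing = [key for key in required if key not in seg]
        if missing:
            raise ValueError(f"segment is missing fields: {missing}")
        if seg["end"] < seg["start"]:
            raise ValueError("segment ends before it starts")
    return segments
```

Validating each reply before merging means a malformed chunk can be retried on its own rather than corrupting the final timeline.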

Timeline Merging

All transcribed pieces are stitched back with correct time offsets into one seamless document, with 99% accurate timestamp alignment across the full recording.
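
Stitching can be sketched as shifting each chunk's timestamps by the chunk's start offset. The `Segment` type and the overlap clamping below are illustrative assumptions rather than the project's actual implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    start: float  # seconds, relative to the chunk
    end: float
    speaker: str
    text: str


def stitch(chunks):
    """Merge (offset_seconds, segments) pairs into one absolute timeline."""
    merged = []
    for offset, segments in sorted(chunks, key=lambda c: c[0]):
        for seg in segments:
            start = seg.start + offset
            end = seg.end + offset
            if merged:
                # Never start before the previous segment ended
                start = max(start, merged[-1].end)
            end = max(end, start)
            merged.append(replace(seg, start=start, end=end))
    return merged
```

Chunks may arrive in any order from parallel transcription. Sorting by offset first restores the original sequence before the timestamps are shifted.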

Text Post-Processing

Raw transcription goes through script conversion, diacritics application, formatting cleanup, and religious source citation verification against an external database.

Video Processing (Parallel)

While transcription runs, the video module handles HLS multi-bitrate encoding, thumbnail generation, preview clip creation, and multi-audio-stream handling via FFmpeg.
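
As a rough illustration of multi-bitrate HLS output, the sketch below builds one FFmpeg command per rendition and writes a matching master playlist. The rendition ladder, bitrates, codecs, and file names are assumptions chosen for the example, not the project's settings.

```python
# (name, frame height in pixels, video bitrate in kbit/s): illustrative ladder
RENDITIONS = [("1080p", 1080, 5000), ("720p", 720, 2800), ("480p", 480, 1400)]
AUDIO_KBPS = 128


def rendition_cmd(src: str, out_dir: str, name: str,
                  height: int, video_kbps: int) -> list[str]:
    """Build an ffmpeg command that encodes one HLS rendition as VOD segments."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",           # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{video_kbps}k",
        "-c:a", "aac", "-b:a", f"{AUDIO_KBPS}k",
        "-f", "hls", "-hls_time", "6", "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{name}_%03d.ts",
        f"{out_dir}/{name}.m3u8",
    ]


def master_playlist() -> str:
    """Return a master .m3u8 that points players at each rendition playlist."""
    lines = ["#EXTM3U"]
    for name, _height, video_kbps in RENDITIONS:
        # BANDWIDTH is in bits per second; approximated from target bitrates
        bandwidth = (video_kbps + AUDIO_KBPS) * 1000
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth}")
        lines.append(f"{name}.m3u8")
    return "\n".join(lines) + "\n"
```

A player reads the master playlist and switches between renditions as network conditions change, which is what makes the stream adaptive.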

Subtitles & Summary Generation

From the final verified text, timed subtitle files (.vtt / .srt) are generated alongside an automatic metadata summary for the content library.
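
Generating a WebVTT file from timed text is mostly a formatting exercise. The sketch below assumes cues arrive as `(start_seconds, end_seconds, text)` tuples, which is an illustrative input shape.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(cues) -> str:
    """Render (start, end, text) cues as a WebVTT document."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        blocks.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}")
    # Blocks are separated by blank lines, as the format requires
    return "\n\n".join(blocks) + "\n"
```

SRT output follows the same idea but numbers each cue, uses a comma instead of a period before the milliseconds, and has no `WEBVTT` header line.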

Cloud Delivery

Everything (transcription, subtitles, summary, and HLS streams) is uploaded to AWS S3, with links delivered to the content team. No data is lost, even through connection drops.

Solutions

Key features of the solution

  • Unified Audio & Video Pipeline: FFprobe auto-detects formats. A single entry point handles MP4, MKV, WebM, MOV, and audio with no manual conversion.

  • Multi-Language AI Transcription: Handles English, Aramaic, and mixed-language recordings with prompting that preserves language boundaries and applies correct script conventions.

  • Dynamic AI Model Selection: Pro, Flash, and Flash-Lite tiers are chosen automatically by file length and content type, maximizing accuracy while minimizing API spend.

  • HLS Multi-Bitrate Streaming: Parallel video processing produces adaptive bitrate streams, thumbnails, and preview clips ready for any modern video player.

  • 300+ Concurrent Batch Jobs: Kubernetes-native async architecture handles large batches without blocking. Helm charts manage deployment and scaling on ARM instances.

Results in numbers

Transcription Accuracy

99%

Precision for English and Aramaic audio content with correct script and diacritics applied automatically.

Concurrent Jobs

300+

Transcription jobs processed simultaneously with smart queue management and adaptive backoff.

Lower API Costs

60%

Savings through dynamic model selection — lighter models handle shorter content automatically.

Data Retention

100%

Zero data loss even during connection drops, with automatic sync when connection is restored.

Got millions of minutes to process? Let's build the pipeline!

Tell us your content challenge or book a free consultation, and we'll outline a solution tailored to your scale, languages, and delivery requirements.