# Audiobook Pipeline
## The Problem
I love books but don't always have time to sit and read. Existing text-to-speech tools solve the "audio" part but miss something crucial: they sound robotic. A tense thriller passage gets the same flat monotone as a tender moment. The result is unlistenable for anything beyond a few minutes.
The question I asked: Can I build a pipeline that reads with emotion — adjusting tone, speed, and voice character based on what the text actually says?
## The Approach
I designed a 4-stage pipeline where each stage solves one piece of the puzzle:
- Stage 1 — Extract: pdfplumber pulls raw text from any PDF, handling multi-column layouts, headers, and page numbers
- Stage 2 — Clean: A local LLM (via Ollama) fixes OCR artifacts, removes garbled text, and normalizes formatting without altering meaning
- Stage 3 — Emotion Annotate: Each sentence gets classified into one of 8 emotions (neutral, joy, sadness, anger, fear, surprise, tension, contemplative) with intensity scores, using batched Ollama inference
- Stage 4 — Synthesize: Kokoro-82M generates audio with emotion-aware voice blending — adjusting speaker, speed, and prosody per emotion. Runs on Apple Silicon MPS for GPU-accelerated inference
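Stage 3's batched classification can be sketched as a prompt builder plus a strict JSON parser. The prompt format, helper names, and fallback behavior below are illustrative assumptions, not the project's actual code, and the canned reply string stands in for a real Ollama call:

```python
import json

# The 8 emotion labels from Stage 3.
EMOTIONS = {"neutral", "joy", "sadness", "anger", "fear",
            "surprise", "tension", "contemplative"}

def build_prompt(sentences):
    """Build one batched classification prompt for the local LLM."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    return (
        f"Classify each sentence into one of {sorted(EMOTIONS)} with an "
        "intensity in [0, 1]. Reply as a JSON list of objects with keys "
        "id, emotion, intensity.\n" + numbered
    )

def parse_response(raw, sentences):
    """Validate the model's JSON reply and pair it back with the sentences,
    clamping intensity and falling back to neutral on unknown labels."""
    out = []
    for item in json.loads(raw):
        emotion = item["emotion"] if item["emotion"] in EMOTIONS else "neutral"
        intensity = min(max(float(item["intensity"]), 0.0), 1.0)
        out.append((sentences[item["id"]], emotion, intensity))
    return out

# A canned reply stands in for a real call to the Ollama HTTP API:
reply = '[{"id": 0, "emotion": "tension", "intensity": 0.8}]'
print(parse_response(reply, ["The door creaked open."]))
# → [('The door creaked open.', 'tension', 0.8)]
```

Validating and clamping the model's reply matters in practice: local LLMs occasionally emit labels outside the schema, and a neutral fallback keeps the synthesis stage from crashing mid-book.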
## Key Results
- Sentence-level emotion classification with JSON checkpoint/resume for long books
- Tunable voice parameters per emotion (speed, speaker blend, silence padding)
- Karpathy-style autoresearch framework for autonomous optimization of voice-emotion mappings
- Full Streamlit GUI: upload PDF, watch progress, play audio, see color-coded emotion timeline
- CLI with subcommands for each pipeline stage and full end-to-end runs
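The "tunable voice parameters per emotion" can be sketched as a lookup table scaled by the classifier's intensity score. The parameter names and values below are illustrative assumptions, not the project's actual tuning:

```python
# Hypothetical per-emotion synthesis settings (speed multiplier, speaker
# blend weight, trailing silence in seconds); values are illustrative.
VOICE_PARAMS = {
    "neutral":       {"speed": 1.00, "blend": 0.0, "pad": 0.15},
    "tension":       {"speed": 1.08, "blend": 0.4, "pad": 0.05},
    "sadness":       {"speed": 0.92, "blend": 0.3, "pad": 0.30},
    "contemplative": {"speed": 0.95, "blend": 0.2, "pad": 0.25},
}

def params_for(emotion, intensity):
    """Interpolate from the neutral baseline toward the emotion's full
    setting, scaled by the intensity score in [0, 1]; unknown emotions
    fall back to neutral."""
    base = VOICE_PARAMS.get(emotion, VOICE_PARAMS["neutral"])
    neutral = VOICE_PARAMS["neutral"]
    return {k: neutral[k] + (base[k] - neutral[k]) * intensity for k in base}

print(params_for("tension", 0.5))
```

Interpolating by intensity means a mildly tense sentence gets only a slight speed-up, while a high-intensity one gets the emotion's full setting, which avoids jarring jumps between adjacent sentences.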
## Business Value
- Accessibility: Converts any text to audio for visually impaired users.
- Content Production: Publishers and educators can generate narrated versions of documents at near-zero marginal cost.
- Technical Signal: Demonstrates end-to-end ML system design, spanning data pipeline, model inference, GPU optimization, prompt engineering, and productionization via CLI + web app.