# Audiobook Pipeline
## The Problem
I love books but don't always have time to sit and read. Existing text-to-speech tools solve the "audio" part but miss something crucial: they sound robotic. A tense thriller passage gets the same flat monotone as a tender moment. The result is unlistenable for anything beyond a few minutes.
The question I asked: Can I build a pipeline that reads with emotion — adjusting tone, speed, and voice character based on what the text actually says?
## The Approach
I designed a 4-stage pipeline where each stage solves one piece of the puzzle:
- Stage 1 — Extract: pdfplumber pulls raw text from any PDF, handling multi-column layouts, headers, and page numbers
- Stage 2 — Clean: A local LLM (via Ollama) fixes OCR artifacts, removes garbled text, and normalizes formatting without altering meaning
- Stage 3 — Emotion Annotate: Each sentence gets classified into one of 8 emotions (neutral, joy, sadness, anger, fear, surprise, tension, contemplative) with intensity scores, using batched Ollama inference
- Stage 4 — Synthesize: Kokoro-82M generates audio with emotion-aware voice blending — adjusting speaker, speed, and prosody per emotion. Runs on Apple Silicon MPS for GPU-accelerated inference
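Stage 3's batched classification can be sketched as a prompt builder plus a strict JSON parser. The prompt format, helper names, and fallback behavior below are illustrative assumptions, not the project's actual code, and the canned reply string stands in for a real Ollama call:

```python
import json

# The 8 emotion labels from Stage 3.
EMOTIONS = {"neutral", "joy", "sadness", "anger", "fear",
            "surprise", "tension", "contemplative"}

def build_prompt(sentences):
    """Build one batched classification prompt for the local LLM."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    return (
        f"Classify each sentence into one of {sorted(EMOTIONS)} with an "
        "intensity in [0, 1]. Reply as a JSON list of objects with keys "
        "id, emotion, intensity.\n" + numbered
    )

def parse_response(raw, sentences):
    """Validate the model's JSON reply and pair it back with the sentences,
    clamping intensity and falling back to neutral on unknown labels."""
    out = []
    for item in json.loads(raw):
        emotion = item["emotion"] if item["emotion"] in EMOTIONS else "neutral"
        intensity = min(max(float(item["intensity"]), 0.0), 1.0)
        out.append((sentences[item["id"]], emotion, intensity))
    return out

# A canned reply stands in for a real call to the Ollama HTTP API:
reply = '[{"id": 0, "emotion": "tension", "intensity": 0.8}]'
print(parse_response(reply, ["The door creaked open."]))
# → [('The door creaked open.', 'tension', 0.8)]
```

Validating and clamping the model's reply matters in practice: local LLMs occasionally emit labels outside the schema, and a neutral fallback keeps the synthesis stage from crashing mid-book.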
## Key Results
- Sentence-level emotion classification with JSON checkpoint/resume for long books
- Tunable voice parameters per emotion (speed, speaker blend, silence padding)
- Karpathy-style autoresearch framework for autonomous optimization of voice-emotion mappings
- Full Streamlit GUI: upload PDF, watch progress, play audio, see color-coded emotion timeline
- CLI with subcommands for each pipeline stage and full end-to-end runs
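The "tunable voice parameters per emotion" can be sketched as a lookup table scaled by the classifier's intensity score. The parameter names and values below are illustrative assumptions, not the project's actual tuning:

```python
# Hypothetical per-emotion synthesis settings (speed multiplier, speaker
# blend weight, trailing silence in seconds); values are illustrative.
VOICE_PARAMS = {
    "neutral":       {"speed": 1.00, "blend": 0.0, "pad": 0.15},
    "tension":       {"speed": 1.08, "blend": 0.4, "pad": 0.05},
    "sadness":       {"speed": 0.92, "blend": 0.3, "pad": 0.30},
    "contemplative": {"speed": 0.95, "blend": 0.2, "pad": 0.25},
}

def params_for(emotion, intensity):
    """Interpolate from the neutral baseline toward the emotion's full
    setting, scaled by the intensity score in [0, 1]; unknown emotions
    fall back to neutral."""
    base = VOICE_PARAMS.get(emotion, VOICE_PARAMS["neutral"])
    neutral = VOICE_PARAMS["neutral"]
    return {k: neutral[k] + (base[k] - neutral[k]) * intensity for k in base}

print(params_for("tension", 0.5))
```

Interpolating by intensity means a mildly tense sentence gets only a slight speed-up, while a high-intensity one gets the emotion's full setting, which avoids jarring jumps between adjacent sentences.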
## Business Value
- Accessibility: Converts any text to audio for visually impaired users.
- Content Production: Publishers and educators can generate narrated versions of documents at near-zero marginal cost.
- Technical Signal: Demonstrates end-to-end ML system design, spanning data pipeline, model inference, GPU optimization, prompt engineering, and productionization via CLI + web app.