PDF Chapter Extractor

The PDF Chapter Extractor automates the tedious task of splitting large PDF files into logical, searchable chapters.

Technical Highlights

Multi-Method Detection: Implements a layered detection strategy:
- AI-Powered Discovery: Uses gemini-3-flash to analyze document structure and identify non-standard chapter boundaries when the document has no table of contents (TOC).
- TOC Parsing: Reads the TOC embedded in the PDF directly, which gives the most accurate chapter boundaries when it is present.
- Manual Precision: Supports custom page ranges, with offset handling for documents whose printed page numbers differ from their physical page positions.
Cross-Platform GUI: Built on Tkinter, so it runs wherever Python and Tk are available, and tuned for macOS. Provides real-time logging and interactive tree-view selection.
Optimized Output: Uses PyMuPDF's garbage collection and cleaning options when saving, which remove unused objects so that extracted chapter files stay compact and load quickly.
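
The TOC-based detection and page offset handling listed above can be illustrated with a minimal sketch. It is not the project's actual code: the names Chapter, chapters_from_toc, and printed_to_physical are hypothetical, and the input is assumed to be a list of [level, title, page] entries with 1-based page numbers, which is the shape PyMuPDF's Document.get_toc() returns.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Chapter(NamedTuple):
    """A chapter as a 1-based, inclusive range of physical pages (hypothetical type)."""
    title: str
    start: int
    end: int


def chapters_from_toc(toc: Sequence[Sequence], page_count: int,
                      level: int = 1) -> List[Chapter]:
    """Turn TOC entries of the form [level, title, page] into page ranges.

    Each chapter runs from its own start page to the page before the next
    chapter at the same level; the last chapter runs to the end of the file.
    """
    # Keep only entries at the requested level that point at a real page.
    starts = [(title, page) for lvl, title, page in toc
              if lvl == level and 1 <= page <= page_count]
    starts.sort(key=lambda item: item[1])

    chapters = []
    for i, (title, start) in enumerate(starts):
        next_start = starts[i + 1][1] if i + 1 < len(starts) else page_count + 1
        # Guard against two chapters starting on the same page.
        end = max(start, next_start - 1)
        chapters.append(Chapter(title, start, end))
    return chapters


def printed_to_physical(first: int, last: int, offset: int,
                        page_count: int) -> Tuple[int, int]:
    """Map a manually entered printed page range to physical pages.

    `offset` is assumed to be the number of unnumbered front-matter pages,
    so printed page 1 sits at physical page 1 + offset.
    """
    start, end = first + offset, last + offset
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"range {start}-{end} is outside 1-{page_count}")
    return start, end
```

With this sketch, a document with 12 front-matter pages would map the printed range 1-10 to physical pages 13-22 via printed_to_physical(1, 10, offset=12, page_count=100).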

Core Capabilities

🌳 Interactive Tree View: Preview detected chapters and select exactly what you need.
🤖 Gemini 3 Integration: Uses gemini-3-flash to "read" documents and find transition points between sections.
📦 Docker Ready: Designed for consistent execution across environments with localized environment management.
⚡ Performance: Designed to handle large files (1000+ pages) with a small memory footprint.
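
To make the optimized-output highlight concrete, here is a rough sketch of how a single chapter could be written out with PyMuPDF. It is an assumption about the approach, not the project's implementation: chapter_filename and extract_chapter are hypothetical helpers, and the save options shown (garbage=4, deflate=True, clean=True) are one plausible combination of PyMuPDF's documented settings.

```python
import re


def chapter_filename(index: int, title: str) -> str:
    """Build a filesystem-safe output name for a chapter (hypothetical helper)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_").lower() or "chapter"
    return f"{index:02d}_{slug}.pdf"


def extract_chapter(src_path: str, start: int, end: int, out_path: str) -> None:
    """Copy physical pages start..end (1-based, inclusive) into a new PDF."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    src = fitz.open(src_path)
    try:
        out = fitz.open()  # new, empty PDF
        # insert_pdf uses 0-based page indices, hence the -1 adjustments.
        out.insert_pdf(src, from_page=start - 1, to_page=end - 1)
        # garbage=4 drops unused and duplicate objects, deflate compresses
        # streams, and clean sanitizes content streams before writing.
        out.save(out_path, garbage=4, deflate=True, clean=True)
        out.close()
    finally:
        src.close()
```

Only the page range being copied goes into the new document, so each chapter is saved independently of the rest of the source file.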