A command-line tool for extracting text from files (documents, images, archives) into a SQLite database, with OCR support.
Simple and extensible, implemented as a single shell script.

Features

  • Multi-format text extraction from 30+ file types including documents, spreadsheets, presentations, and archives
  • OCR (Optical Character Recognition) support for extracting text from images and scanned documents
  • Recursive archive processing - automatically extracts and processes files nested within ZIP, TAR, GZIP, and other archive formats
  • SQLite database integration - stores all extracted text in a searchable SQLite database for fast queries
  • Command-line interface - easy integration into scripts and automated workflows
  • Batch processing - process entire directories with a single command
  • Line-level granularity - extracts text with line numbers for precise referencing
  • Configurable OCR - supports multiple languages and quality settings
  • LibreOffice integration - uses headless LibreOffice for reliable document conversion
  • Lightweight - shell script implementation with minimal dependencies
  • Cross-platform - runs on Linux, macOS, and Windows (via WSL/Cygwin)
  • Transaction-safe database updates - uses SQL transactions for data integrity
  • Progress tracking - detailed output for monitoring extraction progress
  • Error handling - continues processing even if individual files fail
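
To make the line-level storage, transaction safety, and error-handling features above more concrete, here is a minimal sketch in Python. It is an illustration only, not the tool's own shell implementation: the `lines` table, its columns, and the `index_files` and `search` helpers are hypothetical names chosen for this example, and it reads plain UTF-8 text files rather than performing real document conversion or OCR.

```python
import sqlite3
from pathlib import Path


def index_files(db, paths):
    """Store each readable file line by line; return the paths that failed."""
    # Hypothetical schema: one row per extracted line of text.
    db.execute(
        "CREATE TABLE IF NOT EXISTS lines ("
        "path TEXT, line_no INTEGER, content TEXT)"
    )
    failed = []
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8")
            rows = [
                (str(path), line_no, line)
                for line_no, line in enumerate(text.splitlines(), start=1)
            ]
            # One transaction per file: committed on success,
            # rolled back if any insert raises.
            with db:
                db.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)
        except (OSError, UnicodeDecodeError, sqlite3.Error):
            # A failing file is recorded and skipped; processing continues.
            failed.append(str(path))
    return failed


def search(db, term):
    """Return (path, line_no, content) for every line containing term."""
    return db.execute(
        "SELECT path, line_no, content FROM lines "
        "WHERE content LIKE ? ORDER BY path, line_no",
        (f"%{term}%",),
    ).fetchall()
```

Calling `index_files(sqlite3.connect("text.db"), paths)` and then `search(db, "invoice")` would return each matching line together with its file path and line number, which is the kind of precise reference that line-level extraction makes possible. Wrapping each file in its own transaction means a file that fails partway through leaves no partial rows behind.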
