MarkItDown is a lightweight Python utility for converting various document formats to Markdown, specifically optimized for ingestion by Large Language Models (LLMs) and text analysis pipelines. The system prioritizes preserving document structure (headings, lists, tables, links) in Markdown format while maintaining token efficiency for LLM processing.
The conversion engine supports over 20 file formats including office documents (DOCX, XLSX, PPTX), PDFs, web content (HTML, RSS, YouTube), media files (images, audio, video), and specialized formats (MSG, ZIP, EPUB). The architecture employs a modular converter registry with priority-based selection, optional external service integrations (Azure Document Intelligence, Azure Content Understanding, LLM captioning), and a plugin system for third-party extensions.
For architectural details, see Architecture. For installation and setup instructions, see Installation and Setup. For CLI usage, see Command Line Interface.
Sources: README.md1-25 packages/markitdown/src/markitdown/__about__.py4
MarkItDown converts documents to Markdown because mainstream LLMs natively "speak" Markdown, having been trained on vast amounts of Markdown-formatted text README.md29-34. The format is token-efficient and preserves structural semantics (headings, lists, tables) while remaining close to plain text. Unlike high-fidelity conversion tools designed for human consumption, MarkItDown prioritizes programmatic text analysis over visual presentation README.md10.
The system distinguishes itself from alternatives like textract by focusing on structure preservation rather than raw text extraction README.md10. Output maintains hierarchical organization, table layouts, and link relationships, making it suitable for downstream semantic analysis, retrieval-augmented generation (RAG) pipelines, and document understanding workflows.
Sources: README.md10 README.md27-34
The repository is organized as a monorepo containing the core library and several specialized extension packages:
| Package | Purpose |
|---|---|
| markitdown | Core library and CLI tool. |
| markitdown-mcp | Model Context Protocol server for AI assistants. |
| markitdown-ocr | Plugin adding OCR support via LLM Vision to PDF/Office converters. |
| markitdown-sample-plugin | Reference implementation for third-party plugin development. |
Sources: README.md114-165 packages/markitdown/README.md1-6
The system implements a three-tier architecture separating user interfaces, core orchestration, and format-specific conversion logic.
Figure 1: Three-Tier Architecture showing separation between interfaces, orchestration, and conversion logic
Sources: README.md73-89 README.md146-156 packages/markitdown/README.md35-43
The system provides three entry points for different use cases:
| Interface | Entry Point | Use Case |
|---|---|---|
| CLI | markitdown command | Command-line batch processing, shell pipelines README.md73-89 |
| Python API | MarkItDown class instantiation | Programmatic integration, custom workflows packages/markitdown/README.md35-43 |
| MCP Server | markitdown-mcp package | AI assistant integration (Claude Desktop, etc.) README.md164 |
All interfaces converge on the MarkItDown class, which serves as the central orchestrator for conversion operations.
Sources: README.md73-89 packages/markitdown/README.md35-43
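As a minimal sketch of the Python API entry point, the helper below instantiates the MarkItDown class and calls its `convert()` method on a local path. The helper name `convert_file` is an illustrative assumption, not part of the library. The import is deferred so the snippet loads even when the `markitdown` package is not installed.

```python
from pathlib import Path


def convert_file(path: str) -> str:
    """Convert one local file to Markdown text via the MarkItDown class."""
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"No such file: {source}")

    # Deferred import: requires the markitdown package, plus whichever
    # optional feature groups the input format needs.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(str(source))
    # The conversion result exposes the Markdown output as `text_content`.
    return result.text_content
```

Calling `convert_file("report.docx")` (a placeholder path) would return the Markdown string for that document, provided the relevant optional dependencies are installed.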
The MarkItDown class implements the conversion orchestration logic:
Figure 2: Conversion orchestration flow through MarkItDown class methods
Key orchestration components:
- convert(): Entry point accepting paths, URIs, or streams.
- convert_stream(): Processes binary file-like objects.
- convert_local(): Core logic for converter selection and execution.
- StreamInfo: Metadata container used to route files to the correct converter.

Sources: README.md146-156 packages/markitdown/README.md35-43
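The routing idea behind these components can be sketched as a small priority-ordered registry. This is not MarkItDown's actual implementation: `StreamInfoSketch`, `Registration`, and `RegistrySketch` are hypothetical names, and the rule that lower priority values are tried first is an assumption made for the example.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Hypothetical stand-in for StreamInfo: metadata used for routing."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


@dataclass
class Registration:
    """A converter together with the priority it was registered with."""
    name: str
    priority: float
    accepts: Callable[[StreamInfoSketch], bool]
    convert: Callable[[bytes], str]


class RegistrySketch:
    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, registration: Registration) -> None:
        self._registrations.append(registration)

    def convert(self, data: bytes, info: StreamInfoSketch) -> str:
        # Assumption: lower priority values are tried first. sorted() is
        # stable, so equal priorities keep their registration order.
        for reg in sorted(self._registrations, key=lambda r: r.priority):
            if reg.accepts(info):
                return reg.convert(data)
        raise ValueError(f"No converter accepts {info}")


def csv_to_markdown(data: bytes) -> str:
    """Render CSV bytes as a Markdown table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join(["---"] * len(header)) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


registry = RegistrySketch()
# Generic fallback: accepts anything, so it gets a high (late) priority.
registry.register(Registration(
    name="plain-text fallback",
    priority=10.0,
    accepts=lambda info: True,
    convert=lambda data: data.decode("utf-8", errors="replace"),
))
# Format-specific converter: tried before the fallback.
registry.register(Registration(
    name="csv",
    priority=0.0,
    accepts=lambda info: info.extension == ".csv",
    convert=csv_to_markdown,
))
```

Although the fallback is registered first, a `.csv` input is still routed to the specific converter because selection depends on priority, not registration order.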
The converter ecosystem supports multiple format categories, many of which require optional dependencies:
| Category | File Types | Feature Group |
|---|---|---|
| Office Documents | .docx, .pptx, .xlsx, .xls | [docx], [pptx], [xlsx], [xls] |
| PDF Documents | .pdf | [pdf] |
| Cloud Services | Azure Document Intelligence, Azure Content Understanding | [az-doc-intel], [az-content-understanding] |
| Media Files | Images (.jpg, .png), Audio (.wav, .mp3), YouTube | [audio-transcription], [youtube-transcription] |
| Messaging | .msg (Outlook) | [outlook] |
Sources: README.md12-25 README.md100-112
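Because many converters depend on optional packages, it can help to check which ones are importable before converting. The sketch below does that with the standard library only. The mapping from feature group to module names is an assumption for illustration and should be verified against the project's packaging metadata; `missing_modules` is a hypothetical helper, not part of MarkItDown.

```python
from importlib.util import find_spec
from typing import Dict, List

# ASSUMPTION: illustrative mapping of feature groups to top-level modules.
# Verify these names against the package's declared optional dependencies.
ASSUMED_FEATURE_MODULES: Dict[str, List[str]] = {
    "docx": ["mammoth"],
    "pptx": ["pptx"],
    "xlsx": ["pandas", "openpyxl"],
    "xls": ["pandas", "xlrd"],
    "pdf": ["pdfminer"],
    "outlook": ["olefile"],
}


def missing_modules(feature_modules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per feature group, the modules that cannot be imported."""
    missing: Dict[str, List[str]] = {}
    for feature, modules in feature_modules.items():
        # find_spec returns None when a top-level module is not installed.
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[feature] = absent
    return missing
```

Calling `missing_modules(ASSUMED_FEATURE_MODULES)` returns an empty dictionary when every listed module is present. Any feature group it reports can be installed through the corresponding extra, for example `pip install 'markitdown[pdf,docx]'`.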
MarkItDown integrates with optional external services for enhanced conversion capabilities:
Figure 3: External service integration architecture
| Service | Configuration Parameters | Use Case |
|---|---|---|
| LLM API | llm_client, llm_model | Image captioning and OCR for images/PDFs README.md146-153 |
| Azure Content Understanding | cu_endpoint, cu_analyzer_id | Multi-modal support (video, audio) and structured YAML front matter README.md162-175 |
| Azure Document Intelligence | docintel_endpoint | High-quality document layout analysis. |
Sources: README.md130-175
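The configuration parameters in the table are passed to the MarkItDown constructor. The sketch below gathers them into keyword arguments. The environment variable name `DOCINTEL_ENDPOINT` and both helper functions are assumptions made for illustration. The Content Understanding parameters are left out because this sketch only uses the LLM and Document Intelligence options.

```python
from typing import Any, Dict, Mapping, Optional


def build_markitdown_kwargs(
    env: Mapping[str, str],
    llm_client: Optional[Any] = None,
    llm_model: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect optional service settings into MarkItDown constructor kwargs."""
    kwargs: Dict[str, Any] = {}

    # Hypothetical variable name; use whatever your deployment defines.
    endpoint = env.get("DOCINTEL_ENDPOINT")
    if endpoint:
        kwargs["docintel_endpoint"] = endpoint

    # LLM captioning needs both a client object and a model name.
    if llm_client is not None and llm_model:
        kwargs["llm_client"] = llm_client
        kwargs["llm_model"] = llm_model

    return kwargs


def make_converter(**kwargs: Any):
    """Instantiate MarkItDown with the collected settings."""
    # Deferred import so the kwargs helper works without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown(**kwargs)
```

A typical call would be `make_converter(**build_markitdown_kwargs(os.environ, llm_client=client, llm_model="gpt-4o"))`, where `client` is an already configured LLM client such as an OpenAI client instance and `os` has been imported.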