Overview


MarkItDown is a lightweight Python utility for converting various document formats to Markdown, specifically optimized for ingestion by Large Language Models (LLMs) and text analysis pipelines. The system prioritizes preserving document structure (headings, lists, tables, links) in Markdown format while maintaining token efficiency for LLM processing.

The conversion engine supports over 20 file formats including office documents (DOCX, XLSX, PPTX), PDFs, web content (HTML, RSS, YouTube), media files (images, audio, video), and specialized formats (MSG, ZIP, EPUB). The architecture employs a modular converter registry with priority-based selection, optional external service integrations (Azure Document Intelligence, Azure Content Understanding, LLM captioning), and a plugin system for third-party extensions.

For architectural details, see Architecture. For installation and setup instructions, see Installation and Setup. For CLI usage, see Command Line Interface.

Sources: README.md:1-25, packages/markitdown/src/markitdown/__about__.py:4

System Purpose and Design Philosophy

MarkItDown converts documents to Markdown because mainstream LLMs natively "speak" Markdown, having been trained on vast amounts of Markdown-formatted text (README.md:29-34). The format is token-efficient and preserves structural semantics (headings, lists, tables) while remaining close to plain text. Unlike high-fidelity conversion tools designed for human consumption, MarkItDown prioritizes programmatic text analysis over visual presentation (README.md:10).

The system distinguishes itself from alternatives like textract by focusing on structure preservation rather than raw text extraction (README.md:10). Output maintains hierarchical organization, table layouts, and link relationships, making it suitable for downstream semantic analysis, retrieval-augmented generation (RAG) pipelines, and document understanding workflows.

Sources: README.md:10, README.md:27-34

Monorepo Structure

The repository is organized as a monorepo containing the core library and several specialized extension packages:

Package	Purpose
`markitdown`	Core library and CLI tool.
`markitdown-mcp`	Model Context Protocol server for AI assistants.
`markitdown-ocr`	Plugin adding OCR support via LLM Vision to PDF/Office converters.
`markitdown-sample-plugin`	Reference implementation for third-party plugin development.

Sources: README.md:114-165, packages/markitdown/README.md:1-6

Three-Tier Architecture

The system implements a three-tier architecture separating user interfaces, core orchestration, and format-specific conversion logic.

Architecture Overview Diagram

Figure 1: Three-Tier Architecture showing separation between interfaces, orchestration, and conversion logic

Sources: README.md:73-89, README.md:146-156, packages/markitdown/README.md:35-43

User Interface Layer

The system provides three entry points for different use cases:

Interface	Entry Point	Use Case
CLI	`markitdown` command	Command-line batch processing, shell pipelines (README.md:73-89)
Python API	`MarkItDown` class instantiation	Programmatic integration, custom workflows (packages/markitdown/README.md:35-43)
MCP Server	`markitdown-mcp` package	AI assistant integration (Claude Desktop, etc.) (README.md:164)

All interfaces converge on the MarkItDown class, which serves as the central orchestrator for conversion operations.

Sources: README.md:73-89, packages/markitdown/README.md:35-43
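
A minimal sketch of the Python API entry point, assuming the `markitdown` package is installed. The `MarkItDown` class, its `convert()` method, and the `text_content` attribute follow the project's README usage; the `to_markdown` helper and its `converter` parameter are hypothetical names introduced here so that the call can also be exercised with a stand-in object.

```python
from typing import Any, Optional


def to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert `source` to Markdown text through a MarkItDown-like object.

    `to_markdown` and `converter` are illustrative names, not library API.
    """
    if converter is None:
        # Imported lazily so this helper loads even without the package.
        from markitdown import MarkItDown

        converter = MarkItDown(enable_plugins=False)  # third-party plugins off
    result = converter.convert(source)
    return result.text_content
```

Calling `to_markdown("report.docx")` would then return the converted Markdown as a string, provided the optional dependencies for that format are installed.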

Core Orchestration Layer

The MarkItDown class implements the conversion orchestration logic:

Figure 2: Conversion orchestration flow through MarkItDown class methods

Key orchestration components:

`convert()`: Entry point accepting paths, URIs, or streams.
`convert_stream()`: Processes binary file-like objects.
`convert_local()`: Converts a file at a local path, opening it and passing the stream on for converter selection and execution.
`StreamInfo`: Metadata container used to route files to the correct converter.

Sources: README.md:146-156, packages/markitdown/README.md:35-43
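
The routing idea described above (stream metadata plus priority-based converter selection) can be illustrated with a small standalone sketch. It is not the library's implementation: `StreamInfoSketch`, `Registration`, `RegistrySketch`, and `build_registry` are hypothetical names, and the convention that lower priority values are tried first is an assumption of this sketch.

```python
import csv
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, List, Optional


@dataclass(frozen=True)
class StreamInfoSketch:
    """Hypothetical routing metadata (a stand-in, not the library's StreamInfo)."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


@dataclass
class Registration:
    priority: float  # assumption of this sketch: lower values are tried first
    accepts: Callable[[StreamInfoSketch], bool]
    convert: Callable[[BinaryIO], str]


class RegistrySketch:
    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, registration: Registration) -> None:
        self._registrations.append(registration)

    def convert_stream(self, stream: BinaryIO, info: StreamInfoSketch) -> str:
        # sorted() is stable, so equal priorities keep registration order.
        for reg in sorted(self._registrations, key=lambda r: r.priority):
            if reg.accepts(info):
                return reg.convert(stream)
        raise ValueError(f"no converter accepts {info!r}")


def plain_text(stream: BinaryIO) -> str:
    """Generic fallback: decode the bytes as UTF-8 text."""
    return stream.read().decode("utf-8")


def csv_to_markdown(stream: BinaryIO) -> str:
    """Format-specific converter: render CSV rows as a Markdown table."""
    rows = list(csv.reader(stream.read().decode("utf-8").splitlines()))
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def build_registry() -> RegistrySketch:
    registry = RegistrySketch()
    # The generic converter is registered first but has the weaker priority.
    registry.register(Registration(
        10.0, lambda i: (i.mimetype or "").startswith("text/"), plain_text))
    registry.register(Registration(
        0.0, lambda i: i.extension == ".csv", csv_to_markdown))
    return registry
```

A `text/csv` stream is accepted by both converters here; the priority value, not the registration order, decides that the CSV converter handles it.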

Format Coverage and Dependencies

The converter ecosystem supports multiple format categories, many of which require optional dependencies:

Category	File Types	Feature Group
Office Documents	`.docx`, `.pptx`, `.xlsx`, `.xls`	`[docx]`, `[pptx]`, `[xlsx]`, `[xls]`
PDF Documents	`.pdf`	`[pdf]`
Cloud Services	Azure Document Intelligence, Azure Content Understanding	`[az-doc-intel]`, `[az-content-understanding]`
Media Files	Images (`.jpg`, `.png`), Audio (`.wav`, `.mp3`), YouTube	`[audio-transcription]`, `[youtube-transcription]`
Messaging	`.msg` (Outlook)	`[outlook]`

Sources: README.md:12-25, README.md:100-112
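
To make the relationship between file types and feature groups concrete, the following sketch derives a pip requirement string from a list of input files. The extension-to-group mapping is copied from the table above; `FEATURE_GROUPS`, `required_groups`, and `pip_requirement` are hypothetical helper names, not part of MarkItDown.

```python
from pathlib import PurePath
from typing import Iterable, List

# Extension-to-feature-group mapping, copied from the table above.
FEATURE_GROUPS = {
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".pdf": "pdf",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def required_groups(paths: Iterable[str]) -> List[str]:
    """Return the sorted feature groups needed for the given file names."""
    groups = set()
    for path in paths:
        ext = PurePath(path).suffix.lower()  # match extensions case-insensitively
        if ext in FEATURE_GROUPS:
            groups.add(FEATURE_GROUPS[ext])
    return sorted(groups)


def pip_requirement(paths: Iterable[str]) -> str:
    """Build a requirement string such as 'markitdown[docx,pdf]'."""
    groups = required_groups(paths)
    return f"markitdown[{','.join(groups)}]" if groups else "markitdown"
```

For example, `pip_requirement(["a.pdf", "b.docx"])` yields `markitdown[docx,pdf]`, which can be passed to `pip install` in quotes.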

External Service Integration

MarkItDown integrates with optional external services for enhanced conversion capabilities:

Figure 3: External service integration architecture

Integration Modes

Service	Configuration Parameters	Use Case
LLM API	`llm_client`, `llm_model`	Image captioning and OCR for images/PDFs (README.md:146-153)
Azure Content Understanding	`cu_endpoint`, `cu_analyzer_id`	Multi-modal support (video, audio) and structured YAML front matter (README.md:162-175)
Azure Document Intelligence	`docintel_endpoint`	High-quality document layout analysis.

Sources: README.md:130-175
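
Because every integration is optional, one way to wire them up is to collect only the parameters that are actually configured. The sketch below does this from environment-style settings. The keyword names are taken from the Integration Modes table; the `MARKITDOWN_*` variable names and the `service_kwargs` helper are hypothetical, and whether a given keyword is accepted depends on the installed MarkItDown version.

```python
from typing import Any, Dict, Mapping, Optional


def service_kwargs(env: Mapping[str, str],
                   llm_client: Optional[Any] = None) -> Dict[str, Any]:
    """Collect constructor arguments for the configured optional services.

    Keyword names come from the table above; the environment variable
    names are made up for this sketch.
    """
    kwargs: Dict[str, Any] = {}

    # LLM captioning needs both a client object and a model name.
    if llm_client is not None and env.get("MARKITDOWN_LLM_MODEL"):
        kwargs["llm_client"] = llm_client
        kwargs["llm_model"] = env["MARKITDOWN_LLM_MODEL"]

    # Azure Document Intelligence only needs an endpoint.
    if env.get("MARKITDOWN_DOCINTEL_ENDPOINT"):
        kwargs["docintel_endpoint"] = env["MARKITDOWN_DOCINTEL_ENDPOINT"]

    # Azure Content Understanding: endpoint, plus an optional analyzer id.
    if env.get("MARKITDOWN_CU_ENDPOINT"):
        kwargs["cu_endpoint"] = env["MARKITDOWN_CU_ENDPOINT"]
        if env.get("MARKITDOWN_CU_ANALYZER_ID"):
            kwargs["cu_analyzer_id"] = env["MARKITDOWN_CU_ANALYZER_ID"]

    return kwargs
```

The result could then be passed along as `MarkItDown(**service_kwargs(os.environ, llm_client=client))`; with nothing configured the dictionary is empty and the converter runs without external services.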