This document covers the converters that process web-based content formats in MarkItDown. These converters handle HTML pages, RSS/Atom feeds, EPUB books, Wikipedia articles, YouTube videos, and Bing search results, transforming them into Markdown. All web content converters leverage BeautifulSoup for HTML parsing and a customized version of the markdownify library for HTML-to-Markdown conversion.
For PDF documents, see PDF Converter. For media files (images, audio), see Media Converters. For the base converter interface, see DocumentConverter Interface.
MarkItDown includes several specialized web content converters:
| Converter | File Path | Accepted Formats | Dependencies |
|---|---|---|---|
| HtmlConverter | converters/_html_converter.py | .html, .htm, text/html, application/xhtml | BeautifulSoup |
| RssConverter | converters/_rss_converter.py | .rss, .atom, .xml (if RSS/Atom), RSS/Atom MIME types | defusedxml, BeautifulSoup |
| YouTubeConverter | converters/_youtube_converter.py | YouTube URLs (https://www.youtube.com/watch?) | BeautifulSoup, youtube-transcript-api (optional) |
| EpubConverter | converters/_epub_converter.py | .epub, application/epub+zip | zipfile, defusedxml, HtmlConverter |
| WikipediaConverter | converters/_wikipedia_converter.py | Wikipedia URLs (*.wikipedia.org/) | BeautifulSoup |
| BingSerpConverter | converters/_bing_serp_converter.py | Bing Search URLs (bing.com/search?q=) | BeautifulSoup |
All converters extend the DocumentConverter base class and produce DocumentConverterResult objects containing Markdown output.
The following diagram illustrates the selection logic within the accepts() methods of various web converters.
Diagram: Web Converter Acceptance Logic
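As a rough sketch of the acceptance rules summarized in the table above, the function below maps a URL, file extension, or MIME type to the converter that would plausibly accept it. The function name `pick_converter`, the check order, and the Wikipedia regular expression are assumptions made for illustration. In MarkItDown each converter decides for itself in its own accepts() method, and this sketch ignores content sniffing.

```python
import re


def pick_converter(url="", extension="", mimetype=""):
    """Return the name of the converter that would plausibly accept the input.

    Hypothetical helper: it collapses the per-converter accepts() checks
    into one function purely for illustration.
    """
    extension = extension.lower()
    mimetype = mimetype.lower()

    # URL-specific converters are checked first (the ordering is an assumption).
    if url.startswith("https://www.youtube.com/watch?"):
        return "YouTubeConverter"
    if re.match(r"^https?://[a-z0-9-]+\.wikipedia\.org/", url, re.IGNORECASE):
        return "WikipediaConverter"
    if url.startswith("https://www.bing.com/search?q="):
        return "BingSerpConverter"

    # Format-specific converters fall back to extension or MIME type.
    if extension == ".epub" or mimetype.startswith("application/epub"):
        return "EpubConverter"
    if extension in (".rss", ".atom") or mimetype.startswith(
        ("application/rss", "application/atom")
    ):
        return "RssConverter"
    if extension in (".html", ".htm") or mimetype.startswith(
        ("text/html", "application/xhtml")
    ):
        return "HtmlConverter"
    return None
```

Generic .xml input is deliberately left out here, because deciding whether it is a feed requires parsing the content.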
The HtmlConverter class (packages/markitdown/src/markitdown/converters/_html_converter.py21-91) is the foundational web content converter. It accepts HTML content identified by MIME type or file extension and converts it to Markdown using BeautifulSoup and _CustomMarkdownify.
Acceptance Criteria (_html_converter.py23-40):
- Extensions: .html, .htm
- MIME types: text/html, application/xhtml

Conversion Process (_html_converter.py42-91):
The converter parses the HTML stream using BeautifulSoup with the charset from StreamInfo (defaulting to UTF-8) (_html_converter.py53-54). It removes <script> and <style> tags (_html_converter.py57-58) and extracts the <body> element if present (_html_converter.py61). The final conversion is handled by _CustomMarkdownify.
If a RecursionError occurs (common in deeply nested HTML), the converter falls back to iterative plain-text extraction via BeautifulSoup.get_text() unless the strict parameter is set to True (_html_converter.py68-81).
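The fallback behaviour can be sketched without any third-party dependency. The example below is not the MarkItDown implementation: it swaps BeautifulSoup for the standard library's `html.parser`, and the names `plain_text`, `html_to_markdown_with_fallback`, and the `to_markdown` callable are hypothetical. It only illustrates the pattern of trying a recursive conversion first and dropping to iterative text extraction when a RecursionError occurs, unless `strict` is set.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text iteratively, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def plain_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def html_to_markdown_with_fallback(html, to_markdown, strict=False):
    """Try the (possibly recursive) converter; fall back to plain text.

    `to_markdown` is a stand-in for the real HTML-to-Markdown step.
    """
    try:
        return to_markdown(html)
    except RecursionError:
        if strict:
            raise  # strict mode surfaces the error instead of degrading
        return plain_text(html)
```

The trade-off is that the fallback loses all Markdown structure, which is why a strict mode that raises instead can be preferable in pipelines that must not silently degrade.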
Convenience Method:
HtmlConverter.convert_string() (_html_converter.py93-110) converts an HTML string directly. Other converters use it internally, for example EpubConverter (_epub_converter.py108-115).
The RssConverter class (packages/markitdown/src/markitdown/converters/_rss_converter.py29-192) handles RSS 2.0 and Atom feed formats. It parses XML feed structures and converts them to hierarchical Markdown documents.
Acceptance Criteria:
The converter uses a two-tier strategy:
1. .rss and .atom extensions, or specific RSS/Atom MIME types.
2. .xml files or generic XML MIME types. For these, _check_xml() (_rss_converter.py63-72) parses the XML to find <rss> or <feed> tags.

The _feed_type() method (_rss_converter.py74-82) distinguishes between RSS and Atom. RSS feeds are parsed via _parse_rss_type() (_rss_converter.py133-168), which extracts <channel> metadata and <item> entries. Atom feeds are parsed via _parse_atom_type() (_rss_converter.py101-131), which extracts <feed> metadata and <entry> elements.
HTML-styled content within feed items is converted to Markdown using _CustomMarkdownify via the _parse_content() helper (_rss_converter.py170-177).
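The feed-type check can be illustrated with a small sketch. It assumes the standard library's `xml.dom.minidom` in place of defusedxml (which should be preferred for untrusted input), and the function name `feed_type` is hypothetical. It approximates the idea of looking for an <rss> or <feed> element rather than reproducing the actual method.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def feed_type(xml_text):
    """Return "rss", "atom", or None for the given XML text.

    Illustrative only: the real converter parses with defusedxml.
    """
    try:
        doc = minidom.parseString(xml_text)
    except ExpatError:
        return None  # not well-formed XML, so not a feed
    if doc.getElementsByTagName("rss"):
        return "rss"
    if doc.getElementsByTagName("feed"):
        return "atom"
    return None
```

For example, `feed_type("<rss version='2.0'><channel/></rss>")` yields `"rss"`, while an arbitrary XML document with neither element yields `None`.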
The YouTubeConverter class (packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239) extracts video title, description, and transcript from YouTube watch pages.
Acceptance Criteria (_youtube_converter.py40-68):
It accepts URLs starting with https://www.youtube.com/watch?.
Metadata and Transcript Extraction:
- Metadata: The converter reads page metadata from <meta> tags (_youtube_converter.py80-96) and searches ytInitialData in scripts (_youtube_converter.py99-116) for the description.
- Transcript: If youtube_transcript_api is installed, it fetches the transcript using the video ID (_youtube_converter.py147-189). It supports retry logic (_youtube_converter.py164-170) and automatic translation (_youtube_converter.py182-187).
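The URL check and the video ID lookup that the transcript step relies on can be sketched with the standard library alone. The function names `accepts_youtube_url` and `extract_video_id` are hypothetical, and reading the ID from the `v` query parameter is the usual convention for watch URLs rather than a detail confirmed here.

```python
from urllib.parse import parse_qs, urlparse


def accepts_youtube_url(url):
    """Return True for watch-page URLs with the prefix described above."""
    return url.startswith("https://www.youtube.com/watch?")


def extract_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    params = parse_qs(urlparse(url).query)
    values = params.get("v")
    return values[0] if values else None
```

A transcript fetcher would then be called with the returned ID only when it is not `None`.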
The WikipediaConverter (packages/markitdown/src/markitdown/converters/_wikipedia_converter.py20-88) provides cleaner output for Wikipedia pages by isolating the main document content.
- Acceptance: Wikipedia URLs such as https://en.wikipedia.org/wiki/... (_wikipedia_converter.py37-39).
- Extraction: The converter targets the div with ID mw-content-text (_wikipedia_converter.py66) and takes the page title from span.mw-page-title-main (_wikipedia_converter.py67).

The BingSerpConverter (packages/markitdown/src/markitdown/converters/_bing_serp_converter.py23-121) extracts organic search results from Bing Search Engine Results Pages (SERPs).
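- Acceptance: URLs starting with https://www.bing.com/search?q= (_bing_serp_converter.py43-45).
- Result selection: Organic results are located by the class b_algo (_bing_serp_converter.py83).
- Link decoding: Target URLs are decoded from the u parameter of links, which are base64-encoded (_bing_serp_converter.py88-106).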
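Decoding a redirect link of this kind might look like the sketch below. The function name `decode_bing_redirect`, the URL-safe base64 alphabet, and the two-character prefix skipped before the payload (`prefix_len`) are assumptions for illustration, so treat the details as unverified against the converter itself.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_bing_redirect(href, prefix_len=2):
    """Return the target URL hidden in the `u` parameter, or href unchanged.

    Assumes the payload is URL-safe base64 preceded by a short marker
    of `prefix_len` characters.
    """
    query = parse_qs(urlparse(href).query)
    if "u" not in query:
        return href
    payload = query["u"][0][prefix_len:].strip()
    payload += "=" * (-len(payload) % 4)  # restore stripped padding
    try:
        return base64.urlsafe_b64decode(payload).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return href  # leave the link untouched if decoding fails


# Round-trip demonstration with a made-up redirect link.
_target = "https://example.com/page"
_encoded = base64.urlsafe_b64encode(_target.encode("utf-8")).decode("ascii").rstrip("=")
sample_href = "https://www.bing.com/ck/a?u=a1" + _encoded + "&ntb=1"
decoded = decode_bing_redirect(sample_href)
```

Returning the original href on failure keeps the Markdown output usable even when a link does not follow the assumed encoding.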
The EpubConverter class (packages/markitdown/src/markitdown/converters/_epub_converter.py26-147) processes EPUB files by extracting their internal HTML structure.
Conversion Process (_epub_converter.py53-131):
1. Metadata: Parses content.opf (_epub_converter.py69) to extract Dublin Core metadata (title, authors, etc.) using minidom.
2. Reading order: Determines the order of content files from the <spine> element and its <itemref> children (_epub_converter.py87-88).
3. Content conversion: Uses an HtmlConverter instance to convert each XHTML/HTML file to Markdown (_epub_converter.py102-116).
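The reading-order step above can be sketched as follows. The example uses the standard library's `xml.dom.minidom` instead of defusedxml, and the function name `spine_order` plus the sample OPF document are invented for illustration. Resolving each `idref` through the <manifest> is standard EPUB package structure, although the converter's exact lookup is not shown here.

```python
from xml.dom import minidom

# Made-up package document: the manifest lists files, the spine orders them.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf">
  <manifest>
    <item id="c2" href="ch2.xhtml" media-type="application/xhtml+xml"/>
    <item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="c1"/>
    <itemref idref="c2"/>
  </spine>
</package>"""


def spine_order(opf_xml):
    """Return content file paths in the reading order given by <spine>."""
    doc = minidom.parseString(opf_xml)
    # Map manifest ids to hrefs so each spine entry can be resolved.
    manifest = {
        item.getAttribute("id"): item.getAttribute("href")
        for item in doc.getElementsByTagName("item")
    }
    spines = doc.getElementsByTagName("spine")
    if not spines:
        return []
    return [
        manifest[ref.getAttribute("idref")]
        for ref in spines[0].getElementsByTagName("itemref")
        if ref.getAttribute("idref") in manifest
    ]


reading_order = spine_order(SAMPLE_OPF)
```

Each path in `reading_order` would then be read from the archive and passed to the HTML conversion step in sequence.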
The _CustomMarkdownify class (packages/markitdown/src/markitdown/converters/_markdownify.py8-127) extends markdownify.MarkdownConverter to enforce specific formatting rules for MarkItDown.
Key Customizations:
- Headings: Uses ATX-style headings (#, ##) by default (_markdownify.py19).
- Images: Truncates data: URIs in <img> tags to prevent Markdown bloat (_markdownify.py107-108).
- Checkboxes: Converts <input type="checkbox"> to [ ] or [x] (_markdownify.py112-123).
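Two of these rules are simple enough to sketch as standalone functions. The names `image_markdown` and `checkbox_markdown`, the `keep_data_uris` flag, and the exact truncation format are assumptions for illustration. The real class implements these rules as overrides on markdownify.MarkdownConverter rather than as free functions.

```python
def image_markdown(alt, src, keep_data_uris=False):
    """Render an image, shortening inline data: URIs unless asked to keep them."""
    if src.startswith("data:") and not keep_data_uris:
        # Keep only the media-type header so the output stays readable.
        src = src.split(",")[0] + "..."
    return f"![{alt}]({src})"


def checkbox_markdown(attrs):
    """Render an <input> element's attributes as a task-list marker.

    `attrs` is a plain dict of HTML attributes (hypothetical input shape).
    """
    if attrs.get("type") != "checkbox":
        return ""
    return "[x] " if "checked" in attrs else "[ ] "
```

With these definitions, a base64-encoded PNG collapses to a short placeholder, and a checked checkbox becomes `[x] ` at the start of its list item.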
Diagram: Mapping Web Concepts to Code Entities