Web Content Converters

This document covers the converters that process web-based content formats in MarkItDown: HTML pages, RSS/Atom feeds, EPUB books, Wikipedia articles, YouTube videos, and Bing search results, each of which is transformed into Markdown. The web content converters rely, directly or through HtmlConverter, on BeautifulSoup for HTML parsing and on a customized subclass of the markdownify converter for HTML-to-Markdown conversion.

For PDF documents, see PDF Converter. For media files (images, audio), see Media Converters. For the base converter interface, see DocumentConverter Interface.

Overview

MarkItDown includes several specialized web content converters:

Converter	File Path	Accepted Formats	Dependencies
`HtmlConverter`	`converters/_html_converter.py`	`.html`, `.htm`, `text/html`, `application/xhtml`	BeautifulSoup
`RssConverter`	`converters/_rss_converter.py`	`.rss`, `.atom`, `.xml` (if RSS/Atom), RSS/Atom MIME types	defusedxml, BeautifulSoup
`YouTubeConverter`	`converters/_youtube_converter.py`	YouTube URLs (`https://www.youtube.com/watch?`)	BeautifulSoup, youtube-transcript-api (optional)
`EpubConverter`	`converters/_epub_converter.py`	`.epub`, `application/epub+zip`	zipfile, defusedxml, HtmlConverter
`WikipediaConverter`	`converters/_wikipedia_converter.py`	Wikipedia URLs (`*.wikipedia.org/`)	BeautifulSoup
`BingSerpConverter`	`converters/_bing_serp_converter.py`	Bing Search URLs (`bing.com/search?q=`)	BeautifulSoup

All converters extend the DocumentConverter base class and produce DocumentConverterResult objects containing Markdown output.

Converter Selection and Priority

Each web converter implements its own selection logic in its accepts() method, as summarized in the following diagram.

[Diagram: Web Converter Acceptance Logic]

HTML Converter

HtmlConverter Class

The HtmlConverter class (packages/markitdown/src/markitdown/converters/_html_converter.py21-91) is the foundational web content converter. It accepts HTML content identified by MIME type or file extension and converts it to Markdown using BeautifulSoup and _CustomMarkdownify.

Acceptance Criteria (_html_converter.py23-40):

File extensions: .html, .htm
MIME types: text/html, application/xhtml
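
The acceptance criteria above can be sketched as a small predicate. This is a minimal illustration, not the code in _html_converter.py: the constant and function names are hypothetical, and matching MIME types by prefix is an assumption suggested by the partial type application/xhtml.

```python
# Hypothetical names; only the extensions and MIME values come from the
# acceptance criteria listed above.
ACCEPTED_EXTENSIONS = {".html", ".htm"}
ACCEPTED_MIME_PREFIXES = ("text/html", "application/xhtml")


def looks_like_html(extension=None, mimetype=None):
    """Return True if the extension or the MIME type marks the input as HTML."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    # Assumption: prefix matching, so "application/xhtml+xml" and
    # "text/html; charset=utf-8" are accepted as well.
    return mime.startswith(ACCEPTED_MIME_PREFIXES)
```

Checking the extension first keeps the common case cheap, and the MIME check covers streams that arrive without a file name.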

Conversion Process (_html_converter.py42-91):

The converter parses the HTML stream using BeautifulSoup with the charset from StreamInfo (defaulting to UTF-8) (_html_converter.py53-54). It removes <script> and <style> tags (_html_converter.py57-58) and extracts the <body> element if present (_html_converter.py61). The final conversion is handled by _CustomMarkdownify.

If a RecursionError occurs (common in deeply nested HTML), the converter falls back to iterative plain-text extraction via BeautifulSoup.get_text() unless the strict parameter is set to True (_html_converter.py68-81).
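
The fallback pattern described above might look roughly like the following sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup.get_text(), and the names convert_with_fallback and rich_convert are hypothetical; only the RecursionError handling and the strict switch are taken from the description.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes without recursion, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def convert_with_fallback(html, rich_convert, strict=False):
    """Try the rich conversion; on RecursionError fall back to plain text.

    rich_convert is a hypothetical callable standing in for the
    Markdown conversion step.
    """
    try:
        return rich_convert(html)
    except RecursionError:
        if strict:
            raise  # strict mode surfaces the error instead of degrading
        parser = _TextOnly()
        parser.feed(html)
        parser.close()  # flush any text still buffered by the parser
        return "\n".join(parser.parts)
```

The trade-off is that the fallback loses all Markdown structure, which is presumably why a strict mode exists for callers who prefer a failure over degraded output.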

Convenience Method:

HtmlConverter.convert_string() (_html_converter.py93-110) converts an HTML string directly, without requiring a file stream. Other converters, such as EpubConverter (_epub_converter.py108-115), use it internally.

Sources:

packages/markitdown/src/markitdown/converters/_html_converter.py21-110

RSS and Atom Feed Converter

RssConverter Class

The RssConverter class (packages/markitdown/src/markitdown/converters/_rss_converter.py29-192) handles RSS 2.0 and Atom feed formats. It parses XML feed structures and converts them to hierarchical Markdown documents.

Acceptance Criteria:

The converter uses a two-tier strategy:

Precise Indicators (_rss_converter.py10-17): .rss, .atom extensions or specific RSS/Atom MIME types.
Candidate Indicators with Content Inspection (_rss_converter.py19-26): .xml files or generic XML MIME types. For these, _check_xml() (_rss_converter.py63-72) parses the XML to find <rss> or <feed> tags.
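
A minimal sketch of the two-tier strategy follows. It is illustrative only: the names are hypothetical, MIME types are omitted for brevity, and it parses with xml.dom.minidom from the standard library, whereas the converter depends on defusedxml, which is the safer choice for untrusted input.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Illustrative indicator sets (extensions only).
PRECISE_EXTENSIONS = {".rss", ".atom"}
CANDIDATE_EXTENSIONS = {".xml"}


def is_feed_xml(data: bytes) -> bool:
    """Content inspection: does the XML contain an <rss> or <feed> element?"""
    try:
        doc = minidom.parseString(data)
    except ExpatError:
        return False  # not well-formed XML, so not a feed
    return bool(doc.getElementsByTagName("rss") or doc.getElementsByTagName("feed"))


def accepts_feed(extension: str, data: bytes) -> bool:
    """Tier 1 accepts on the extension alone; tier 2 also inspects content."""
    ext = extension.lower()
    if ext in PRECISE_EXTENSIONS:
        return True
    if ext in CANDIDATE_EXTENSIONS:
        return is_feed_xml(data)
    return False
```

Inspecting content only for the ambiguous candidates avoids parsing every file while still keeping generic XML documents away from the feed converter.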

Feed Type Detection and Parsing

The _feed_type() method (_rss_converter.py74-82) distinguishes between RSS and Atom. RSS feeds are parsed via _parse_rss_type() (_rss_converter.py133-168) which extracts <channel> metadata and <item> entries. Atom feeds are parsed via _parse_atom_type() (_rss_converter.py101-131) which extracts <feed> metadata and <entry> elements.

HTML-styled content within feed items is converted to Markdown using _CustomMarkdownify via the _parse_content() helper (_rss_converter.py170-177).
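
To make the shape of the output concrete, here is a rough sketch of feed type detection followed by RSS parsing. The function names are hypothetical, xml.dom.minidom stands in for defusedxml, and item descriptions are emitted as-is rather than passed through _CustomMarkdownify.

```python
from xml.dom import minidom


def feed_type(doc):
    """Classify a parsed document as 'rss', 'atom', or None."""
    if doc.getElementsByTagName("rss"):
        return "rss"
    if doc.getElementsByTagName("feed"):
        return "atom"
    return None


def _first_text(parent, tag):
    """Text of the first <tag> found under parent (in document order), or None."""
    nodes = parent.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return None


def rss_to_markdown(xml_text):
    """Render channel metadata as a top heading and each item as a subsection."""
    doc = minidom.parseString(xml_text)
    channels = doc.getElementsByTagName("channel")
    if feed_type(doc) != "rss" or not channels:
        raise ValueError("not an RSS document")
    channel = channels[0]
    lines = []
    # Assumes the channel's own <title> precedes its <item> elements.
    title = _first_text(channel, "title")
    if title:
        lines.append(f"# {title}")
    for item in channel.getElementsByTagName("item"):
        item_title = _first_text(item, "title")
        if item_title:
            lines.append(f"## {item_title}")
        description = _first_text(item, "description")
        if description:
            lines.append(description)
    return "\n\n".join(lines)
```

An Atom branch would follow the same pattern with <feed> and <entry> in place of <channel> and <item>.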

Sources:

packages/markitdown/src/markitdown/converters/_rss_converter.py29-192

YouTube Converter

YouTubeConverter Class

The YouTubeConverter class (packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239) extracts video title, description, and transcript from YouTube watch pages.

Acceptance Criteria (_youtube_converter.py40-68): It accepts URLs starting with https://www.youtube.com/watch?.

Metadata and Transcript Extraction:

Metadata: Reads <meta> tags (_youtube_converter.py80-96) and searches ytInitialData in scripts (_youtube_converter.py99-116) for the description.
Transcript: If youtube_transcript_api is installed, it fetches the transcript using the video ID (_youtube_converter.py147-189). It supports retry logic (_youtube_converter.py164-170) and automatic translation (_youtube_converter.py182-187).
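
Fetching a transcript requires the video ID, which is carried in the v query parameter of a watch URL. The sketch below shows one way to combine the URL prefix check with ID extraction using only the standard library; the function name is hypothetical and this is not necessarily how _youtube_converter.py does it.

```python
from urllib.parse import parse_qs, urlparse

WATCH_PREFIX = "https://www.youtube.com/watch?"


def video_id_from_url(url):
    """Return the video ID from a YouTube watch URL, or None if not applicable."""
    if not url.startswith(WATCH_PREFIX):
        return None  # not a URL this converter would accept
    params = parse_qs(urlparse(url).query)
    ids = params.get("v")
    return ids[0] if ids else None
```

The returned ID is what would then be handed to the optional youtube_transcript_api dependency.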

Sources:

packages/markitdown/src/markitdown/converters/_youtube_converter.py37-239

Specialized Web Converters

WikipediaConverter

The WikipediaConverter (packages/markitdown/src/markitdown/converters/_wikipedia_converter.py20-88) provides cleaner output for Wikipedia pages by isolating the main document content.

Acceptance: Matches URLs like https://en.wikipedia.org/wiki/... (_wikipedia_converter.py37-39).
Extraction: It specifically targets the div with ID mw-content-text (_wikipedia_converter.py66) and the page title from span.mw-page-title-main (_wikipedia_converter.py67).
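
As an illustration of the acceptance rule, a URL check of this kind can be written as an anchored regular expression. The pattern and function name below are assumptions made for this sketch; the expression in _wikipedia_converter.py may differ, for example in which subdomains it allows.

```python
import re

# Assumed pattern: any single subdomain label directly under wikipedia.org.
_WIKIPEDIA_URL = re.compile(r"^https?://[a-z0-9-]+\.wikipedia\.org/", re.IGNORECASE)


def is_wikipedia_url(url):
    """Return True for URLs such as https://en.wikipedia.org/wiki/..."""
    return _WIKIPEDIA_URL.match(url) is not None
```

Anchoring at the start and requiring the slash right after wikipedia.org prevents look-alike hosts such as en.wikipedia.org.example.com from matching.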

BingSerpConverter

The BingSerpConverter (packages/markitdown/src/markitdown/converters/_bing_serp_converter.py23-121) extracts organic search results from Bing Search Engine Results Pages (SERPs).

Acceptance: Matches URLs starting with https://www.bing.com/search?q= (_bing_serp_converter.py43-45).
Processing: It targets elements with class b_algo (_bing_serp_converter.py83).
URL Rewriting: It decodes the base64-encoded destination URL carried in the u query parameter of each result link and uses it in place of the redirect URL (_bing_serp_converter.py88-106).
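
The URL rewriting step can be sketched with the standard library as follows. The function name is hypothetical, and two details are assumptions rather than facts from the source: that the u value starts with a two-character prefix to be skipped, and that the payload uses the URL-safe base64 alphabet without padding.

```python
import base64
import binascii
from urllib.parse import parse_qs, urlparse


def decode_bing_redirect(href: str) -> str:
    """Return the destination hidden in a redirect link, or href unchanged."""
    query = parse_qs(urlparse(href).query)
    if "u" not in query:
        return href
    # Assumption: skip a two-character prefix in front of the payload.
    payload = query["u"][0][2:].strip()
    # Restore the padding that base64 decoding expects.
    payload += "=" * (-len(payload) % 4)
    try:
        return base64.urlsafe_b64decode(payload).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return href  # leave the link untouched if it does not decode cleanly
```

Falling back to the original href keeps the output usable even if the redirect format changes.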

EPUB Converter

EpubConverter Class

The EpubConverter class (packages/markitdown/src/markitdown/converters/_epub_converter.py26-147) processes EPUB files by extracting their internal HTML structure.

Conversion Process (_epub_converter.py53-131):

Metadata: Parses content.opf (_epub_converter.py69) to extract Dublin Core metadata (title, authors, etc.) using minidom.
Spine: Identifies the reading order of files from the <spine> element and its <itemref> children (_epub_converter.py87-88).
Sequential Conversion: Iterates through the spine files and uses an internal HtmlConverter instance to convert each XHTML/HTML file to Markdown (_epub_converter.py102-116).
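
The metadata and spine steps above can be sketched against an OPF document held in a string. This is an illustration under assumptions: the function name is hypothetical, only the title is extracted, and xml.dom.minidom from the standard library is used directly where the converter depends on defusedxml. Resolving each itemref through the manifest follows the EPUB package format, in which spine entries reference manifest items by id.

```python
from xml.dom import minidom


def read_opf(opf_xml: str):
    """Return (title, ordered list of content hrefs) from an OPF document."""
    doc = minidom.parseString(opf_xml)

    # Dublin Core title, if present.
    titles = doc.getElementsByTagName("dc:title")
    title = None
    if titles and titles[0].firstChild is not None:
        title = titles[0].firstChild.data

    # The manifest maps item ids to file paths inside the archive.
    manifest = {
        item.getAttribute("id"): item.getAttribute("href")
        for item in doc.getElementsByTagName("item")
    }

    # The spine lists item ids in reading order.
    spine = [
        manifest[ref.getAttribute("idref")]
        for ref in doc.getElementsByTagName("itemref")
        if ref.getAttribute("idref") in manifest
    ]
    return title, spine
```

Each path in the returned list would then be read from the archive and passed to the HTML conversion step in order.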

Sources:

packages/markitdown/src/markitdown/converters/_epub_converter.py26-147

Custom Markdownify Integration

_CustomMarkdownify Class

The _CustomMarkdownify class (packages/markitdown/src/markitdown/converters/_markdownify.py8-127) extends markdownify.MarkdownConverter to enforce specific formatting rules for MarkItDown.

Key Customizations:

Heading Style: Forced to ATX style (#, ##) by default (_markdownify.py19).
Link Safety: Drops hyperlinks whose scheme is not http, https, or file, which also removes javascript: links (_markdownify.py46-66).
Image Handling: Truncates large data: URIs in <img> tags to prevent Markdown bloat (_markdownify.py107-108).
Input Elements: Converts <input type="checkbox"> to [ ] or [x] (_markdownify.py112-123).
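
The rules above can be illustrated in isolation with a few standalone helpers. These are hypothetical functions, not methods of _CustomMarkdownify; keeping only the link text for a rejected scheme and truncating a data: URI at its first comma are assumptions made for this sketch.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "file"}


def render_link(text, href):
    """Emit a Markdown link only for allowed schemes or scheme-less URLs."""
    try:
        scheme = urlparse(href).scheme.lower()
    except ValueError:
        return text  # malformed URL: keep the text, drop the target
    if scheme and scheme not in ALLOWED_SCHEMES:
        return text  # e.g. javascript: links lose their target
    return f"[{text}]({href})"


def shorten_data_uri(src):
    """Keep only the header of a data: URI so inline payloads do not bloat output."""
    if src.startswith("data:"):
        return src.split(",")[0] + "..."
    return src


def render_checkbox(checked):
    """Render a checkbox input as a Markdown task marker."""
    return "[x]" if checked else "[ ]"
```

For example, render_link("click", "javascript:alert(1)") yields just the text click, while an https link is rendered as a normal Markdown link.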

Sources:

packages/markitdown/src/markitdown/converters/_markdownify.py8-127

Integration with Core System

[Diagram: Mapping Web Concepts to Code Entities]
