HTML main-content extraction for Rust. Give readex a &str of HTML and
it returns the article body, title, byline, publish date, language, and
~15 other metadata fields — no network I/O, no JavaScript rendering, no
encoding detection. Pure synchronous string and DOM work, suitable for
embedding anywhere from a desktop tool to a server pipeline.
[dependencies]
readex = "0.19"use readex::{extract, Extracted};
let html = r#"
<html>
<head><title>Hello readex</title></head>
<body>
<article>
<h1>Hello readex</h1>
<p>This is the body of an article. It contains enough words
that the extractor will consider it substantive content.</p>
<p>A second paragraph adds more text so the scorer has signal.</p>
</article>
</body>
</html>
"#;
let Extracted { title, text, .. } = extract(html, None).expect("extraction failed");
assert_eq!(title.as_deref(), Some("Hello readex"));
assert!(text.contains("body of an article"));Real web pages come wrapped in navigation, cookie banners, share widgets, and
comment sections. readex strips the chrome and returns just the body and
the metadata it can recover:
use readex::extract;
let html = r#"
<html lang="en">
<head>
<title>Why the bridge collapsed — The Daily Example</title>
<meta property="og:site_name" content="The Daily Example">
<meta name="author" content="Jane Reporter">
<meta property="article:published_time" content="2026-05-24T09:30:00Z">
</head>
<body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<aside class="cookie-banner">We use cookies. <button>OK</button></aside>
<article>
<h1>Why the bridge collapsed</h1>
<p class="byline">By Jane Reporter, 24 May 2026</p>
<p>Investigators arrived on site shortly after dawn and began
sampling the steelwork for fatigue cracks.</p>
<p>The bridge, opened in 1972, had been scheduled for inspection
next month. Engineers say the failure mode is consistent with
corrosion at the western anchorage.</p>
</article>
<section class="comments">
<h3>Comments (412)</h3>
<p>"Knew this would happen" — anonymous</p>
</section>
<footer>© 2026 The Daily Example</footer>
</body>
</html>
"#;
let result = extract(html, Some("https://example.com/news/bridge")).unwrap();
assert_eq!(result.title.as_deref(), Some("Why the bridge collapsed"));
assert_eq!(result.byline.as_deref(), Some("Jane Reporter"));
assert_eq!(result.site_name.as_deref(), Some("The Daily Example"));
assert_eq!(result.language.as_deref(), Some("en"));
assert!(result.published_time.is_some());
assert!(result.text.contains("Investigators arrived on site"));
assert!(!result.text.contains("cookie")); // banner stripped
assert!(!result.text.contains("Home")); // nav stripped
assert!(!result.text.contains("Knew this would")); // comments strippedreadex carries the lineage of three well-validated extractors:
| Origin | Role inside readex |
|---|---|
| Mozilla Readability (JS) | Article-scoring core — the M2 port preserves the full _grabArticle / _prepArticle / flag-sieve pipeline. |
| Trafilatura (Python) | The M3 cascade — own → readability fork → jusText — with the 7-branch arbiter, dedup gate, and sanitize post-pass. |
| htmldate (Python) | Publication-date extraction with the same precedence rules as upstream. |
Each is a clean-room reimplementation in Rust; the upstream Python and JavaScript projects are the differential-test oracles, not vendored code.
| Function | Purpose |
|---|---|
extract |
Default extraction. Returns an Extracted with title, body text, canonical URL, language, byline, excerpt, site name, published time, categories, tags, image, license, hostname, and (optionally) sanitised HTML. |
extract_with |
extract(html, base_url) plus a third &Options parameter (so extract_with(html, base_url, &Options::default()) is exactly equivalent to extract(html, base_url)). Lets you opt into sanitised HTML output, set a minimum word-count threshold, or request a YAML metadata header. |
extract_to_markdown |
Body as Markdown — Trafilatura's output_format="markdown". |
extract_to_txt |
Plain-text body — Trafilatura's output_format="txt". |
extract_to_json / extract_to_csv / extract_to_xml / extract_to_tei |
Structured output formats. |
extract_via_readability |
Forces the M2 Mozilla-Readability path (older, simpler, no Trafilatura cascade). Useful when you specifically need that algorithm's output shape. |
extract and extract_with(.., .., &Options::default()) are byte-identical by
construction — extract is literally a one-line delegate, so the two cannot
drift apart.
There are already a handful of HTML-extraction crates on crates.io. Honest positioning vs. the obvious alternatives:
readex |
readability |
dom_content_extraction |
|
|---|---|---|---|
| Algorithms | Readability + Trafilatura cascade + htmldate | Readability only | DOM-centric (different family) |
| Metadata fields | ~15 (title, byline, language, dates, OG/Twitter/JSON-LD, categories, tags, image, license, hostname …) | Title + summary | Body text only |
| Output formats | text, sanitised HTML, Markdown, TXT, JSON, CSV, XML, TEI | text only | text only |
| Differential parity testing | Yes — 51-URL corpus + 50K broad sweep, every release | No | No |
| Hard pin on parser versions | Yes (html5ever 0.39.0, plus a documented "parser-equivalence fence") |
No | No |
| Edition / MSRV | 2024 / 1.85 | 2018 / older | 2021 / older |
| Comments extraction (Reddit/vBulletin/etc.) | Yes (via Trafilatura) | No | No |
| Date extraction | Yes (via htmldate) | No | No |
If your input is well-structured English-language articles and you want one
algorithm with no extra moving parts, readability may be all you need.
readex exists because real-world corpora (SEC filings, regulator
publications, multilingual news, low-template blogs, hub/index pages) defeat
single-algorithm extractors — the Trafilatura cascade was designed
specifically for that long tail.
readex is developed against a differential-test harness that runs every
benchmark URL through three extractors in parallel — readex, Mozilla
Readability (via Node), and Trafilatura (via Python) — and scores agreement
across token sequences and metadata fields. The harness lives in
benchmark/ in the repo (not published as part of the crate) and is
re-run on every release.
Latest verdicts (as of 0.19.0):
| Gate | Corpus | Result |
|---|---|---|
Trafilatura extract_content (Markdown path) |
51 URLs | 48 / 51 byte-equivalent (41 substantive + 7 documented allowlist) |
| Trafilatura plain-text (TXT) path | 51 URLs | 45 / 51 substantive + 5 allowlist + 1 deferred |
| Trafilatura TEI structured output | 51 URLs | 51 / 51 (39 substantive + 12 allowlist) |
Mozilla Readability textContent |
51 URLs | 50 / 51 byte-equivalent vs. jsdom |
| Parser equivalence (rcdom vs. jsdom) | 51 URLs | 51 / 51 byte-equivalent DOM |
| Broad-sweep confidence (Common Crawl) | 50,000 pages | Tail-distribution scan vs. Python Trafilatura |
The "allowlist" entries are documented per-page divergences where readex
and upstream genuinely disagree for traced reasons (e.g. upstream emits a
cookie banner the page lacks chrome-class hints for; or upstream skips a
table the data-table heuristic rescues). They live under
wrk_docs/m{5,7}-allowlist/ in the repo with one Markdown file per
fixture.
If you find a page where readex disagrees with both Readability and
Trafilatura in a way that matters, please file an issue with the URL or
HTML — the harness will pick it up.
- Network fetching.
readextakes a&str. The caller owns HTTP, redirects, SSRF guarding, and encoding detection. - JavaScript rendering.
readexparses the bytes as given. Pages that need JS to render their body need a headless browser upstream. - PDF extraction. HTML only.
- Streaming. The whole document is parsed at once.
These boundaries keep the crate sync, dependency-light, and easy to embed.
0.19.0 is the first public crates.io release. The API surface is:
- Stable:
extract,extract_with,extract_to_markdown,extract_to_txt,extract_to_json,extract_to_csv,extract_to_xml,extract_to_tei,extract_via_readability, plus theExtracted,Options, andExtractErrortypes they use. #[doc(hidden)]internals:readability::*,trafilatura::*,htmldate::*. These are reachable but explicitly not part of the semver contract — they exist for the in-workspace differential test harness. Treat them as private; they can change at any time.
readex is at 0.x — additive minor bumps may add fields to Extracted or
Options; breaking changes (renames, signature changes) will only land on
a 0.X.0 boundary with a clear changelog entry.
readex targets Rust 1.85+ (Rust 2024 edition).
Issues and PRs welcome at https://github.com/0x4D44/readex. For
non-trivial changes, please open an issue first so we can discuss the
approach — readex is gated by a parity-test harness against Readability
and Trafilatura, and the cheapest path through that gate is usually a
quick sketch of intent before code.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally
submitted for inclusion in readex by you, as defined in the Apache-2.0
license, shall be dual-licensed as above, without any additional terms or
conditions.
See NOTICE for attribution to the upstream Readability,
Trafilatura, and htmldate projects whose algorithms readex ports.