Aquileo | ✨ feat(serialize): add to_markdown GFM tree exporter by gaborbernat · Pull Request #191 · tox-dev/turbohtml

gaborbernat · 2026-06-19T01:50:53Z

turbohtml could pull raw .text off a node but had no structured text export, so the common scrape → Markdown pipeline still needed html2text or markdownify as a second dependency. to_markdown() closes that gap: it walks the existing arena tree once in C and emits GitHub-Flavored Markdown covering headings, lists, links, images, emphasis, strikethrough, inline and fenced code, blockquotes, horizontal rules, line breaks, and pipe tables. 📝

The design comes out of a source-level survey across languages: Python's html2text, markdownify, and inscriptis, JohannesKaufmann's Go html-to-markdown, and Rust's htmd, plus the C/C++ layout browsers w3m, lynx, and GNU html2text. They converge on a recursive DOM visit rather than a streaming parse, with the block context threaded through the recursion instead of re-derived by walking parent pointers. Whitespace is the hard part, so a run is never emitted eagerly; the owed space is deferred, and the emphasis markers are deferred with it, which moves a leading or trailing inner space outside the marker rather than producing the invalid **bold ** form. Plain prose is bulk-copied after its first character, the borrow-or-copy fast path htmd uses. Three cases the reference converters get wrong are handled correctly here: an inline code span grows its backtick fence past any backticks in the content, a | inside a table cell is escaped, and a nested ordered list keeps its own counter on the recursion stack rather than in a single field that corrupts on nesting.
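To make the edge cases above concrete, here is a minimal Python sketch of three of them: the growing backtick fence, pipe escaping in table cells, and moving inner whitespace outside an emphasis marker. It is not the C implementation in this PR; the function names `code_span`, `table_cell`, and `emphasis` are hypothetical, and the string-level handling stands in for the deferred-space logic the walk uses.

```python
import re


def code_span(content: str) -> str:
    """Wrap content in a backtick fence one longer than its longest backtick run."""
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    fence = "`" * (longest + 1)
    # Pad with a space when the content itself starts or ends with a backtick,
    # so the content is not read as part of the fence.
    pad = " " if content.startswith("`") or content.endswith("`") else ""
    return f"{fence}{pad}{content}{pad}{fence}"


def table_cell(text: str) -> str:
    """Escape pipes so they are not read as column separators."""
    return text.replace("|", "\\|")


def emphasis(inner: str, marker: str = "**") -> str:
    """Emit the marker around the trimmed text and keep the spaces outside it."""
    core = inner.strip()
    if not core:
        return inner  # whitespace only: nothing to emphasise
    lead = inner[: len(inner) - len(inner.lstrip())]
    trail = inner[len(inner.rstrip()):]
    return f"{lead}{marker}{core}{marker}{trail}"


print(code_span("a`b"))      # fence is two backticks because the content has one
print(table_cell("a|b"))     # pipe is backslash-escaped
print(emphasis("bold "))     # trailing space lands after the closing marker
```

The fence length is computed from the content rather than fixed at one backtick, which is the property the description calls out; the other two helpers are the simplest string-level equivalents of the behaviour described.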

Python stays a thin method that takes the same per-tree critical section .text and .html use. ⚡ The walk holds no state outside its stack frame, so it needs no mutex on the free-threaded build where Go's converter reaches for one. A new markdown benchmark suite measures to_markdown at 40× to over 100× faster than markdownify and html2text, validated against a markdown-it-py round-trip that renders the output back to HTML and checks no visible text was lost. Output is opinionated GFM with no options; layout-aware to_text(layout=...) (the inscriptis/w3m role) is a natural follow-up that the same walk can grow into.
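The round-trip validation can be pictured with a small standard-library sketch. It assumes the Markdown has already been rendered back to HTML (in the benchmark suite that step is done by markdown-it-py) and only shows the "no visible text was lost" comparison. The names `visible_words` and `lost_words` are hypothetical and are not taken from the PR.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_words(html: str) -> list[str]:
    """Return the whitespace-separated words a reader would see."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


def lost_words(source_html: str, rendered_html: str) -> list[str]:
    """Words visible in the source that are missing from the re-rendered HTML."""
    rendered = set(visible_words(rendered_html))
    return [word for word in visible_words(source_html) if word not in rendered]


source = "<h1>Title</h1><p>Some <b>bold</b> text</p><script>x()</script>"
rendered = "<h1>Title</h1>\n<p>Some <strong>bold</strong> text</p>\n"
print(lost_words(source, rendered))
```

A word-level comparison like this tolerates differences in markup and whitespace between the two documents while still flagging dropped text; the actual suite may compare more strictly.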

closes #172

turbohtml could extract raw .text but had no structured text export, so a scrape-to-Markdown pipeline still needed html2text or markdownify as a second dependency. to_markdown() walks the existing arena tree once in C and emits GitHub-Flavored Markdown: headings, lists, links, images, emphasis, strikethrough, inline and fenced code, blockquotes, rules, breaks, and pipe tables.

The design follows the field survey (html2text, markdownify, inscriptis, JohannesKaufmann's Go converter, Rust htmd): a recursive DOM visit, not a streaming parse, with block context threaded through the recursion rather than re-derived by walking parent pointers. Whitespace is the hard part, so a run is never emitted eagerly; the owed space is deferred and the emphasis markers are deferred too, which moves a leading or trailing inner space outside the marker instead of producing the invalid **bold ** form. Plain prose is bulk-copied after its first character. Three cases the reference converters get wrong are handled correctly here: a code span grows its backtick fence past any backticks in the content, a pipe in a table cell is escaped, and a nested ordered list keeps its own counter on the recursion stack.

Python stays a thin method that takes the same per-tree critical section .text and .html use; the walk holds no state outside its stack frame, so it needs no mutex on the free-threaded build where Go's converter reaches for one. The benchmark suite measures it 40x to over 100x faster than markdownify and html2text, which each parse and convert in Python.

closes tox-dev#172
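The nested ordered list case is easiest to see in code. The sketch below is in Python rather than the C of this PR and uses a hypothetical `Node` type instead of the arena tree. The counter is a local variable of each recursive call, so an inner list cannot disturb the numbering of the outer one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str  # "ol", "li", or "#text" (hypothetical minimal tree)
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def render_ol(ol: Node, depth: int = 0) -> list[str]:
    """Render an <ol>; each call owns its counter on its own stack frame."""
    lines: list[str] = []
    counter = 0  # local to this nesting level
    for li in ol.children:
        if li.tag != "li":
            continue
        counter += 1
        label = "".join(c.text for c in li.children if c.tag == "#text").strip()
        lines.append(f"{'   ' * depth}{counter}. {label}")
        for child in li.children:
            if child.tag == "ol":
                # The nested list starts again at 1 without touching `counter`.
                lines.extend(render_ol(child, depth + 1))
    return lines


inner = Node("ol", children=[
    Node("li", children=[Node("#text", "a")]),
    Node("li", children=[Node("#text", "b")]),
])
outer = Node("ol", children=[
    Node("li", children=[Node("#text", "one"), inner]),
    Node("li", children=[Node("#text", "two")]),
])
print("\n".join(render_ol(outer)))
```

With a single shared counter field, the item after the nested list would continue from the inner list's count instead of resuming at 2.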

gaborbernat added the enhancement and area:serializer labels Jun 19, 2026

gaborbernat force-pushed the feat/172 branch from 1e4c11a to 7d1f63d on June 19, 2026 03:57

gaborbernat force-pushed the feat/172 branch from 7d1f63d to ec40555 on June 19, 2026 05:52

gaborbernat wants to merge 1 commit into tox-dev:main from gaborbernat:feat/172
