✨ feat(serialize): add to_markdown GFM tree exporter #191

Open
gaborbernat wants to merge 1 commit into tox-dev:main from gaborbernat:feat/172

@gaborbernat

Member

turbohtml could pull raw .text off a node but had no structured text export, so the common scrape-to-Markdown pipeline still needed html2text or markdownify as a second dependency. to_markdown() closes that gap: it walks the existing arena tree once in C and emits GitHub-Flavored Markdown covering headings, lists, links, images, emphasis, strikethrough, inline and fenced code, blockquotes, horizontal rules, line breaks, and pipe tables. 📝

The design comes out of a source-level survey across languages — the Python html2text, markdownify, and inscriptis, JohannesKaufmann's Go html-to-markdown, and Rust's htmd, plus the C/C++ layout browsers w3m, lynx, and GNU html2text. They converge on a recursive DOM visit rather than a streaming parse, with the block context threaded through the recursion instead of re-derived by walking parent pointers. Whitespace is the hard part, so a run is never emitted eagerly; the owed space is deferred, and the emphasis markers are deferred with it, which moves a leading or trailing inner space outside the marker rather than producing the invalid **bold ** form. Plain prose is bulk-copied after its first character, the borrow-or-copy fast path htmd uses. Three spots the reference converters get wrong are correct here: an inline code span grows its backtick fence past any backticks in the content, a | inside a table cell is escaped, and a nested ordered list keeps its own counter on the recursion stack rather than a single field that corrupts on nesting.
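To make the emphasis rule concrete, here is a minimal Python sketch of the outcome described above. It is not the C walk from this PR: the real exporter defers the owed space and the markers during a single pass, while this illustration works on an already-collected string. The function name `emphasize` and its `marker` parameter are hypothetical.

```python
def emphasize(inner: str, marker: str = "**") -> str:
    """Wrap `inner` in `marker`, hoisting edge whitespace outside the markers.

    Hypothetical helper for illustration only; the PR's exporter reaches the
    same result by deferring whitespace and markers while walking the tree.
    """
    stripped = inner.strip()
    if not stripped:
        # Nothing visible to emphasise: keep at most one space, no markers.
        return " " if inner else ""
    # A leading or trailing inner space is owed to the surrounding text,
    # so it is emitted outside the marker instead of inside it.
    lead = " " if inner[0].isspace() else ""
    trail = " " if inner[-1].isspace() else ""
    return f"{lead}{marker}{stripped}{marker}{trail}"


if __name__ == "__main__":
    print(repr(emphasize("bold ")))   # trailing space moves outside
    print(repr(emphasize(" both ")))  # both edges move outside
```

Under these assumptions, `<b>bold </b>text` would come out as `**bold** text` rather than `**bold **text`, which Markdown parsers do not treat as emphasis.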

Python stays a thin method that takes the same per-tree critical section .text and .html use. ⚡ The walk holds no state outside its stack frame, so it needs no mutex on the free-threaded build where Go's converter reaches for one. A new markdown benchmark suite measures to_markdown at 40× to over 100× faster than markdownify and html2text, validated against a markdown-it-py round-trip that renders the output back to HTML and checks no visible text was lost. Output is opinionated GFM with no options; layout-aware to_text(layout=...) (the inscriptis/w3m role) is a natural follow-up that the same walk can grow into.

closes #172

@gaborbernat added the enhancement (New feature or request) and area:serializer (Serialization, minification, text/markdown export) labels Jun 19, 2026

turbohtml could extract raw .text but had no structured text export, so a
scrape-to-Markdown pipeline still needed html2text or markdownify as a second
dependency. to_markdown() walks the existing arena tree once in C and emits
GitHub-Flavored Markdown: headings, lists, links, images, emphasis,
strikethrough, inline and fenced code, blockquotes, rules, breaks, and pipe
tables.

The design follows the field survey (html2text, markdownify, inscriptis,
JohannesKaufmann's Go converter, Rust htmd): a recursive DOM visit, not a
streaming parse, with block context threaded through the recursion rather than
re-derived by walking parent pointers. Whitespace is the hard part, so a run is
never emitted eagerly; the owed space is deferred and the emphasis markers are
deferred too, which moves a leading or trailing inner space outside the marker
instead of producing the invalid **bold ** form. Plain prose is bulk-copied
after its first character. Three places the reference converters get wrong are
correct here: an inline code span grows its backtick fence past any backticks in
the content, a pipe in a table
cell is escaped, and a nested ordered list keeps its own counter on the
recursion stack.
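
The three fixes listed above can be sketched in a few lines of Python. This is an illustration of the techniques, not the C code in this PR; the helper names `code_span`, `escape_cell`, and `render_ol`, and the `(text, children)` tuple shape for list items, are assumptions made for the example.

```python
import re


def code_span(text: str) -> str:
    """Fence inline code with one more backtick than the longest run inside."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * (longest + 1)
    # CommonMark strips one space on each side, so pad when the content
    # itself starts or ends with a backtick to keep it apart from the fence.
    pad = " " if text.startswith("`") or text.endswith("`") else ""
    return f"{fence}{pad}{text}{pad}{fence}"


def escape_cell(text: str) -> str:
    """Escape pipes so cell content cannot split a GFM table row."""
    return text.replace("|", "\\|")


def render_ol(items, depth=0):
    """Render nested ordered lists; `items` is a list of (text, children).

    The counter is a local variable, so each nesting level gets its own
    copy on the call stack and an inner list cannot disturb the outer one.
    """
    lines = []
    counter = 1
    for text, children in items:
        lines.append(f"{'   ' * depth}{counter}. {text}")
        if children:
            lines.extend(render_ol(children, depth + 1))
        counter += 1
    return lines


if __name__ == "__main__":
    print(code_span("a`b"))
    print(escape_cell("x | y"))
    print("\n".join(render_ol([("a", [("x", []), ("y", [])]), ("b", [])])))
```

Keeping the counter local is the point of the third fix: a converter that stores the current number in one shared field would resume the outer list at the wrong value after the inner list finishes.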

Python stays a thin method that takes the same per-tree critical section .text
and .html use; the walk holds no state outside its stack frame, so it needs no
mutex on the free-threaded build where Go's converter reaches for one. The
benchmark suite measures it 40x to over 100x faster than markdownify and
html2text, which each parse and convert in Python.

closes tox-dev#172

Development

Successfully merging this pull request may close these issues.

✨ feat: Markdown / layout-aware text export (replace html2text / markdownify / inscriptis)