✨ feat(serialize): add to_markdown GFM tree exporter #191
Open
gaborbernat wants to merge 1 commit into
turbohtml could pull raw `.text` off a node but had no structured text export, so the common `scrape→Markdown` pipeline still needed `html2text` or `markdownify` as a second dependency. `to_markdown()` closes that gap: it walks the existing arena tree once in C and emits GitHub-Flavored Markdown covering headings, lists, links, images, emphasis, strikethrough, inline and fenced code, blockquotes, horizontal rules, line breaks, and pipe tables. 📝

The design comes out of a source-level survey across languages: the Python `html2text`, `markdownify`, and `inscriptis`, JohannesKaufmann's Go `html-to-markdown`, and Rust's `htmd`, plus the C/C++ layout browsers `w3m`, `lynx`, and GNU `html2text`. They converge on a recursive DOM visit rather than a streaming parse, with the block context threaded through the recursion instead of re-derived by walking parent pointers. Whitespace is the hard part, so a run is never emitted eagerly; the owed space is deferred, and the emphasis markers are deferred with it, which moves a leading or trailing inner space outside the marker rather than producing the invalid `**bold **` form. Plain prose is bulk-copied after its first character, the borrow-or-copy fast path `htmd` uses. Three spots the reference converters get wrong are correct here: an inline code span grows its backtick fence past any backticks in the content, a `|` inside a table cell is escaped, and a nested ordered list keeps its own counter on the recursion stack rather than a single field that corrupts on nesting.

Python stays a thin method that takes the same per-tree critical section `.text` and `.html` use. ⚡ The walk holds no state outside its stack frame, so it needs no mutex on the free-threaded build where Go's converter reaches for one. A new `markdown` benchmark suite measures `to_markdown` at 40× to over 100× faster than `markdownify` and `html2text`, validated against a `markdown-it-py` round-trip that renders the output back to HTML and checks no visible text was lost. Output is opinionated GFM with no options; layout-aware `to_text(layout=...)` (the `inscriptis`/`w3m` role) is a natural follow-up that the same walk can grow into.

closes #172
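
For reviewers who want the three correctness points from the description in concrete form, here is a minimal Python sketch. It is not the C code in this PR; `code_span`, `table_cell`, and `ordered_list` are hypothetical helper names, and the `(text, children)` list shape is an assumption made only for this illustration.

```python
import re


def code_span(text: str) -> str:
    """Wrap text in an inline code span whose fence outgrows any backtick run inside it."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * (longest + 1)
    # Pad with one space per side when the content touches the fence with a
    # backtick; a Markdown parser strips that single space pair again.
    pad = " " if text.startswith("`") or text.endswith("`") else ""
    return f"{fence}{pad}{text}{pad}{fence}"


def table_cell(text: str) -> str:
    """Escape pipes so cell content cannot be read as a column separator."""
    return text.replace("|", "\\|")


def ordered_list(items, indent=""):
    """Render a nested ordered list given as [(text, children), ...].

    The counter is the loop variable of this call, so every nesting level
    numbers itself independently on the recursion stack.
    """
    lines = []
    for number, (text, children) in enumerate(items, start=1):
        marker = f"{number}. "
        lines.append(f"{indent}{marker}{text}")
        # Children are indented to the width of the parent's marker.
        lines.extend(ordered_list(children, indent + " " * len(marker)))
    return lines
```

Because the list counter is a local of each call rather than a field on a shared converter object, returning from a nested list resumes the outer numbering without any save or restore step.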
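
The deferred-whitespace rule can be sketched in the same spirit. The `InlineWriter` class below and its method names are hypothetical and far simpler than the C walk; it assumes well-nested `open`/`close` calls and only shows how owing a space, and holding back opening markers until the first visible character, keeps inner spaces outside the emphasis markers.

```python
import re


class InlineWriter:
    """Toy inline emitter: whitespace and opening markers are owed, not written eagerly."""

    def __init__(self):
        self.parts = []
        self.owed_space = False  # whitespace was seen but not yet written
        self.pending_open = []   # opening markers waiting for visible text

    def open(self, marker):
        self.pending_open.append(marker)

    def text(self, s):
        for token in re.findall(r"\s+|\S+", s):
            if token.isspace():
                self.owed_space = True
                continue
            # Pay the owed space first, then the deferred markers, so a
            # leading inner space lands before the opening marker.
            if self.owed_space and self.parts:
                self.parts.append(" ")
            self.owed_space = False
            self.parts.extend(self.pending_open)
            self.pending_open.clear()
            self.parts.append(token)

    def close(self, marker):
        if self.pending_open:
            # The element had no visible text: drop its marker entirely.
            self.pending_open.pop()
            return
        # owed_space is left untouched, so a trailing inner space is paid
        # after the closing marker, by whatever text comes next.
        self.parts.append(marker)

    def result(self):
        return "".join(self.parts)


def demo():
    w = InlineWriter()
    w.text("a")
    w.open("**")
    w.text(" bold ")  # inner spaces on both sides, as in <b> bold </b>
    w.close("**")
    w.text("b")
    return w.result()
```

Here `demo()` returns `a **bold** b`: both inner spaces end up outside the markers.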