html-to-md by SuperMD

// html-to-md

Convert HTML to clean markdown.

Paste raw HTML or drop in a URL. SuperMD strips nav, scripts, and other noise, then converts the content to clean markdown your LLM can actually read.

// what gets removed

<script> blocks
<style> sheets
<nav> menus
<header> / <footer>
HTML comments
Inline attributes
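
The removal list above can be sketched in a few lines of Python using only the standard library. This is an illustration of the idea, not SuperMD's actual code: the names `NoiseStripper`, `strip_noise`, `DROP_TAGS`, and `KEEP_ATTRS` are hypothetical, and keeping `href` and `src` is an assumption made here so that links and images do not lose their targets.

```python
from html import escape
from html.parser import HTMLParser

# Elements whose whole subtree is dropped, mirroring the list above.
DROP_TAGS = {"script", "style", "nav", "header", "footer"}
# Assumption: these two attributes survive so links and images keep their targets.
KEEP_ATTRS = {"href", "src"}


class NoiseStripper(HTMLParser):
    """Re-emits HTML without dropped subtrees, comments, or most attributes."""

    def __init__(self):
        super().__init__()  # entities in text arrive already decoded
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        if tag not in DROP_TAGS:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so HTML comments are silently dropped.


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

The sketch assumes reasonably well-formed input; unbalanced `<nav>` or `<header>` tags would leave the skip counter stuck, which a production cleaner would need to guard against.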

// why html to markdown

HTML is for browsers. Markdown is for LLMs.

Raw HTML burns context window on tags, attributes, inline styles, and script blocks that the model has to ignore. The HTML of a typical blog post is 3–5× larger than its markdown equivalent, and the model still has to extract the same content from it.

Converting to markdown first strips the noise and gives your LLM a clean, linear version of the content — the same information, fewer tokens, better results.

// html

<div class="post-content container mx-auto">
  <h1 class="text-4xl font-bold mb-4 tracking-tight">
    Hello World
  </h1>
  <p class="text-gray-600 leading-7">
    Content here...
  </p>
</div>

~180 tokens

// markdown

# Hello World

Content here...

~12 tokens
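
To make the before/after comparison concrete, here is a minimal Python sketch that turns headings and paragraphs into markdown and ignores everything else. It is a stand-in for illustration only: the tool itself uses Turndown in the browser (see the FAQ), and the names `MiniMarkdown` and `html_to_markdown` are hypothetical. Links, lists, code blocks, and inline formatting are deliberately left out.

```python
from html.parser import HTMLParser

# Markdown prefix for each heading level: h1 -> "#", h2 -> "##", and so on.
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}


class MiniMarkdown(HTMLParser):
    """Collects text from headings and paragraphs; wrappers like <div> are ignored."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished markdown blocks
        self.prefix = None  # prefix of the open block, or None when outside one
        self.text = []      # text pieces of the open block

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.prefix, self.text = HEADINGS[tag] + " ", []
        elif tag == "p":
            self.prefix, self.text = "", []

    def handle_endtag(self, tag):
        if (tag in HEADINGS or tag == "p") and self.prefix is not None:
            # Collapse the indentation and line breaks left over from the HTML.
            content = " ".join("".join(self.text).split())
            if content:
                self.blocks.append(self.prefix + content)
            self.prefix = None

    def handle_data(self, data):
        if self.prefix is not None:
            self.text.append(data)


def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Fed the HTML sample above, `html_to_markdown` returns the two-block markdown shown: the heading line, a blank line, then the paragraph text. The class attributes and the wrapping `<div>` contribute nothing to the output.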

// use cases

When do you convert HTML to markdown?

Web scraping for RAG pipelines

RAG

Scrape pages, strip HTML, get clean markdown chunks ready to embed. Avoids the HTML-parsing step in your ingestion pipeline.
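
One way to get "chunks ready to embed" from converted markdown is to split at headings and break up any section that is still too long. The Python sketch below assumes a character budget (`max_chars`) as the size limit; that parameter and the function names are hypothetical, and a real pipeline would more likely count tokens with the embedding model's own tokenizer.

```python
import re


def split_sections(markdown: str) -> list[str]:
    """Split markdown into sections, starting a new one at each heading line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Return one chunk per section, packing paragraphs greedily when a section is too long."""
    chunks = []
    for section in split_sections(markdown):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        buf = ""
        for para in section.split("\n\n"):
            candidate = f"{buf}\n\n{para}" if buf else para
            if len(candidate) > max_chars and buf:
                chunks.append(buf)  # budget exceeded: close the current chunk
                buf = para
            else:
                buf = candidate
        if buf:
            chunks.append(buf)
    return chunks
```

Two known limits of this sketch: a line starting with `# ` inside a fenced code block is treated as a heading, and a single paragraph longer than `max_chars` is kept whole rather than cut mid-sentence.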

Feeding docs to your LLM

Context

Paste API docs, blog posts, or changelogs as markdown instead of HTML. The model focuses on the content, not the DOM structure.

Converting CMS content

CMS

Export from WordPress, Notion, or any CMS as HTML — convert to markdown for storage, version control, or LLM consumption.

Documentation ingestion

Docs

Technical docs are often HTML-heavy. Convert them to clean markdown before adding to a knowledge base or feeding to an AI assistant.

// faq

Frequently asked questions

Does the HTML get sent to a server?

Only when using the URL tab — the server fetches the page on your behalf to avoid CORS restrictions. When you paste HTML directly, conversion runs entirely in your browser using the Turndown library. No HTML is stored.

What does 'fetch URL' do exactly?

The server fetches the public URL, extracts the main content block (article, main, or body), strips scripts/styles/nav/footer, and returns the HTML. Turndown then converts it to markdown in your browser.
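
The fetch-and-extract step described above might look roughly like the following Python sketch. It is not the service's real implementation: `fetch_html`, `BlockCapture`, `extract_main_block`, and the User-Agent string are all hypothetical, and trying `article`, then `main`, then `body` in that order is an assumption based on the order in which the answer lists them.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a public URL server-side (no user cookies are sent)."""
    req = Request(url, headers={"User-Agent": "html-to-md-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class BlockCapture(HTMLParser):
    """Captures the inner HTML of the first <target> element in the document."""

    def __init__(self, target: str):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.target, self.depth = target, 0
        self.found = self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == self.target:
            self.depth += 1
            self.found = True
            if self.depth == 1:
                return  # the outer target tag itself is not part of the result
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.depth:
            return
        if tag == self.target:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.handle_data(f"&{name};")

    def handle_charref(self, name):
        self.handle_data(f"&#{name};")


def extract_main_block(html: str) -> str:
    """Return the inner HTML of the first article, else main, else body."""
    for target in ("article", "main", "body"):
        parser = BlockCapture(target)
        parser.feed(html)
        parser.close()
        if parser.found:
            return "".join(parser.parts).strip()
    return html  # no recognisable container: fall back to the whole input
```

Calling `extract_main_block(fetch_html(url))` would chain the two steps. Stripping scripts, styles, nav, and footer would then run on the extracted block before it is returned to the browser for conversion.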

Can I convert private or authenticated pages?

Not via the URL tab: the fetch runs server-side without your session cookies. For authenticated content, open the page source (Ctrl+U in Chrome on Windows or Linux, Cmd+Option+U on macOS), copy it, and paste it into the Paste tab instead.

How does it compare to Pandoc?

Pandoc produces more complete conversions for complex HTML but requires a local install. This tool runs entirely in the browser, handles common web page patterns well, and is optimised for LLM consumption rather than document fidelity.