// guide

PDF to Markdown for AI.
The complete RAG and LLM workflow guide.

PDFs can waste 60–70% of your AI token budget on formatting overhead the model can't use. This guide covers why markdown wins, how to convert accurately, how to chunk for RAG, and how the major conversion tools compare.

// quick answer

Converting a PDF to markdown before feeding it to an LLM reduces token usage by 40–95% depending on the document. A 50-page report drops from ~75,000 tokens to ~21,000. For RAG pipelines, markdown preserves semantic structure (headings, tables, lists) that enables accurate chunking — improving both retrieval precision and model comprehension.

Why PDFs are a poor format for AI

PDF was designed for print — for fixed-layout visual rendering. AI models don't render. They tokenise. Everything in a PDF that makes it look good on paper becomes token noise in an LLM context.

Repeated headers and footers

A 50-page PDF with a title, page number, and company name in the header repeats that content 50 times. The model pays tokens for every repetition.

Embedded font metadata

PDFs embed font names, character mappings, and glyph tables. When a font's character mapping is missing or incomplete, extracted text can come out as garbled sequences: token noise with zero semantic value.

Layout coordinates

PDF text content includes x,y coordinates for positioning. Extraction tools strip most of these, but the resulting text often has broken word spacing and fragmented sentences.

Ligature artefacts

Common character combinations like 'fi', 'fl', 'ffi' are often stored as single special characters in PDFs. They can extract as ligature code points (such as U+FB01) that tokenise poorly, or as broken characters.
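One common way to repair extracted ligatures is Unicode compatibility normalisation. The sketch below uses Python's standard `unicodedata` module; the function name `expand_ligatures` and the sample sentence are illustrative. NFKC also rewrites other compatibility characters (superscripts, full-width forms, non-breaking spaces), so check that this is acceptable for your documents before applying it everywhere.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 (fi ligature) into plain letters."""
    # NFKC applies compatibility decomposition, then canonical composition.
    return unicodedata.normalize("NFKC", text)


# U+FB03 = ffi ligature, U+FB01 = fi ligature, U+FB02 = fl ligature
sample = "The o\ufb03ce \ufb01led a \ufb02oor plan."
print(expand_ligatures(sample))  # The office filed a floor plan.
```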

Column and table confusion

Multi-column layouts and tables often extract as interleaved text — column 1 line 1 followed by column 2 line 1, rather than complete paragraphs. Models interpret this as incoherent content.

Token counts for the same 50-page report:

Raw PDF: 75,000 tokens

Extracted plain text: 32,000 tokens

Clean Markdown: 21,000 tokens
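The savings implied by these three counts can be worked out directly. A minimal sketch, using only the figures quoted above; the helper name `reduction_pct` is illustrative.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved when going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)


counts = {"raw_pdf": 75_000, "plain_text": 32_000, "clean_markdown": 21_000}

print(reduction_pct(counts["raw_pdf"], counts["plain_text"]))         # 57.3
print(reduction_pct(counts["raw_pdf"], counts["clean_markdown"]))     # 72.0
print(reduction_pct(counts["plain_text"], counts["clean_markdown"]))  # 34.4
```

So in this example, plain-text extraction alone removes a little over half of the tokens, and cleaning up to markdown removes roughly another third of what remains.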

Why markdown is the right format for LLMs

Markdown was designed for exactly what LLMs need: a lightweight way to express document structure using plain text. The syntax is the structure — no binary encoding, no embedded metadata, no layout coordinates.

Structure survives chunking

Heading hierarchy (H1 → H2 → H3) persists in every chunk. RAG retrieval can use heading context to improve relevance.

Tables are readable

Markdown tables are plain text. Models parse them accurately without column-interleaving artefacts from PDF extraction.

Code blocks are clean

Fenced code blocks with language tags (```python) tell the model exactly how to interpret the content.

Token-minimal syntax

A Markdown heading marker costs 2–5 characters (# or ## plus a space). The equivalent HTML tags cost at least 9 (<h1>...</h1>). Across a 50-page document, the savings compound.

PDF-to-markdown tool comparison

Not all converters are equal. The output quality varies significantly by tool — and for RAG pipelines, output quality directly affects retrieval accuracy.

SuperMD markitdown

Browser / SaaS

LLM-optimised output with per-model profiles (Claude, GPT-4o, Gemini), token savings display, no upload on free tier

Best for: In-browser conversion, privacy-sensitive docs

pymupdf4llm

Python library

Fast, good table support, native PDF parsing. Returns markdown with heading hierarchy preserved.

Best for: Python pipelines, batch processing

Marker

Python / ML

ML-based layout analysis. Best accuracy on complex layouts (academic papers, multi-column reports). Slower than pymupdf4llm.

Best for: Complex layouts, academic documents

Microsoft MarkItDown

Python / Open source

Converts PDF, DOCX, XLSX, PPTX, images, audio, YouTube URLs → markdown. Broad format support.

Best for: Multi-format pipelines (not just PDF)

Mistral Document AI

API / Cloud

OCR-based, handles scanned PDFs and images. High accuracy on non-native PDFs.

Best for: Scanned documents, handwritten content

RAG chunking after markdown conversion

Markdown conversion is step one. For RAG pipelines, the converted markdown then needs to be chunked into segments for vector embedding. The chunking strategy significantly affects retrieval quality.

Fixed-size chunking

256–512 tokens, 10–15% overlap

Pros

Simple to implement. Predictable chunk sizes for embedding models.

Cons

Splits mid-sentence or mid-section. Loses structural context. Lower precision.
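A minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would use the tokenizer of its embedding model), and the function name and defaults are illustrative. The default of 64 overlapping tokens on a 512-token chunk is 12.5%, inside the 10–15% range given above.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, size=4, overlap=1):
    print(" ".join(chunk))
```

Note how the windows ignore sentence and section boundaries entirely, which is the weakness listed under Cons.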

Semantic / heading-based chunking

200–800 tokens, 0–5% overlap

Pros

Chunks align to document sections. Heading context preserved. Better retrieval accuracy.

Cons

Variable chunk sizes. Requires markdown heading hierarchy.

Max-Min Semantic Chunking

Variable size, 0% overlap

Pros

Embeds text first, uses semantic similarity for boundaries. Highest retrieval precision. Reduces vectors per document.

Cons

Computationally expensive. Requires embedding model at chunk time.
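To make the idea of similarity-driven boundaries concrete, here is a simplified sketch. It is not the published Max-Min algorithm: the boundary rule below (a sentence joins the current chunk only if its best similarity to the chunk is at least the chunk's lowest internal similarity) and the `floor` parameter are assumptions chosen for illustration. Sentence vectors are supplied by the caller, so any embedding model can be plugged in.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], vectors: list[list[float]], floor: float = 0.5) -> list[list[str]]:
    """Group consecutive sentences, starting a new chunk when similarity drops."""
    if len(sentences) != len(vectors):
        raise ValueError("need exactly one vector per sentence")
    chunks: list[list[str]] = []
    current: list[int] = []  # indices of sentences in the chunk being built
    for i in range(len(sentences)):
        if current:
            # Best similarity between the new sentence and the current chunk.
            best = max(cosine(vectors[i], vectors[j]) for j in current)
            if len(current) == 1:
                threshold = floor  # no internal pairs yet, fall back to the floor
            else:
                # Lowest pairwise similarity already inside the chunk.
                threshold = min(
                    cosine(vectors[a], vectors[b])
                    for a in current for b in current if a < b
                )
            if best < threshold:
                chunks.append([sentences[j] for j in current])
                current = []
        current.append(i)
    if current:
        chunks.append([sentences[j] for j in current])
    return chunks


# Toy 2-D vectors standing in for real embeddings: two topics, two sentences each.
toy_sentences = ["A1", "A2", "B1", "B2"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(toy_sentences, toy_vectors))  # [['A1', 'A2'], ['B1', 'B2']]
```

The pairwise comparisons inside each chunk are what make this family of methods more expensive than heading-based splitting.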

Recommendation: Start with semantic/heading-based chunking at 256–512 tokens. Markdown's heading hierarchy makes this straightforward: split on H2/H3 boundaries first, then by token count within sections. This typically outperforms fixed-size chunking with minimal added complexity.

The complete PDF → AI workflow

01

Convert PDF to clean markdown

Use a tool that preserves heading hierarchy, handles tables accurately, and strips formatting overhead. For in-browser work: SuperMD markitdown. For Python pipelines: pymupdf4llm or Marker. Verify the output — check that headings, tables, and code blocks came through correctly.
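A minimal sketch of this step for a Python pipeline. `convert_pdf` calls `pymupdf4llm.to_markdown`, which must be installed separately; `structure_report` is a hypothetical helper (not part of any library) that counts headings, table rows, and fenced code blocks so you can sanity-check the output against what you see in the PDF.

```python
import re

FENCE = "`" * 3  # a markdown code fence marker


def convert_pdf(path: str) -> str:
    """Convert a PDF file to a markdown string with pymupdf4llm."""
    import pymupdf4llm  # imported lazily so the checker below runs without it
    return pymupdf4llm.to_markdown(path)


def structure_report(markdown: str) -> dict[str, int]:
    """Count the structural elements that should survive conversion."""
    report = {"headings": 0, "table_rows": 0, "code_blocks": 0}
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_fence:
                report["code_blocks"] += 1  # count each opening fence once
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore '#' comments and pipes inside code
        if re.match(r"#{1,6}\s+\S", stripped):
            report["headings"] += 1
        elif len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            report["table_rows"] += 1  # includes the |---| separator row
    return report


sample_md = (
    "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    + FENCE + "python\n# not a heading\n" + FENCE + "\n## Sub\n"
)
print(structure_report(sample_md))
```

A report showing zero headings or zero table rows for a document that visibly has them is a quick signal to try a different converter.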

02

Clean and normalise the markdown

Remove repeated boilerplate (headers, footers, page numbers). Standardise heading levels — if the PDF has inconsistent heading styles, normalise to H1 → H2 → H3 hierarchy. Strip any remaining extraction artefacts (garbled characters, broken lines).
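The clean-up described above could be sketched roughly as follows. It assumes the converter gives you one text string per page; the function names, the `min_share` cut-off, and the page-number pattern are illustrative choices rather than a standard. The heading normaliser is deliberately simple and does not skip `#` comment lines inside fenced code blocks.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "page 7 of 50".
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop page numbers and lines that recur on at least `min_share` of pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= cutoff}
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line.strip())
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned


def normalise_headings(markdown: str) -> str:
    """Shift all heading levels so the shallowest heading becomes H1."""
    levels = [len(m.group(1)) for m in re.finditer(r"^(#{1,6})\s", markdown, re.MULTILINE)]
    if not levels:
        return markdown
    shift = min(levels) - 1
    return re.sub(
        r"^(#{1,6})(?=\s)",
        lambda m: "#" * (len(m.group(1)) - shift),
        markdown,
        flags=re.MULTILINE,
    )


pages = ["ACME Corp\nIntro text\n1", "ACME Corp\nMore text\n2", "ACME Corp\nFinal text\nPage 3 of 3"]
print(strip_repeated_lines(pages))         # ['Intro text', 'More text', 'Final text']
print(normalise_headings("## A\n\n### B\n"))
```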

03

Chunk using heading boundaries

Split the document into sections at heading boundaries. Each section becomes one or more chunks. If a section exceeds 512 tokens, split further using sentence boundaries. Keep the parent heading as context prefix for each sub-chunk: '## Section Name\n\n[chunk content]'.
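A minimal sketch of this chunking step. It assumes whitespace-separated words as an approximation of tokens and a naive sentence splitter; both would be swapped for your real tokenizer and sentence segmenter. The function names are illustrative, and headings inside fenced code blocks are not handled.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def count_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading line, body text) pairs, one per heading boundary."""
    sections, heading, body = [], "", []
    for line in markdown.splitlines():
        if HEADING.match(line):
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def chunk_markdown(markdown: str, max_tokens: int = 512) -> list[str]:
    """Chunk by heading, then by sentence for sections over `max_tokens`."""
    chunks = []
    for heading, body in split_sections(markdown):
        if not body:
            continue
        if count_tokens(body) <= max_tokens:
            pieces = [body]
        else:
            pieces, current = [], []
            for sentence in SENTENCE_END.split(body):
                if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
                    pieces.append(" ".join(current))
                    current = []
                current.append(sentence)
            if current:
                pieces.append(" ".join(current))
        for piece in pieces:
            # Keep the section heading as a context prefix on every sub-chunk.
            chunks.append(f"{heading}\n\n{piece}" if heading else piece)
    return chunks


doc = "## Setup\n\nOne two three. Four five six. Seven eight nine.\n\n## Usage\n\nShort body.\n"
for chunk in chunk_markdown(doc, max_tokens=6):
    print(repr(chunk))
```

With the tiny `max_tokens=6` limit used here, the Setup section is split into two sub-chunks that both start with `## Setup`, while the Usage section stays whole.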

04

Embed and index

Embed each chunk using your model (text-embedding-3-small, Gemini text-embedding, Voyage-3, etc.). Store in a vector database (Pinecone, Weaviate, Qdrant, pgvector). Include metadata: source document, page range, section heading.
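One possible shape for the records you store, sketched without tying it to a specific provider or database. The `embed` argument is a hypothetical callable you would implement as a thin wrapper around your embedding API; `ChunkRecord`, `build_records`, and the metadata keys are illustrative names, not any vector database's schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChunkRecord:
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict[str, str]


def build_records(
    chunks: list[dict[str, str]],
    embed: Callable[[str], list[float]],
    source: str,
) -> list[ChunkRecord]:
    """Embed each chunk and attach source, page range, and section heading."""
    records = []
    for i, chunk in enumerate(chunks):
        records.append(
            ChunkRecord(
                chunk_id=f"{source}#{i}",
                text=chunk["text"],
                vector=embed(chunk["text"]),
                metadata={
                    "source": source,
                    "pages": chunk["pages"],
                    "heading": chunk["heading"],
                },
            )
        )
    return records


def fake_embed(text: str) -> list[float]:
    """Stand-in embedding for demonstration only; replace with a real model call."""
    return [float(len(text)), 1.0]


demo_chunks = [
    {"text": "## Setup\n\nInstall the tool.", "pages": "1-2", "heading": "Setup"},
    {"text": "## Usage\n\nRun the tool.", "pages": "3", "heading": "Usage"},
]
for record in build_records(demo_chunks, fake_embed, source="report.pdf"):
    print(record.chunk_id, record.metadata)
```

Storing the heading and page range alongside each vector lets you filter at query time and cite the source location in answers.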

05

Retrieve and inject as markdown

At query time, retrieve top-k relevant chunks. Inject them into your LLM prompt as markdown — preserve the heading and bullet structure. The model reads it more accurately than plain text and uses the hierarchy to understand content relationships.
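A minimal in-memory sketch of retrieval and prompt assembly. A real deployment would delegate the similarity search to the vector database; the brute-force cosine ranking, the function names, and the prompt wording here are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], index: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    """Return the k markdown chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks as markdown, keeping headings and bullets intact."""
    context = "\n\n---\n\n".join(chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


# Toy index of (vector, markdown chunk) pairs standing in for a vector database.
index = [
    ([1.0, 0.0], "## A\n\n- alpha"),
    ([0.0, 1.0], "## B\n\n- beta"),
    ([0.7, 0.7], "## C\n\n- gamma"),
]
hits = top_k([1.0, 0.1], index, k=2)
print(build_prompt("What is alpha?", hits))
```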

Frequently asked questions

Why is markdown better than PDF for AI?

PDFs carry extensive non-content overhead: repeated headers and footers on every page, embedded font metadata, layout coordinates, ligature artefacts, and binary encoding. An LLM pays tokens for whatever survives extraction. Markdown carries only content (headings, paragraphs, tables, code) with minimal syntax overhead. A 50-page PDF that costs 75,000 tokens becomes approximately 21,000 tokens as clean markdown, a 72% reduction.

What is the best PDF to markdown converter for LLMs?

For in-browser use with LLM-optimised output (model-specific profiles, token savings display): SuperMD markitdown. For Python-based pipelines: pymupdf4llm (fast, good table support) or Marker (ML-based, highest accuracy on complex layouts). For Microsoft Office files in addition to PDF: Microsoft's open-source MarkItDown library. For scanned documents that need OCR: Mistral Document AI.

What is the optimal chunk size for RAG after PDF-to-markdown conversion?

The optimal RAG chunk size after markdown conversion is typically 128–512 tokens, with 0–15% overlap. Chunks of 200–400 tokens tend to give the best balance of retrieval precision and context richness. Larger chunks (800+ tokens) reduce precision: the retriever returns more noise alongside relevant content. Semantic chunking (splitting on heading boundaries and topic shifts rather than fixed token counts) generally outperforms fixed-size chunking.

How much does PDF-to-markdown conversion reduce token costs?

Token reduction from PDF-to-markdown conversion ranges from 40% to 95% depending on the PDF. Simple text-heavy documents: 40–60% reduction. Reports with tables and headers: 60–75% reduction. Scanned or visually complex documents: up to 95% after OCR extraction and cleanup. For RAG applications, the reduction also improves retrieval accuracy since semantic structure is preserved.

Does converting PDF to markdown lose any information?

Well-executed PDF-to-markdown conversion preserves all semantically meaningful content: headings, paragraphs, tables, lists, code blocks, and image captions. What is intentionally discarded is non-semantic overhead: page numbers, repeated headers/footers, layout coordinates, font metadata, and visual formatting. For AI consumption, this discarded content was noise — the model cannot use it meaningfully anyway.
