// guide
PDF to Markdown for AI.
The complete RAG and LLM workflow guide.
PDFs waste 60–70% of your AI token budget on formatting overhead the model can't use. This guide covers why markdown wins, how to convert accurately, optimal RAG chunking, and a comparison of every major tool.
// quick answer
Converting a PDF to markdown before feeding it to an LLM reduces token usage by 40–95% depending on the document. A 50-page report drops from ~75,000 tokens to ~21,000. For RAG pipelines, markdown preserves semantic structure (headings, tables, lists) that enables accurate chunking — improving both retrieval precision and model comprehension.
Why PDFs are a poor format for AI
PDF was designed for print — for fixed-layout visual rendering. AI models don't render. They tokenise. Everything in a PDF that makes it look good on paper becomes token noise in an LLM context.
Repeated headers and footers
A 50-page PDF with a title, page number, and company name in the header repeats that content 50 times. The model pays tokens for every repetition.
Embedded font metadata
PDFs embed font names, character mappings, and glyph tables. When those mappings are missing or wrong, text extraction can produce garbled sequences: token noise with zero semantic value.
Layout coordinates
PDF text content includes x,y coordinates for positioning. Extraction tools strip most of these, but the resulting text often has broken word spacing and fragmented sentences.
Ligature artefacts
Common character combinations like 'fi', 'fl', 'ffi' are often stored as single special characters in PDFs. They extract as unrecognised tokens or broken characters.
Column and table confusion
Multi-column layouts and tables often extract as interleaved text — column 1 line 1 followed by column 2 line 1, rather than complete paragraphs. Models interpret this as incoherent content.
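The ligature problem listed above is one of the easier ones to repair after extraction. A minimal sketch, assuming the extracted text contains Unicode ligature code points such as U+FB01 ('fi'); it uses the standard library's NFKC normalisation, which also rewrites other compatibility characters (superscripts, full-width forms), so check that this is acceptable for your documents.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)."""
    # NFKC applies compatibility decomposition followed by canonical composition,
    # which maps U+FB01 -> "fi", U+FB02 -> "fl", U+FB03 -> "ffi".
    return unicodedata.normalize("NFKC", text)


sample = "The of\uFB01ce work\uFB02ow was e\uFB03cient."
print(expand_ligatures(sample))  # The office workflow was efficient.
```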
Raw PDF (50-page report): ~75,000 tokens
Extracted plain text (50-page report): ~32,000 tokens
Clean Markdown (50-page report): ~21,000 tokens
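The percentage savings implied by these three figures can be worked out directly. The sketch below only does the arithmetic on the numbers quoted in this guide; the helper name `reduction_pct` is illustrative.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction going from `before` tokens to `after` tokens."""
    return round(100 * (before - after) / before, 1)


raw_pdf, plain_text, markdown = 75_000, 32_000, 21_000

print(reduction_pct(raw_pdf, plain_text))   # 57.3
print(reduction_pct(raw_pdf, markdown))     # 72.0
print(reduction_pct(plain_text, markdown))  # 34.4
```

On these figures, most of the saving comes from extracting the text at all, and clean markdown removes roughly another third on top of plain extraction.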
Why markdown is the right format for LLMs
Markdown was designed for exactly what LLMs need: a lightweight way to express document structure using plain text. The syntax is the structure — no binary encoding, no embedded metadata, no layout coordinates.
Structure survives chunking
Heading hierarchy (H1 → H2 → H3) persists in every chunk. RAG retrieval can use heading context to improve relevance.
Tables are readable
Markdown tables are plain text. Models parse them accurately without column-interleaving artefacts from PDF extraction.
Code blocks are clean
Fenced code blocks with language tags (```python) tell the model exactly how to interpret the content.
Token-minimal syntax
A Markdown heading is 2–5 characters (#, ##). An HTML heading is 9+ characters (<h1>...</h1>). Across a 50-page document, the savings compound.
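The heading comparison above can be checked by counting the characters that are pure syntax. This is a small sketch that measures characters, not tokens; actual token counts depend on the tokeniser, so treat the numbers as a rough proxy.

```python
def syntax_overhead(markup: str, text: str) -> int:
    """Characters in the markup that are not the heading text itself."""
    return len(markup) - len(text)


title = "Quarterly results"

print(syntax_overhead(f"## {title}", title))        # 3
print(syntax_overhead(f"<h2>{title}</h2>", title))  # 9
```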
PDF-to-markdown tool comparison
Not all converters are equal. The output quality varies significantly by tool — and for RAG pipelines, output quality directly affects retrieval accuracy.
pymupdf4llm
Python library. Fast, good table support, native PDF parsing. Returns markdown with heading hierarchy preserved.
Best for: Python pipelines, batch processing
Marker
Python / ML. ML-based layout analysis. Best accuracy on complex layouts (academic papers, multi-column reports). Slower than pymupdf4llm.
Best for: Complex layouts, academic documents
Microsoft MarkItDown
Python / Open source. Converts PDF, DOCX, XLSX, PPTX, images, audio, YouTube URLs → markdown. Broad format support.
Best for: Multi-format pipelines (not just PDF)
Mistral Document AI
API / Cloud. OCR-based, handles scanned PDFs and images. High accuracy on non-native PDFs.
Best for: Scanned documents, handwritten content
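For the two Python libraries in this comparison, a conversion call can be sketched as below. The wrapper `pdf_to_markdown` and its `backend` parameter are illustrative names, not part of either library. The calls used are `pymupdf4llm.to_markdown()` and `MarkItDown().convert(...).text_content`; both packages must be installed separately, and PDF support in MarkItDown may require an optional extra depending on the version.

```python
def pdf_to_markdown(path: str, backend: str = "pymupdf4llm") -> str:
    """Convert a PDF file to a markdown string with the chosen library."""
    if backend == "pymupdf4llm":
        # Imported lazily so this module loads without the package installed.
        import pymupdf4llm

        return pymupdf4llm.to_markdown(path)
    if backend == "markitdown":
        from markitdown import MarkItDown

        return MarkItDown().convert(path).text_content
    raise ValueError(f"unknown backend: {backend!r}")


# Usage (assumes a local file named report.pdf exists):
# markdown = pdf_to_markdown("report.pdf")
# print(markdown[:500])
```

Running the same file through both backends and comparing the headings and tables is a quick way to do the verification step on your own documents.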
RAG chunking after markdown conversion
Markdown conversion is step one. For RAG pipelines, the converted markdown then needs to be chunked into segments for vector embedding. The chunking strategy significantly affects retrieval quality.
Fixed-size chunking
Pros
Simple to implement. Predictable chunk sizes for embedding models.
Cons
Splits mid-sentence or mid-section. Loses structural context. Lower precision.
Semantic / heading-based chunking
Pros
Chunks align to document sections. Heading context preserved. Better retrieval accuracy.
Cons
Variable chunk sizes. Requires markdown heading hierarchy.
Max-Min Semantic Chunking
Pros
Embeds text first, uses semantic similarity for boundaries. Highest retrieval precision. Reduces vectors per document.
Cons
Computationally expensive. Requires embedding model at chunk time.
Recommendation: Start with semantic/heading-based chunking at 256–512 tokens. Markdown's heading hierarchy makes this straightforward: split on H2/H3 boundaries first, then by token count within sections. This outperforms fixed-size chunking with minimal added complexity.
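The first half of that recommendation, splitting on H2/H3 boundaries, can be sketched with a regular expression. This is a minimal illustration rather than a full markdown parser: it assumes ATX headings (`##`, `###`) and does not skip lines inside fenced code blocks that happen to start with `##`.

```python
import re

# Matches exactly two or three '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")


def split_on_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, starting a new section at every H2 or H3."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []

    def flush() -> None:
        # Skip an empty preamble; keep everything else.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if HEADING.match(line):
            flush()
            heading, body = line.strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Report\n\nIntro.\n\n## Methods\n\nWe sampled.\n\n### Data\n\nTen files.\n"
for section_heading, section_body in split_on_headings(doc):
    print(repr(section_heading), repr(section_body))
```

Text before the first H2 (including the H1 title) comes back as a section with an empty heading, so nothing is silently dropped.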
The complete PDF → AI workflow
Convert PDF to clean markdown
Use a tool that preserves heading hierarchy, handles tables accurately, and strips formatting overhead. For in-browser work: SuperMD markitdown. For Python pipelines: pymupdf4llm or Marker. Verify the output — check that headings, tables, and code blocks came through correctly.
Clean and normalise the markdown
Remove repeated boilerplate (headers, footers, page numbers). Standardise heading levels — if the PDF has inconsistent heading styles, normalise to H1 → H2 → H3 hierarchy. Strip any remaining extraction artefacts (garbled characters, broken lines).
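One way to remove repeated boilerplate is to count how many pages each line appears on and drop the lines that recur on most of them. The sketch below assumes you have the extracted text one string per page; the function name, the `min_share` threshold, and the page-number pattern are illustrative choices. The pattern also removes any line that is only a number, so review the output if your documents contain bare numeric lines that matter.

```python
import re
from collections import Counter

# Bare page numbers such as "7", "Page 7", or "Page 7 of 50".
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


def strip_boilerplate(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages, plus bare page numbers."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    threshold = max(2, int(min_share * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line.strip())
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned


pages = [
    "Acme Corp\nAnnual Report\nRevenue grew.\nPage 1 of 3",
    "Acme Corp\nAnnual Report\nCosts fell.\nPage 2 of 3",
    "Acme Corp\nAnnual Report\nOutlook is stable.\n3",
]
print(strip_boilerplate(pages))
# ['Revenue grew.', 'Costs fell.', 'Outlook is stable.']
```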
Chunk using heading boundaries
Split the document into sections at heading boundaries. Each section becomes one or more chunks. If a section exceeds 512 tokens, split further using sentence boundaries. Keep the parent heading as context prefix for each sub-chunk: '## Section Name\n\n[chunk content]'.
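The sub-chunking rule in this step can be sketched as follows. It is a simplified illustration: whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokeniser), the sentence splitter is a basic regex, the heading prefix is not counted against the limit, and rejoining sentences with spaces flattens line breaks, so it suits prose sections better than tables.

```python
import re


def chunk_section(heading: str, body: str, max_tokens: int = 512) -> list[str]:
    """Split a section at sentence boundaries, prefixing each chunk with its heading."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in sentences:
        n = len(sentence.split())  # rough token proxy
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return [f"{heading}\n\n{chunk}" for chunk in chunks]


parts = chunk_section(
    "## Methods",
    "One two three. Four five six. Seven eight.",
    max_tokens=6,
)
print(parts)
# ['## Methods\n\nOne two three. Four five six.', '## Methods\n\nSeven eight.']
```

A single sentence longer than `max_tokens` is kept whole rather than cut mid-sentence.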
Embed and index
Embed each chunk using your model (text-embedding-3-small, Gemini text-embedding, Voyage-3, etc.). Store in a vector database (Pinecone, Weaviate, Qdrant, pgvector). Include metadata: source document, page range, section heading.
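The shape of an indexed record, vector plus metadata, can be sketched without committing to a particular provider. In the example below, `toy_embed` is a deliberately crude stand-in (a hashed bag of words) and is not any real embedding model; in practice you would replace it with a call to your chosen model and write the records to your vector database. `ChunkRecord` and `index_chunks` are hypothetical names.

```python
import math
import zlib
from dataclasses import dataclass, field


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class ChunkRecord:
    text: str               # the markdown chunk itself
    source: str             # source document
    pages: tuple[int, int]  # page range (first, last)
    heading: str            # section heading
    vector: list[float] = field(default_factory=list)


def index_chunks(chunks: list[dict]) -> list[ChunkRecord]:
    """Attach a vector to each chunk and keep its metadata alongside."""
    return [
        ChunkRecord(c["text"], c["source"], c["pages"], c["heading"], toy_embed(c["text"]))
        for c in chunks
    ]


records = index_chunks([
    {
        "text": "## Pricing\n\nThe basic plan costs ten dollars.",
        "source": "report.pdf",
        "pages": (3, 4),
        "heading": "Pricing",
    }
])
print(records[0].heading, records[0].pages, len(records[0].vector))
```

Keeping the section heading and page range in the record lets you cite the source passage when the chunk is retrieved.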
Retrieve and inject as markdown
At query time, retrieve top-k relevant chunks. Inject them into your LLM prompt as markdown — preserve the heading and bullet structure. The model reads it more accurately than plain text and uses the hierarchy to understand content relationships.
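Retrieval and prompt assembly can be sketched in a few lines. This example assumes the chunk vectors and the query vector already exist (here they are hand-written two-dimensional vectors for illustration) and ranks by cosine similarity in memory; a vector database would do the ranking for you. The prompt wording and the function names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], records: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    """records is a list of (vector, markdown_chunk) pairs."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks as markdown, leaving headings and bullets intact."""
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


records = [
    ([1.0, 0.0], "## Pricing\n\n- Basic: $10 per month"),
    ([0.0, 1.0], "## Support\n\nEmail only."),
    ([0.7, 0.7], "## Overview\n\nPlans and help."),
]
hits = top_k([1.0, 0.1], records, k=2)
print(build_prompt("How much is the basic plan?", hits))
```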
Frequently asked questions
Why is markdown better than PDF for AI?
PDFs carry extensive non-content overhead: repeated headers and footers on every page, embedded font metadata, layout coordinates, ligature artefacts, and binary encoding. An LLM pays tokens for all of it. Markdown carries only content (headings, paragraphs, tables, code) with minimal syntax overhead. A 50-page PDF that costs 75,000 tokens becomes approximately 21,000 tokens as clean markdown, a 72% reduction.
What is the best PDF to markdown converter for LLMs?
For in-browser use with LLM-optimised output (model-specific profiles, token savings display): SuperMD markitdown. For Python-based pipelines: pymupdf4llm (fast, good table support) or Marker (ML-based, highest accuracy on complex layouts). For Microsoft Office files in addition to PDF: Microsoft's MarkItDown library (open source). For enterprise with OCR: Mistral Document AI.
What is the optimal chunk size for RAG after PDF-to-markdown conversion?
The optimal RAG chunk size after markdown conversion is 128–512 tokens, with 0–15% overlap. Chunks of 200–400 tokens give the best balance of retrieval precision and context richness. Larger chunks (800+ tokens) reduce precision, because each retrieved chunk carries more noise alongside the relevant content. Semantic chunking (splitting on heading boundaries and topic shifts rather than fixed token counts) outperforms fixed-size chunking.
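Overlap is easiest to see in a fixed-size sliding window. The sketch below works on a pre-tokenised list and expresses overlap as a fraction of the chunk size; the function name and defaults are illustrative, and the same overlap idea can be applied within heading-based sections.

```python
def sliding_chunks(tokens: list[str], size: int = 300, overlap: float = 0.10) -> list[list[str]]:
    """Fixed-size windows where each chunk repeats the tail of the previous one."""
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap must be in [0, 1)")
    # Advance by the chunk size minus the overlapping tokens.
    step = max(1, size - int(size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [str(i) for i in range(10)]
for chunk in sliding_chunks(tokens, size=4, overlap=0.25):
    print(chunk)
# ['0', '1', '2', '3']
# ['3', '4', '5', '6']
# ['6', '7', '8', '9']
```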
How much does PDF-to-markdown conversion reduce token costs?
Token reduction from PDF-to-markdown conversion ranges from 40% to 95% depending on the PDF. Simple text-heavy documents: 40–60% reduction. Reports with tables and headers: 60–75% reduction. Scanned or visually complex documents: up to 95% after OCR extraction and cleanup. For RAG applications, the reduction also improves retrieval accuracy since semantic structure is preserved.
Does converting PDF to markdown lose any information?
Well-executed PDF-to-markdown conversion preserves all semantically meaningful content: headings, paragraphs, tables, lists, code blocks, and image captions. What is intentionally discarded is non-semantic overhead: page numbers, repeated headers/footers, layout coordinates, font metadata, and visual formatting. For AI consumption, this discarded content was noise — the model cannot use it meaningfully anyway.