// markitdown
Convert any file to LLM-optimized markdown.
Drop a PDF, DOCX, XLSX, or image. SuperMD strips the noise, applies your chosen model's format, and shows you exactly how many tokens you saved — all in your browser.
Drop your file
PDF, DOCX, XLSX, or image. Up to 5 MB. Processed entirely in your browser.
Pick your model
Claude, GPT-4o, or Gemini. Each gets a different format tuned to how it reads.
Get clean markdown
See tokens saved, copy to clipboard, or download the .md file — ready to paste.
Accepted files: PDF, DOCX, XLSX, CSV, JPG, PNG. Up to 5 MB.
Claude profile: XML-structured output with <document> tags, optimized for Claude's 200K context.
Your file never leaves your browser. Conversion runs locally — no upload, no server.
// use cases
Who uses markitdown?
Building RAG pipelines
Convert PDFs, DOCX files, and spreadsheets into clean, chunked markdown that is ready to embed and store in Pinecone, Weaviate, or any other vector database. The token-optimized output reduces embedding costs by up to 60%.
Feeding context to Claude or ChatGPT
Instead of uploading raw files and hoping the model extracts what it needs, paste clean markdown directly. Less noise means the model focuses on your actual content, not structural artefacts.
Processing research papers and reports
Academic PDFs repeat headers, footers, and citations across every page. markitdown strips the repetition and gives you a clean linear text — ready to summarise, analyse, or query.
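As a rough illustration of the repetition stripping described above, here is a minimal Python sketch. It is not markitdown's actual implementation: the function name `strip_repeated_lines`, the `min_share` threshold, and the digit-masking trick for page numbers are all assumptions made for this example.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Mask digits so "Page 1" and "Page 2" count as the same repeated line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> str:
    """Drop lines that recur on at least `min_share` of pages (hypothetical helper)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({_key(line) for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {key for key, n in counts.items() if n >= threshold}
    kept = [
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.strip() and _key(line) not in repeated
    ]
    return "\n".join(kept)


pages = [
    "Journal of Examples\nIntroduction text.\nPage 1",
    "Journal of Examples\nMethods text.\nPage 2",
    "Journal of Examples\nResults text.\nPage 3",
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

In this toy input the running header and the page-number lines appear on every page, so only the three body lines remain.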
Preparing data for fine-tuning
Training datasets need clean, consistent text. Convert a folder of DOCX or PDF documents to uniform markdown, then feed them into your fine-tuning pipeline without manual cleanup.
// supported formats
Every major file type, one tool.
Text extraction from any PDF. Strips page numbers, headers, and footers that repeat on every page.
Up to 58% fewer tokens
Full Microsoft Word support. Preserves headings, bold, tables, and lists. Strips XML markup.
Up to 41% fewer tokens
Spreadsheets become clean markdown tables. Multi-sheet XLSX files get one section per sheet.
Up to 63% fewer tokens
OCR extracts text from JPG, PNG, WebP, and TIFF. Runs entirely in your browser via WebAssembly.
Up to 34% fewer tokens
// model profiles
Why does formatting matter per model?
Each LLM was trained on different data and has different preferences for how context is structured. Using the wrong format doesn't cause failure, but it wastes tokens on structure the model has to ignore.
XML tags like <document> and <section> match how Claude was trained to parse long-context documents. Anthropic recommends this structure in their own prompt engineering guide.
Best for: Long documents, RAG, structured analysis
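To make the Claude profile concrete, the sketch below wraps converted sections in `<document>` and `<section>` tags. It only illustrates the idea and is not SuperMD's actual output format: the function name `wrap_for_claude` and the `title` and `heading` attributes are hypothetical.

```python
def wrap_for_claude(title: str, sections: list[tuple[str, str]]) -> str:
    """Wrap (heading, body) pairs in XML-style tags (hypothetical layout).

    The result is prompt structure, not strict XML, so the text is not escaped.
    """
    lines = [f'<document title="{title}">']
    for heading, body in sections:
        lines.append(f'<section heading="{heading}">')
        lines.append(body.strip())
        lines.append("</section>")
    lines.append("</document>")
    return "\n".join(lines)


out = wrap_for_claude(
    "Q3 report",
    [("Summary", "Revenue grew.\n"), ("Risks", "Supply delays.")],
)
print(out)
```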
YAML frontmatter and standard ATX headings (## H2) match GPT-4o's markdown training. XML tags add noise. Aggressive empty-line stripping saves tokens without losing structure.
Best for: Chat completion, code tasks, summarisation
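The GPT-4o profile can be sketched in the same spirit. The snippet below is an assumption about what such a formatter might look like, not the tool's real code: `format_for_gpt` and the `title` and `source` frontmatter keys are hypothetical, and "empty-line stripping" is interpreted here as collapsing runs of blank lines into one so that paragraphs stay separated.

```python
import json
import re


def format_for_gpt(title: str, source: str, markdown: str) -> str:
    """Prepend YAML frontmatter and collapse blank-line runs (hypothetical helper)."""
    # Collapse two or more consecutive line breaks (with optional whitespace) into one blank line.
    body = re.sub(r"\n\s*\n+", "\n\n", markdown.strip())
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    front = f"---\ntitle: {json.dumps(title)}\nsource: {json.dumps(source)}\n---"
    return f"{front}\n\n{body}\n"


raw_md = "## Summary\n\n\n\nRevenue grew.\n\n\n## Risks\nSupply delays.\n"
formatted = format_for_gpt("Q3 report", "report.docx", raw_md)
print(formatted)
```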
With a 1M token context window, chunking is rarely needed. Clean prose with consistent heading hierarchy is sufficient — Gemini handles long continuous documents better than most models.
Best for: Very long documents, whole-codebase analysis
// faq
Frequently asked questions
Does my file get uploaded to a server?
No. The free tier runs entirely in your browser using WebAssembly. Your files are never sent to our servers. This is especially important for confidential documents like financial reports or legal contracts.
How much can I realistically save in tokens?
It depends on the file type and content. PDFs with repeated headers and footers across many pages typically save 40–60%. Spreadsheets with redundant column labels save 40–65%. Plain DOCX documents save 20–40%.
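The savings figure is a simple ratio of token counts before and after conversion. The sketch below shows that arithmetic under a loud assumption: it estimates tokens as roughly four characters each, which is a common rule of thumb for English text and not how any particular model's tokenizer works. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token. Real tokenizers will differ.
    return max(1, round(len(text) / 4))


def percent_saved(raw: str, cleaned: str) -> float:
    """Percentage of estimated tokens removed by the conversion."""
    before = estimate_tokens(raw)
    after = estimate_tokens(cleaned)
    return 100 * (before - after) / before


# A 4,000-character extraction reduced to 1,800 characters of markdown.
saving = percent_saved("x" * 4000, "x" * 1800)
print(f"{saving:.0f}% fewer tokens")
```

For exact numbers, swap `estimate_tokens` for the tokenizer of the model you are targeting.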
What file size is supported?
The free tier supports files up to 5 MB, which covers most documents. A typical 50-page PDF is under 2 MB. Images and scanned documents may be larger — a Pro tier with 50 MB support is coming soon.
Can I use the markdown output in any LLM tool?
Yes. The output is plain markdown that works everywhere — Claude, ChatGPT, Gemini, Perplexity, Cursor, Copilot, and any RAG framework like LangChain or LlamaIndex.
What is RAG-ready chunking?
Retrieval-Augmented Generation (RAG) retrieves the most relevant passages from a vector index and passes them to the model as context. To build that index, long documents are first split into smaller, often overlapping chunks. markitdown can split your output at semantic boundaries (headings, paragraphs) and export JSON with per-chunk metadata for direct Pinecone or Weaviate ingestion.
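A minimal sketch of heading-based chunking with JSON export follows. It is illustrative only: the function name `chunk_markdown` and the metadata fields (`id`, `heading`, `text`, `chars`) are assumptions rather than markitdown's actual schema, and chunk overlap is left out to keep the example short.

```python
import json
import re


def chunk_markdown(markdown: str, source: str) -> list[dict]:
    """Split markdown at ATX headings and attach simple metadata (hypothetical schema)."""
    chunks: list[dict] = []
    heading = None
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk, skipping empty buffers.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "id": f"{source}#{len(chunks)}",
                "heading": heading,
                "text": text,
                "chars": len(text),
            })
        buffer.clear()

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()  # close the previous chunk before starting a new section
            heading = line.lstrip("#").strip()
        buffer.append(line)
    flush()
    return chunks


md = "Intro line.\n\n## Summary\nRevenue grew.\n\n## Risks\nSupply delays.\n"
chunks = chunk_markdown(md, "report.md")
print(json.dumps(chunks, indent=2))
```

Each resulting object can then be embedded and upserted with whatever client your vector database provides. The field names would need to be mapped to that database's own record format.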
Is this the same as Microsoft's markitdown?
No. Microsoft released a Python library also called markitdown. SuperMD's markitdown is a browser-based tool with LLM-specific profiles, token savings display, and RAG-ready output — features the Python library doesn't have.