markitdown by SuperMD

// markitdown

Convert any file to LLM-optimized markdown.

Drop a PDF, DOCX, XLSX, or image. SuperMD strips the noise, applies your chosen model's format, and shows you exactly how many tokens you saved — all in your browser.

01

Drop your file

PDF, DOCX, XLSX, or image. Up to 5 MB. Processed entirely in your browser.

02

Pick your model

Claude, GPT-4o, or Gemini. Each gets a different format tuned to how it reads.

03

Get clean markdown

See tokens saved, copy to clipboard, or download the .md file — ready to paste.

Output profile

XML-structured output with <document> tags — optimized for Claude's 200K context.

Your file never leaves your browser. Conversion runs locally — no upload, no server.

// use cases

Who uses markitdown?

Building RAG pipelines

Convert PDFs, DOCX, and spreadsheets into clean, chunked markdown ready to embed into Pinecone, Weaviate, or any vector database. The token-optimized output reduces embedding costs by up to 60%.

LangChain, LlamaIndex, Pinecone

Feeding context to Claude or ChatGPT

Instead of uploading raw files and hoping the model extracts what it needs, paste clean markdown directly. Less noise means the model focuses on your actual content, not structural artefacts.

Claude, GPT-4o, Gemini

Processing research papers and reports

Academic PDFs repeat headers, footers, and citations across every page. markitdown strips the repetition and gives you a clean linear text — ready to summarise, analyse, or query.

PDF, Research, Summarisation
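The repetition stripping described above can be sketched roughly as follows. This is an illustrative Python sketch, not markitdown's actual implementation: the function name `strip_repeated_lines`, the `min_fraction` threshold, and the digit-masking trick are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> str:
    """Drop lines that recur on at least `min_fraction` of the pages.

    `pages` holds the extracted text of each page (hypothetical input shape).
    Digits are masked before counting, so "Page 3 of 12" and "Page 4 of 12"
    are treated as the same line.
    """

    def normalise(line: str) -> str:
        return re.sub(r"\d+", "#", line.strip())

    page_lines = [page.splitlines() for page in pages]

    counts: Counter[str] = Counter()
    for lines in page_lines:
        # Count each distinct non-blank line once per page.
        counts.update({normalise(line) for line in lines if line.strip()})

    # A line must appear on at least two pages to count as repeated.
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    kept: list[str] = []
    for lines in page_lines:
        kept.extend(line for line in lines if normalise(line) not in repeated)
    return "\n".join(kept)
```

A frequency threshold like this is a blunt instrument: it can also remove legitimate content that happens to recur on most pages, such as a repeated table header, so a real tool would likely combine it with position on the page.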

Preparing data for fine-tuning

Training datasets need clean, consistent text. Convert a folder of DOCX or PDF documents to uniform markdown, then feed them into your fine-tuning pipeline without manual cleanup.

DOCX, CSV, Training data

// supported formats

Every major file type, one tool.

PDF

Text extraction from any PDF. Strips page numbers, headers, and footers that repeat on every page.

Up to 58% fewer tokens

DOCX

Full Microsoft Word support. Preserves headings, bold, tables, and lists. Strips XML markup.

Up to 41% fewer tokens

XLSX / CSV

Spreadsheets become clean markdown tables. Multi-sheet XLSX files get one section per sheet.

Up to 63% fewer tokens

Images

OCR extracts text from JPG, PNG, WebP, and TIFF. Runs entirely in your browser via WebAssembly.

Up to 34% fewer tokens
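To make the spreadsheet conversion above concrete, here is a minimal Python sketch of turning sheets into markdown tables, one section per sheet. The function names, the `## ` heading level, and the input shape (a dict of sheet name to rows, first row as header) are assumptions for illustration; the tool's real output may differ.

```python
def sheet_to_markdown(name: str, rows: list[list[object]]) -> str:
    """Render one sheet as a heading plus a pipe table (first row = header)."""
    if not rows:
        return f"## {name}\n\n(empty sheet)"

    width = max(len(row) for row in rows)

    def cell(value: object) -> str:
        text = "" if value is None else str(value)
        # Escape pipes and flatten newlines so a cell cannot break the table.
        return text.replace("|", "\\|").replace("\n", " ")

    def line(row: list[object]) -> str:
        # Pad short rows so every row has the same number of columns.
        padded = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(cell(v) for v in padded) + " |"

    header, *body = rows
    out = [f"## {name}", "", line(header)]
    out.append("| " + " | ".join(["---"] * width) + " |")
    out.extend(line(row) for row in body)
    return "\n".join(out)


def workbook_to_markdown(sheets: dict[str, list[list[object]]]) -> str:
    """Join all sheets, one section per sheet, in insertion order."""
    return "\n\n".join(sheet_to_markdown(n, r) for n, r in sheets.items())
```

For example, `workbook_to_markdown({"Q1": [["Region", "Sales"], ["North", 120]]})` produces a `## Q1` heading followed by a two-column table.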

// model profiles

Why does formatting matter per model?

Each LLM was trained on different data and has different preferences for how context is structured. Using the wrong format doesn't cause failure; it just wastes tokens on structure the model has to ignore.

🟣 Claude (200K tokens)

XML tags like <document> and <section> match how Claude was trained to parse long-context documents. Anthropic recommends this structure in their own prompt engineering guide.

Best for: Long documents, RAG, structured analysis
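As a rough illustration of the XML-structured profile, the Python sketch below wraps a markdown document in `<document>` and `<section>` tags, splitting on level-2 headings. The exact tag layout, the `source` and `title` attributes, and the function name `to_xml_profile` are assumptions; the source only states that `<document>` and `<section>` tags are used.

```python
import re
from xml.sax.saxutils import quoteattr


def to_xml_profile(markdown: str, source: str) -> str:
    """Wrap markdown in <document>/<section> tags, one section per '## ' heading."""
    # With a capturing group, re.split returns
    # [preamble, title1, body1, title2, body2, ...].
    parts = re.split(r"(?m)^## +(.+)$", markdown)
    preamble, rest = parts[0].strip(), parts[1:]

    out = [f"<document source={quoteattr(source)}>"]
    if preamble:
        out.append(preamble)
    for title, body in zip(rest[0::2], rest[1::2]):
        out.append(f"<section title={quoteattr(title.strip())}>")
        # The body is left unescaped: it is prompt text, not strict XML.
        out.append(body.strip())
        out.append("</section>")
    out.append("</document>")
    return "\n".join(out)
```

Only the attribute values are escaped here. Escaping the body would protect against stray angle brackets but would also change the text the model sees, so the sketch leaves that as a design choice.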

🟢 GPT-4o (128K tokens)

YAML frontmatter and standard ATX headings (## H2) match GPT-4o's markdown training. XML tags add noise. Aggressive empty-line stripping saves tokens without losing structure.

Best for: Chat completion, code tasks, summarisation
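A minimal Python sketch of what this profile might look like: YAML frontmatter on top, followed by the markdown body with trailing spaces removed and runs of blank lines collapsed. The function name `to_gpt_profile`, the metadata keys, and the exact stripping rules are assumptions, not the tool's documented behaviour.

```python
import json
import re


def to_gpt_profile(markdown: str, meta: dict[str, str]) -> str:
    """Prefix markdown with YAML frontmatter and collapse extra blank lines."""
    # JSON-quoted strings are valid YAML double-quoted scalars, which avoids
    # problems with colons or '#' characters inside values.
    front = ["---"]
    front += [f"{key}: {json.dumps(value)}" for key, value in meta.items()]
    front.append("---")

    # Remove trailing spaces and tabs on every line.
    body = re.sub(r"[ \t]+$", "", markdown, flags=re.M)
    # Collapse two or more consecutive blank lines into one.
    body = re.sub(r"\n{3,}", "\n\n", body).strip()

    return "\n".join(front) + "\n\n" + body
```

Note that this naive collapse also applies inside fenced code blocks, where blank lines may be meaningful; a more careful version would skip fenced regions.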

🔵 Gemini (1M tokens)

With a 1M token context window, chunking is rarely needed. Clean prose with consistent heading hierarchy is sufficient — Gemini handles long continuous documents better than most models.

Best for: Very long documents, whole-codebase analysis

// faq

Frequently asked questions

Does my file get uploaded to a server?

No. The free tier runs entirely in your browser using WebAssembly. Your files are never sent to our servers. This is especially important for confidential documents like financial reports or legal contracts.

How much can I realistically save in tokens?

It depends on the file type and content. PDFs with repeated headers and footers across many pages typically save 40–60%. Spreadsheets with redundant column labels save 40–65%. Plain DOCX documents save 20–40%.
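For a back-of-the-envelope check of your own files, the savings figure is just the relative drop in token count. The Python sketch below uses a rough heuristic of about four characters per token for English text, which is an assumption: real counts depend on each model's tokenizer, and the function names are hypothetical.

```python
import math


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (assumed heuristic)."""
    return math.ceil(len(text) / 4)


def percent_saved(raw: str, cleaned: str) -> float:
    """Percentage of estimated tokens removed by cleaning, rounded to 0.1."""
    before = estimate_tokens(raw)
    after = estimate_tokens(cleaned)
    if before == 0:
        return 0.0  # Nothing to save on an empty input.
    return round(100 * (before - after) / before, 1)
```

For an exact number, swap `estimate_tokens` for the tokenizer that matches your target model.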

What file size is supported?

The free tier supports files up to 5 MB, which covers most documents. A typical 50-page PDF is under 2 MB. Images and scanned documents may be larger — a Pro tier with 50 MB support is coming soon.

Can I use the markdown output in any LLM tool?

Yes. The output is plain markdown that works everywhere — Claude, ChatGPT, Gemini, Perplexity, Cursor, Copilot, and any RAG framework like LangChain or LlamaIndex.

What is RAG-ready chunking?

RAG (Retrieval-Augmented Generation) retrieves the passages most relevant to a query from a vector index and passes them to the model as context. To build that index, long documents are split into smaller, often overlapping chunks. markitdown can split your output at semantic boundaries (headings, paragraphs) and export JSON with per-chunk metadata for direct Pinecone or Weaviate ingestion.
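The idea of splitting at semantic boundaries can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name `chunk_markdown`, the `max_chars` budget, and the JSON field names (`id`, `text`, `metadata`) are hypothetical, and the sketch does not add overlap between chunks.

```python
import json
import re


def chunk_markdown(markdown: str, source: str, max_chars: int = 1200) -> list[dict]:
    """Split markdown at headings, then pack paragraphs up to `max_chars`."""
    # Zero-width split before each ATX heading, so headings start a section.
    pieces = re.split(r"(?m)^(?=#{1,6} )", markdown)
    sections = [p.strip() for p in pieces if p.strip()]

    chunks: list[tuple[str, str]] = []
    for section in sections:
        first_line = section.splitlines()[0]
        heading = first_line.lstrip("#").strip() if section.startswith("#") else ""
        buffer = ""
        for para in re.split(r"\n\s*\n", section):
            # Flush the buffer when adding this paragraph would exceed the budget.
            # A single oversized paragraph is kept whole rather than cut mid-text.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append((heading, buffer))
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        if buffer:
            chunks.append((heading, buffer))

    return [
        {
            "id": f"{source}#{i}",
            "text": text,
            "metadata": {"source": source, "heading": heading, "chars": len(text)},
        }
        for i, (heading, text) in enumerate(chunks)
    ]


if __name__ == "__main__":
    demo = "# A\nfirst para\n\nsecond para\n\n## B\nthird"
    print(json.dumps(chunk_markdown(demo, "doc.pdf", max_chars=20), indent=2))
```

Each record carries its section heading in the metadata, which helps a retriever show where a chunk came from. The field names would need mapping to whatever schema your vector database expects.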

Is this the same as Microsoft's markitdown?

No. Microsoft released a Python library also called markitdown. SuperMD's markitdown is a browser-based tool with LLM-specific profiles, token savings display, and RAG-ready output — features the Python library doesn't have.