Upload any PDF, image, or Office document. Get clean Markdown & JSON — optimized for RAG pipelines, AI agents, and LLM workflows.
const result = await datascrub.parse("./report.pdf", {
  output: "markdown",        // or "json" | "chunks"
  chunkSize: 512,            // auto-chunk for RAG
  extractTables: true,       // structured table output
  languages: ["en", "zh"]    // best CJK support
});
// → Clean Markdown + metadata + chunks
console.log(result.markdown);

Stop wrestling with PDF parsers. Start building your AI product.
Process a 100-page PDF in under 10 seconds. Parallel page extraction with GPU-accelerated OCR.
Auto-chunking, metadata enrichment, and embedding-ready JSON. Skip the preprocessing pipeline.
Industry-leading Chinese, Japanese, and Korean parsing. Mixed-language documents handled natively.
Complex tables, merged cells, multi-page tables — all converted to structured Markdown or JSON arrays.
REST API + Node/Python SDKs. Drop-in replacement for LlamaParse or Unstructured.
Documents processed in memory, never stored. Full audit trail. Enterprise-ready from day one.
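To make the auto-chunking idea concrete, here is a minimal sketch of paragraph-aware chunking over parsed Markdown. It is an illustration only, not DataScrub's actual algorithm: `chunkMarkdown` is a hypothetical helper, and measuring `maxChars` in characters is an assumption, since the page does not say whether `chunkSize` counts characters or tokens.

```javascript
// Hypothetical helper, not part of the DataScrub SDK.
// Splits Markdown into chunks of at most maxChars characters,
// breaking on blank lines so paragraphs stay together where possible.
function chunkMarkdown(markdown, maxChars = 512) {
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";

  for (const para of paragraphs) {
    // A paragraph longer than the limit is hard-split (this may cut mid-word).
    if (para.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      continue;
    }

    // Otherwise, pack paragraphs greedily until the next one would overflow.
    const candidate = current ? current + "\n\n" + para : para;
    if (candidate.length > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }

  if (current) chunks.push(current);

  // Attach simple metadata so each chunk is ready for embedding.
  return chunks.map((text, index) => ({ index, text, length: text.length }));
}

// Usage: a tiny document and a small limit to show the packing behavior.
const sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph.";
const chunks = chunkMarkdown(sample, 30);
console.log(chunks.map((c) => c.text));
```

Splitting on blank lines keeps headings attached to the paragraph that follows them when both fit within the limit, which tends to give retrieval more useful context than fixed-width slicing.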
| Tool | CJK / Chinese | Tables | RAG Output | Pricing | API Developer Experience |
|---|---|---|---|---|---|
| DataScrub | ✅ Best | ✅ | ✅ Built-in | $49/mo | ✅ 3 lines |
| LlamaParse | ⚠️ Weak | ⚠️ | ✅ LlamaIndex only | $0.003/pg | ✅ |
| Unstructured | ⚠️ Basic | ⚠️ | ❌ | $0.01/pg | ⚠️ Complex |
| MinerU | ✅ Best | ✅ | ❌ No API | Self-host | ❌ |
| Reducto | ❌ | ✅ | ❌ | $0.01/pg | ✅ |
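For a rough sense of when a flat plan beats per-page billing, the sketch below computes the monthly page volume at which each per-page price in the table reaches $49. The prices come from the table; the helper `breakEvenPages` is hypothetical, and the result only matters if your volume fits within the flat plan's page limit, which the table does not state.

```javascript
// Prices from the comparison table, expressed in mills (1/1000 USD)
// so the arithmetic stays in integers.
const FLAT_MONTHLY_MILLS = 49000; // $49/mo
const PER_PAGE_MILLS = {
  LlamaParse: 3, // $0.003/pg
  Unstructured: 10, // $0.01/pg
  Reducto: 10, // $0.01/pg
};

// Hypothetical helper: the smallest monthly page count at which
// per-page billing costs at least as much as the flat plan.
function breakEvenPages(flatMonthlyMills, perPageMills) {
  return Math.ceil(flatMonthlyMills / perPageMills);
}

for (const [tool, price] of Object.entries(PER_PAGE_MILLS)) {
  console.log(`${tool}: ${breakEvenPages(FLAT_MONTHLY_MILLS, price)} pages/mo`);
}
// LlamaParse: 16334 pages/mo
// Unstructured: 4900 pages/mo
// Reducto: 4900 pages/mo
```

Below those volumes, per-page billing is the cheaper option on price alone.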
No per-page anxiety. Flat monthly plans with generous limits.
100 pages/mo
5,000 pages/mo
25,000 pages/mo
100 free pages every month. No credit card required. See results in 10 seconds.
Try DataScrub Free