kreuzberg-dev/kreuzberg
Composer 安装命令:
pie install kreuzberg-dev/kreuzberg
包简介
High-performance document intelligence library
README 文档
README
Extract text, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages at native speeds without needing a GPU.
What and Why?
Kreuzberg is a document-intelligence framework with a Rust core and native bindings for 16 languages. It turns documents, images, audio, and source code into clean, structured text — extracting tables, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages.
Modern AI and RAG pipelines need fast, reliable extraction without a GPU or a stack of heavyweight dependencies. Kreuzberg delivers that from a single Rust core: SIMD-accelerated parsing, pure-Rust PDF, streaming for multi-GB files, and consistent output across every binding. Run it as a library, CLI, REST API, or MCP server.
OCR (Tesseract, PaddleOCR, EasyOCR, and VLM across 143 vision providers), Whisper audio/video transcription, chunking, language detection, embeddings, and structured LLM extraction are all built in.
Features
| Feature | Description |
|---|---|
| 96 file formats | PDF, Office, images, HTML/XML, email, archives, and academic formats across 8 categories |
| 306 languages | Code intelligence — functions, classes, imports, symbols, docstrings — via tree-sitter |
| Polyglot | Native bindings for Rust, Python, Node.js, WebAssembly, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C |
| OCR | Tesseract (incl. WASM), PaddleOCR, EasyOCR, and VLM OCR across 143 vision providers — extensible via plugins |
| Transcription | Whisper ONNX transcripts for MP3, M4A, WAV, WebM, and MP4 audio tracks |
| LLM intelligence | Structured JSON extraction, embeddings, and VLM OCR through liter-llm, including local engines |
| Deployment | Use as a library, CLI tool, REST API server, or MCP server |
| High performance | Rust core with pure-Rust PDF, SIMD optimizations, full parallelism, and streaming for multi-GB files |
| Token-efficient output | TOON wire format uses ~30–50% fewer tokens than JSON for LLM/RAG pipelines |
| Extensible | Plugin system for custom OCR backends, validators, post-processors, extractors, and renderers |
Supported Formats
96 file formats across 8 categories — Office documents, images (OCR-enabled), web and structured data, email, archives, academic, and audio/video — plus code intelligence for 306 programming languages. See the format reference for the complete list.
⭐ Star this repo to show your support — it helps others discover Kreuzberg.
Quick Start
Language Packages
Java
Available on Maven Central as dev.kreuzberg:kreuzberg. See Java README for the dependency snippet and current version.
Elixir
Add {:kreuzberg, "~> 5.0"} to your mix.exs dependencies. See Elixir README for full documentation.
R
Install from r-universe. See R README for full documentation.
Kotlin (Android)
Available on Maven Central as dev.kreuzberg:kreuzberg-android. See Kotlin README for the dependency snippet and current version.
Swift
Add via Swift Package Manager. See Swift README for full documentation.
Zig
Add via zig fetch. See Zig README for full documentation.
C/C++ (FFI)
Build from source as part of this workspace. See C (FFI) README for full documentation.
Docker
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
See Docker guide for API, CLI, and MCP server modes.
AI Coding Assistants
Install the Kreuzberg plugin from the kreuzberg-dev/plugins marketplace. It ships the Kreuzberg agent skills (extraction APIs, OCR backends, configuration, language conventions) and works with every major coding agent — expand your harness below.
Claude Code
/plugin marketplace add kreuzberg-dev/plugins
/plugin install kreuzberg@kreuzberg
Codex CLI
/plugins add https://github.com/kreuzberg-dev/plugins
Then search for kreuzberg and select Install Plugin.
Cursor
Settings → Plugins → Add from URL → https://github.com/kreuzberg-dev/plugins, then select kreuzberg.
Gemini CLI
gemini extensions install https://github.com/kreuzberg-dev/plugins
Factory Droid
droid plugin marketplace add https://github.com/kreuzberg-dev/plugins
droid plugin install kreuzberg@kreuzberg
GitHub Copilot CLI
copilot plugin marketplace add https://github.com/kreuzberg-dev/plugins
copilot plugin install kreuzberg@kreuzberg
opencode
Add the package to opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"plugin": ["@kreuzberg/opencode-kreuzberg"]
}
Documentation
Full guides, API references for every binding, and the complete format and configuration reference live at kreuzberg.dev. Try it in the browser with the live demo.
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Join our Discord community for questions and discussion.
Part of Kreuzberg.dev
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
License
Elastic License 2.0 (ELv2) — see LICENSE for details. See the Elastic License for the full text.
统计信息
- 总下载量: 0
- 月度下载量: 0
- 日度下载量: 0
- 收藏数: 0
- 点击次数: 2
- 依赖项目数: 0
- 推荐数: 0
其他信息
- 授权协议: Elastic-2.0
- 更新时间: 2026-06-21