nuglif/chunkytown
Composer install command:
composer require nuglif/chunkytown
Package description
French paragraph/list/sentence-aware text chunker for RAG pipelines.
README
Text chunker for French RAG pipelines. Splits a document into overlapping spans that respect natural boundaries (paragraphs, lists, sentences, words) instead of cutting in the middle of a sentence — and never inside a composed emoji or accented grapheme.
Built on PHP's intl extension (ICU), with a French‑tuned abbreviation
filter so common shortcuts (M. Dupont, St. Patrick, cf., etc.)
are not mistaken for sentence ends.
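The idea of filtering ICU sentence breaks can be sketched with `IntlBreakIterator` from the intl extension. This is a minimal illustration, not chunkytown's actual implementation: the function name `filteredSentenceEnds()` and the short abbreviation list are hypothetical, and the offsets here are UTF-8 byte offsets as returned by ICU, not the grapheme offsets the library reports.

```php
<?php
/**
 * Hypothetical helper (not part of chunkytown): return ICU sentence-end
 * offsets, dropping breaks that directly follow a known abbreviation.
 *
 * @param string[] $abbreviations assumed sample list, not the library's
 * @return int[] UTF-8 byte offsets of the kept sentence ends
 */
function filteredSentenceEnds(string $text, array $abbreviations = ['M.', 'St.', 'cf.', 'etc.']): array
{
    $it = IntlBreakIterator::createSentenceInstance('fr');
    $it->setText($text);
    $it->first();

    $ends = [];
    $length = strlen($text);
    while (($pos = $it->next()) !== IntlBreakIterator::DONE) {
        // Look at the last whitespace-delimited token before the break.
        $before = rtrim(substr($text, 0, $pos));
        $token = '';
        if (preg_match('/(\S+)$/u', $before, $m) === 1) {
            $token = ltrim($m[1], '(');
        }
        // Always keep the final boundary; skip breaks after an abbreviation.
        if ($pos === $length || !in_array($token, $abbreviations, true)) {
            $ends[] = $pos;
        }
    }
    return $ends;
}

$text = "M. Dupont est arrivé. Il pleut.";
$ends = filteredSentenceEnds($text);
```

A token-based filter like this is a heuristic: it also suppresses a break when a sentence really does end with `etc.`, so a production filter would need extra context checks.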
Installation
```shell
composer require nuglif/chunkytown
```
Requires PHP 8.1+ with the intl and mbstring extensions.
Quick start
```php
use ChunkyTown\Chunker;
use ChunkyTown\ChunkerConfig;

$text = file_get_contents('article.txt');

$chunker = new Chunker();
$cfg = new ChunkerConfig(
    chunkSize: 1500,    // target graphemes per chunk
    window: 175,        // ± snapping window around the chunk end
    overlap: 275,       // desired overlap between consecutive chunks
    overlapWindow: 150  // ± snapping window when looking for the overlap start
);

$spans = $chunker->chunk($text, $cfg);
// $spans is a list of [start, end) tuples in grapheme offsets.

foreach ($chunker->render($text, $spans) as $i => $chunk) {
    echo "=== CHUNK " . ($i + 1) . " ===\n$chunk\n\n";
}
```
Reusable analysis (multi‑config runs)
If you chunk the same text under several configs (tuning, A/B sweeps),
pre‑compute the boundary detection once with analyze():
```php
$analysis = $chunker->analyze($text);

$spansA = $chunker->chunk($analysis, new ChunkerConfig(chunkSize: 1000));
$spansB = $chunker->chunk($analysis, new ChunkerConfig(chunkSize: 2000));
// boundary detection runs once, snapping runs twice.
```
Configuration
ChunkerConfig is an immutable value object:
| Option | Default | Meaning |
|---|---|---|
| `chunkSize` | `300` | Target chunk length in grapheme clusters (user‑perceived characters). |
| `window` | `50` | ± window around the target end to snap to a boundary. |
| `preferOrder` | `['paragraph', 'list', 'sentence', 'word']` | Boundary kinds tried in order when snapping. |
| `overlap` | `0` | Desired overlap (graphemes) between successive chunks. |
| `overlapWindow` | `null` (falls back to `window`) | ± window used when locating the start of the next chunk. |
| `normalize` | `false` | Normalize the input to Unicode NFC before analysis (use for mixed NFC/NFD). |
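To make the interaction of `window` and `preferOrder` concrete, here is a rough sketch of one plausible snapping rule: for each boundary kind in order, take the candidate closest to the target within ± `window`, and fall back to the raw target if nothing qualifies. The function `snapToBoundary()` is hypothetical, and the library's actual tie-breaking and fallback behaviour may differ.

```php
<?php
/**
 * Hypothetical helper (not part of chunkytown): snap a target offset to the
 * nearest boundary of the first kind in $preferOrder that has a candidate
 * within ±$window. Returns $target unchanged when no boundary qualifies.
 *
 * @param array<string, int[]> $boundaries boundary kind => grapheme offsets
 * @param string[] $preferOrder boundary kinds, most preferred first
 */
function snapToBoundary(int $target, int $window, array $boundaries, array $preferOrder): int
{
    foreach ($preferOrder as $kind) {
        $best = null;
        foreach ($boundaries[$kind] ?? [] as $offset) {
            if (abs($offset - $target) > $window) {
                continue; // outside the snapping window
            }
            if ($best === null || abs($offset - $target) < abs($best - $target)) {
                $best = $offset;
            }
        }
        if ($best !== null) {
            return $best; // a preferred kind matched, stop here
        }
    }
    return $target;
}

$boundaries = [
    'paragraph' => [0, 480, 1100],
    'sentence'  => [120, 305, 480],
];
$order = ['paragraph', 'list', 'sentence', 'word'];

$endA = snapToBoundary(300, 50, $boundaries, $order); // no paragraph in 250..350, sentence 305 wins
$endB = snapToBoundary(450, 50, $boundaries, $order); // paragraph 480 lies within 400..500
$endC = snapToBoundary(700, 50, $boundaries, $order); // nothing in range, target kept
```

Under this rule a wider `window` makes chunks align with paragraphs more often, at the cost of chunk lengths that drift further from `chunkSize`.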
External boundaries
If you already have sentence/paragraph offsets from another parser, merge them in (offsets are in graphemes):
```php
$spans = $chunker->chunk($text, $cfg, externalBoundaries: [
    'sentence'  => [120, 305, 480],
    'paragraph' => [0, 480, 1100],
]);
```
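Many parsers report UTF-8 byte offsets rather than grapheme offsets, so they may need converting before being passed as external boundaries. A minimal sketch using the intl function `grapheme_strlen()` follows; the helper name `byteToGraphemeOffset()` is hypothetical, and it assumes valid UTF-8 input and byte offsets that fall on grapheme boundaries.

```php
<?php
/**
 * Hypothetical helper (not part of chunkytown): convert a UTF-8 byte offset
 * into a grapheme offset by counting the clusters in the prefix before it.
 * Assumes $byteOffset does not fall inside a multi-byte character.
 */
function byteToGraphemeOffset(string $text, int $byteOffset): int
{
    return grapheme_strlen(substr($text, 0, $byteOffset));
}

$text = "L'été arrive. Il fait chaud.";

// Byte offset of the second sentence, as another parser might report it.
$byteOffsets = [strpos($text, 'Il')];

$graphemeOffsets = array_map(
    fn (int $b): int => byteToGraphemeOffset($text, $b),
    $byteOffsets
);
```

Here the byte offset 16 maps to grapheme offset 14, because each `é` takes two bytes but counts as one grapheme. The converted list could then be supplied under the `'sentence'` key of `externalBoundaries`.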
Offsets and slicing
All offsets returned by the chunker are grapheme cluster offsets
(emojis with ZWJ, regional‑indicator flags, skin‑tone modifiers and
NFD‑decomposed accents all count as a single unit). Slice safely with
Chunker::render(), which preserves grapheme integrity for you.
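The difference between bytes, code points, and grapheme clusters can be checked directly with PHP's built-in functions. The sketch below uses only the intl and mbstring extensions; `sliceSpan()` is a hypothetical helper showing how a `[start, end)` grapheme span could be cut by hand, and whether `Chunker::render()` works this way internally is an assumption.

```php
<?php
// "é" written as "e" + combining acute accent (NFD form).
$nfd = "cafe\u{0301} noir";
// The same text with "é" as a single precomposed code point (NFC form).
$nfc = Normalizer::normalize($nfd, Normalizer::FORM_C);

$bytes      = strlen($nfd);             // 11 bytes
$codePoints = mb_strlen($nfd, 'UTF-8'); // 10 code points
$graphemes  = grapheme_strlen($nfd);    // 9 grapheme clusters

/**
 * Hypothetical helper (not part of chunkytown): cut a [start, end) span
 * expressed in grapheme offsets without splitting a cluster.
 */
function sliceSpan(string $text, int $start, int $end): string
{
    return grapheme_substr($text, $start, $end - $start);
}

$word = sliceSpan($nfd, 0, 4); // "café", combining accent kept with its base letter
```

The grapheme count is the same for the NFC and NFD forms, which is why grapheme offsets stay stable on text that mixes both.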
What's special
- ICU sentence detection with a built‑in French abbreviation filter: `M. Dupont`, `St. Patrick`, `cf.`, `etc.` are not mistaken for sentence ends.
- ICU word detection: handles apostrophes (`l'arbre`), hyphens (`peut‑être`), and the typical French punctuation cleanly.
- Grapheme‑safe: a chunk boundary will never land inside a composed emoji or a decomposed accent.
- Reusable analysis: pay boundary detection once, re‑chunk many times.
Testing
```shell
composer install
composer test
```
The suite covers the grapheme mapper, every boundary detector,
ChunkerConfig, TextAnalysis, and end‑to‑end chunking — including
parity with the upstream Python implementation on a French fixture.
License
MIT
Other information
- License: MIT
- Last updated: 2026-07-03