nuglif/chunkytown

Composer install command:

composer require nuglif/chunkytown

Package description

French paragraph/list/sentence-aware text chunker for RAG pipelines.

README

Text chunker for French RAG pipelines. Splits a document into overlapping spans that respect natural boundaries (paragraphs, lists, sentences, words) instead of cutting in the middle of a sentence — and never inside a composed emoji or accented grapheme.

Built on PHP's intl extension (ICU), with a French‑tuned abbreviation filter so that common abbreviations (M. Dupont, St. Patrick, cf., etc.) are not mistaken for sentence ends.

Installation

composer require nuglif/chunkytown

Requires PHP 8.1+ with the intl and mbstring extensions.

Quick start

use ChunkyTown\Chunker;
use ChunkyTown\ChunkerConfig;

$text = file_get_contents('article.txt');

$chunker = new Chunker();
$cfg = new ChunkerConfig(
    chunkSize: 1500,   // target graphemes per chunk
    window: 175,       // ± snapping window around the chunk end
    overlap: 275,      // desired overlap between consecutive chunks
    overlapWindow: 150 // ± snapping window when looking for the overlap start
);

$spans = $chunker->chunk($text, $cfg);
// $spans is a list of [start, end) tuples in grapheme offsets.

foreach ($chunker->render($text, $spans) as $i => $chunk) {
    echo "=== CHUNK " . ($i + 1) . " ===\n$chunk\n\n";
}

Reusable analysis (multi‑config runs)

If you chunk the same text under several configs (tuning, A/B sweeps), pre‑compute the boundary detection once with analyze():

$analysis = $chunker->analyze($text);

$spansA = $chunker->chunk($analysis, new ChunkerConfig(chunkSize: 1000));
$spansB = $chunker->chunk($analysis, new ChunkerConfig(chunkSize: 2000));
// boundary detection runs once, snapping runs twice.

Configuration

ChunkerConfig is an immutable value object:

| Option | Default | Meaning |
|---|---|---|
| chunkSize | 300 | Target chunk length in grapheme clusters (user‑perceived characters). |
| window | 50 | ± window around the target end to snap to a boundary. |
| preferOrder | ['paragraph', 'list', 'sentence', 'word'] | Boundary kinds tried in order when snapping. |
| overlap | 0 | Desired overlap (graphemes) between successive chunks. |
| overlapWindow | null (falls back to window) | ± window used when locating the start of the next chunk. |
| normalize | false | Normalize the input to Unicode NFC before analysis (use for mixed NFC/NFD). |
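
To make the chunkSize and overlap options concrete, here is a minimal sketch of the span arithmetic with boundary snapping left out entirely. naiveSpans() is a hypothetical helper, not part of chunkytown, and it assumes each chunk starts overlap graphemes before the previous chunk's end. The library's real spans will differ, because the end and start positions are moved within window and overlapWindow to land on a boundary.

```php
<?php
/**
 * Hypothetical helper (NOT the library's algorithm): fixed-size spans
 * with a fixed overlap and no boundary snapping.
 *
 * @return array<int, array{0:int, 1:int}> half-open [start, end) spans
 */
function naiveSpans(int $length, int $chunkSize, int $overlap): array
{
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('need 0 <= overlap < chunkSize');
    }

    $spans = [];
    $start = 0;
    while ($start < $length) {
        $end = min($start + $chunkSize, $length);
        $spans[] = [$start, $end];
        if ($end === $length) {
            break; // last chunk reached the end of the text
        }
        // Assumption: the next chunk begins $overlap units before this end.
        $start = $end - $overlap;
    }
    return $spans;
}

// 1000 graphemes, chunks of 300, overlap of 50.
$spans = naiveSpans(1000, 300, 50);
```

With these numbers each chunk advances by chunkSize minus overlap (250 graphemes), so a larger overlap means more chunks for the same text.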

External boundaries

If you already have sentence/paragraph offsets from another parser, merge them in (offsets are in graphemes):

$spans = $chunker->chunk($text, $cfg, externalBoundaries: [
    'sentence' => [120, 305, 480],
    'paragraph' => [0, 480, 1100],
]);
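
Many parsers report positions in UTF-8 bytes rather than graphemes, so offsets may need converting before they are passed as external boundaries. The sketch below shows one way to do that with the intl extension. byteToGraphemeOffsets() is a hypothetical helper, not part of chunkytown, and it assumes every byte offset falls on a grapheme boundary.

```php
<?php
/**
 * Hypothetical helper: convert UTF-8 byte offsets into grapheme offsets.
 * Assumes each offset lies on a grapheme boundary; an offset in the
 * middle of a multi-byte character would produce invalid UTF-8.
 *
 * @param int[] $byteOffsets
 * @return int[]
 */
function byteToGraphemeOffsets(string $text, array $byteOffsets): array
{
    $out = [];
    foreach ($byteOffsets as $b) {
        // Count the graphemes in the prefix that ends at this byte offset.
        $out[] = grapheme_strlen(substr($text, 0, $b));
    }
    return $out;
}

// "Été chaud. Forêt sèche." written with escapes so the bytes are unambiguous.
$text = "\u{C9}t\u{E9} chaud. For\u{EA}t s\u{E8}che.";

// Byte offsets: start of text, end of first sentence, end of text.
$graphemeOffsets = byteToGraphemeOffsets($text, [0, 12, 27]);
```

The helper rescans the prefix for each offset, which is fine for a handful of boundaries; for thousands of offsets a single pass over the text would be the better choice.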

Offsets and slicing

All offsets returned by the chunker are grapheme cluster offsets (emojis with ZWJ, regional‑indicator flags, skin‑tone modifiers and NFD‑decomposed accents all count as a single unit). Slice safely with Chunker::render(), which preserves grapheme integrity for you.
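
To show why grapheme offsets differ from byte or code point offsets, the following sketch measures one short string in all three ways using only PHP built-ins. sliceGraphemes() is a hypothetical helper written for this example; it is not the library's render() implementation, only an illustration of slicing a half-open [start, end) span in grapheme units.

```php
<?php
/**
 * Hypothetical helper (not part of chunkytown): slice $text by a
 * half-open [start, end) span expressed in grapheme offsets.
 */
function sliceGraphemes(string $text, int $start, int $end): string
{
    return (string) grapheme_substr($text, $start, $end - $start);
}

// "Café ouvert" with the accent in decomposed (NFD) form:
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
$text = "Cafe\u{0301} ouvert";

$bytes      = strlen($text);             // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$graphemes  = grapheme_strlen($text);    // user-perceived characters

// The first four graphemes are "C", "a", "f" and "e" plus its accent.
$word = sliceGraphemes($text, 0, 4);
```

Slicing the same string with substr($text, 0, 4) would cut between the "e" and its combining accent, which is the kind of split that grapheme offsets avoid.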

What's special

  • ICU sentence detection with a built‑in French abbreviation filter — M. Dupont, St. Patrick, cf., etc. are not mistaken for sentence ends.
  • ICU word detection: handles apostrophes (l'arbre), hyphens (peut‑être), and typical French punctuation cleanly.
  • Grapheme‑safe: a chunk boundary will never land inside a composed emoji or a decomposed accent.
  • Reusable analysis: pay boundary detection once, re‑chunk many times.
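
The idea behind the first bullet can be sketched with PHP's IntlBreakIterator: collect ICU's candidate sentence boundaries, then drop any candidate that directly follows a known abbreviation. This is only an illustration under stated assumptions. The helper names and the three-entry abbreviation list are hypothetical, and the library's own filter is likely more thorough. Note that IntlBreakIterator reports UTF-8 byte offsets, not grapheme offsets.

```php
<?php
/** Hypothetical helper: all sentence boundaries ICU reports, as byte offsets. */
function icuSentenceBoundaries(string $text, string $locale = 'fr'): array
{
    $it = IntlBreakIterator::createSentenceInstance($locale);
    $it->setText($text);

    $boundaries = [];
    for ($pos = $it->first(); $pos !== IntlBreakIterator::DONE; $pos = $it->next()) {
        $boundaries[] = $pos;
    }
    return $boundaries;
}

/** Hypothetical helper: does the text before a boundary end with an abbreviation? */
function endsWithAbbreviation(string $prefix, array $abbreviations): bool
{
    $trimmed = rtrim($prefix);
    $pos = strrpos($trimmed, ' ');
    $lastToken = $pos === false ? $trimmed : substr($trimmed, $pos + 1);
    return in_array($lastToken, $abbreviations, true);
}

/** Hypothetical helper: remove candidates that follow an abbreviation. */
function filterAbbreviations(string $text, array $boundaries, array $abbreviations): array
{
    return array_values(array_filter(
        $boundaries,
        fn (int $b): bool => !endsWithAbbreviation(substr($text, 0, $b), $abbreviations)
    ));
}

$text = "M. Dupont arrive. Il est en retard.";
$abbreviations = ['M.', 'St.', 'cf.']; // illustrative, far from complete

$candidates = icuSentenceBoundaries($text);
$filtered   = filterAbbreviations($text, $candidates, $abbreviations);
```

After filtering, the boundaries left are the start of the text, the position just before "Il", and the end of the text, so "M. Dupont arrive." stays in one sentence.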

Testing

composer install
composer test

The suite covers the grapheme mapper, every boundary detector, ChunkerConfig, TextAnalysis, and end‑to‑end chunking — including parity with the upstream Python implementation on a French fixture.

License

MIT

Statistics

  • Total downloads: 0
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Views: 5
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Updated: 2026-07-03
