定制 labrodev/document-sampler 二次开发

按需修改功能、优化性能、对接业务系统,提供一站式技术支持

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

labrodev/document-sampler

最新稳定版本:1.0.0

Composer 安装命令:

composer require labrodev/document-sampler

包简介

Extracts a structured representative sample from long documents for downstream AI processing.

README 文档

README

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.

Requirements

  • PHP ^8.5

Installation

composer require labrodev/document-sampler

Basic usage

use Labrodev\DocumentSampler\DocumentSampler;

$result = (new DocumentSampler())->sample($rawText);

$result->intro             // opening chars — title and introduction
$result->outline           // extracted section headings from anywhere in the document
$result->middle            // fixed window centred on the document midpoint
$result->tail              // closing chars — conclusion and sign-off
$result->text              // all samples joined with separators
$result->charCount         // character count of the combined sample
$result->originalCharCount // character count of the original document

Custom window sizes

By default each zone uses the window defined on the DocumentPart enum. Pass any subset to the constructor to override:

// Override specific zones — unset zones use the enum defaults
$sampler = new DocumentSampler(
    intro:   2000,
    middle:  300,
);

$result = $sampler->sample($rawText);

How it works

The sampler partitions every document into four fixed-size windows regardless of document length:

Zone Default window What it captures
intro 1000 chars Title, abstract, opening paragraphs
outline 500 chars Section headings (# Markdown, 1.1 Numbered, ALL-CAPS lines)
middle 500 chars Window centred on the document midpoint
tail 500 chars Closing paragraphs, conclusion, signature

Windows are fixed — a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.

Exporting results

JSON

$result->toJson();
{
    "meta": {
        "originalCharCount": 50000,
        "sampledCharCount": 2300
    },
    "samples": {
        "intro": "...",
        "outline": "...",
        "middle": "...",
        "tail": "..."
    }
}

Markdown

$result->toMd();
## Document Sample

**Original size:** 50,000 chars
**Sampled size:** 2,300 chars

### Intro
...

### Outline
...

### Middle
...

### Tail
...

Empty zones are omitted from both outputs.

Default window sizes

Window sizes are defined on the DocumentPart enum and can be read at runtime:

use Labrodev\DocumentSampler\Enums\DocumentPart;

DocumentPart::Intro->chars();   // 1000
DocumentPart::Outline->chars(); // 500
DocumentPart::Middle->chars();  // 500
DocumentPart::Tail->chars();    // 500

When to use this

  • Before calling an AI API — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
  • Relevance checking — feed $result->text to a classifier to decide whether a document is relevant before processing it in full.
  • Prompt injection detection — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
  • Depersonalisation — run PII detection over a representative sample before deciding whether to redact the full document.
  • Document classification — use the outline and intro zones to classify document type without reading the entire file.

Testing

composer test

Static analysis

composer analyse

Author

Petro Lashyncontact@labrodev.com

License

MIT

统计信息

  • 总下载量: 0
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 0
  • 点击次数: 7
  • 依赖项目数: 0
  • 推荐数: 0

GitHub 信息

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • 开发语言: PHP

其他信息

  • 授权协议: MIT
  • 更新时间: 2026-04-24

承接程序开发

PHP开发

VUE

Vue开发

前端开发

小程序开发

公众号开发

系统定制

数据库设计

云部署

网站建设

安全加固