SRX 2.0 segmentation rule parser for the OpenCAT Framework

Parses .srx files into a SegmentationRuleSet that the opencat/segmentation engine uses to split text into sentences. You only need this package directly if you want to load custom SRX files; the segmentation engine loads the bundled default automatically.

Installation

composer require opencat/srx

Requires ext-dom and ext-libxml.

Usage

use CatFramework\Srx\SrxParser;

$parser = new SrxParser();
$ruleSet = $parser->parse('/path/to/rules.srx');

// Look up rules for a given BCP 47 language code
$languageRule = $ruleSet->rulesFor('en-US');

// Print one line per rule: its type, then its before/after patterns
foreach ($languageRule->rules as $rule) {
    echo $rule->break ? 'break' : 'no-break';
    echo '  before: ' . $rule->before;
    echo '  after: '  . $rule->after . PHP_EOL;
}

Bundled default SRX

The package ships a data/default.srx file covering:

English (EN.*)
Hindi (HI.*) — Devanagari Purna Viram ।
Urdu (UR.*) — Arabic Full Stop ۔
Arabic (AR.*)
French (FR.*)
German (DE.*)
Spanish (ES.*)
Chinese / Japanese (ZH.*, JA.*)
A default fallback rule (a period followed by whitespace and an uppercase letter)

Get its path via the static helper:

$path = SrxParser::defaultSrxPath();

SRX format overview

SRX 2.0 is an XML format. A rule set contains:

<languagerule> blocks: named sets of break/no-break rules for a language
<languagemap> entries: regex patterns that are matched against BCP 47 language codes, each mapped to a rule name

The parser resolves a language code by scanning <languagemap> entries in document order and returning the rules named by the first entry whose pattern matches. If no pattern matches, an empty LanguageRule is returned, so the text is not segmented.

<languagemap languagepattern="EN.*" languagerulename="English"/>
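
The first-match lookup described above can be sketched in a few lines of plain PHP. This is only an illustration, not the package's code: `resolveRuleName` and the array shape of the map entries are hypothetical, and the sketch assumes that a pattern must match the whole language code and that matching ignores case (so that `EN.*` matches `en-US`).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of opencat/srx: returns the rule name of the
 * first map entry whose pattern matches the language code, or null.
 *
 * @param list<array{pattern: string, ruleName: string}> $maps entries in document order
 */
function resolveRuleName(array $maps, string $languageCode): ?string
{
    foreach ($maps as $map) {
        // Assumption: whole-string, case-insensitive match.
        // A pattern containing "~" would need escaping before use here.
        $regex = '~\A(?:' . $map['pattern'] . ')\z~iu';
        if (preg_match($regex, $languageCode) === 1) {
            return $map['ruleName']; // first match wins, later entries are ignored
        }
    }

    return null; // no entry matched
}

$maps = [
    ['pattern' => 'EN.*', 'ruleName' => 'English'],
    ['pattern' => 'FR.*', 'ruleName' => 'French'],
    ['pattern' => '.*',   'ruleName' => 'Default'],
];

$english  = resolveRuleName($maps, 'en-US');
$fallback = resolveRuleName($maps, 'pt-BR');
$none     = resolveRuleName([], 'en-US');
```

Because the scan stops at the first match, specific patterns belong before broad ones; a catch-all such as `.*` placed first would shadow every entry after it.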

Each <rule> inside a <languagerule> has:

break="yes|no": whether a match marks a sentence boundary (yes) or suppresses one (no)
<beforebreak>: regex that must match the text immediately before the candidate break position
<afterbreak>: regex that must match the text immediately after the candidate break position

Rules are evaluated in order at each candidate position; the first rule whose patterns both match decides whether that position is a break.
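
To make the "first matching rule wins" behaviour concrete, here is a minimal, self-contained sketch of how such rules could be applied. It is not the opencat/segmentation implementation: `splitWithRules` and the array shape of the rules are hypothetical, and testing every character boundary as a candidate position is an assumption made for simplicity.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of opencat/srx or opencat/segmentation.
 *
 * @param list<array{break: bool, before: string, after: string}> $rules in evaluation order
 * @return list<string> the segments, with whitespace left untouched
 */
function splitWithRules(string $text, array $rules): array
{
    // Split into UTF-8 characters so candidate positions never cut a character.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $segments = [];
    $segmentStart = 0; // byte offset where the current segment begins
    $offset = 0;       // byte offset of the candidate break position

    foreach ($chars as $char) {
        $offset += strlen($char);
        if ($offset >= strlen($text)) {
            break; // the end of the text is not a candidate position
        }
        $before = substr($text, 0, $offset);
        $after  = substr($text, $offset);

        foreach ($rules as $rule) {
            // "before" must match right up to the position, "after" right from it.
            $matches = preg_match('~(?:' . $rule['before'] . ')\z~u', $before) === 1
                && preg_match('~\A(?:' . $rule['after'] . ')~u', $after) === 1;
            if (!$matches) {
                continue;
            }
            if ($rule['break']) {
                $segments[] = substr($text, $segmentStart, $offset - $segmentStart);
                $segmentStart = $offset;
            }
            break; // first matching rule wins, whether it breaks or not
        }
    }

    $segments[] = substr($text, $segmentStart);

    return $segments;
}

$rules = [
    ['break' => false, 'before' => '\b(Mr|Mrs|Dr|Prof)\.', 'after' => '\s'],
    ['break' => true,  'before' => '[.!?]',                'after' => '\s+[A-Z]'],
];

$segments = splitWithRules('Dr. Smith left. He came back.', $rules);
```

With the no-break rule listed first, the period after "Dr" never reaches the break rule. Swapping the two rules would split the text after "Dr." as well, which is why exceptions are normally listed before the general break rules.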

Classes

Class	Purpose
`SrxParser`	Parses an SRX file into a `SegmentationRuleSet`
`SegmentationRuleSet`	Holds all language rules and maps a BCP 47 code to a `LanguageRule`
`LanguageRule`	A named list of `SegmentationRule` objects for one language
`SegmentationRule`	A single break/no-break rule with `before` and `after` patterns

Writing custom SRX rules

<?xml version="1.0" encoding="UTF-8"?>
<srx version="2.0" xmlns="http://www.lisa.org/srx20">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <!-- No break: abbreviations -->
        <rule break="no">
          <beforebreak>\b(Mr|Mrs|Dr|Prof)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- Break: sentence end -->
        <rule break="yes">
          <beforebreak>[.!?]</beforebreak>
          <afterbreak>\s+[A-Z]</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="EN.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
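
For readers curious how a file like this maps onto rule data, the following sketch reads the same document with ext-dom, one of the extensions the package requires. It is an illustration under stated assumptions, not the package's actual parser: the array shapes are hypothetical, and a missing `break` attribute is treated as "yes", which is the SRX 2.0 default as far as I know.

```php
<?php
declare(strict_types=1);

// The sample SRX document from this section, inlined to keep the sketch self-contained.
$srx = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<srx version="2.0" xmlns="http://www.lisa.org/srx20">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\b(Mr|Mrs|Dr|Prof)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.!?]</beforebreak>
          <afterbreak>\s+[A-Z]</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="EN.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
XML;

$doc = new DOMDocument();
if (!$doc->loadXML($srx)) {
    throw new RuntimeException('Invalid SRX document');
}

// SRX elements live in a namespace, so XPath queries need a registered prefix.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('srx', 'http://www.lisa.org/srx20');

// Collect the rules of each <languagerule>, keyed by rule name, in document order.
$rulesByName = [];
foreach ($xpath->query('//srx:languagerule') as $languageRule) {
    $name = $languageRule->getAttribute('languagerulename');
    $rulesByName[$name] = [];
    foreach ($xpath->query('srx:rule', $languageRule) as $rule) {
        $rulesByName[$name][] = [
            'break'  => $rule->getAttribute('break') !== 'no', // assumed default: yes
            'before' => $xpath->evaluate('string(srx:beforebreak)', $rule),
            'after'  => $xpath->evaluate('string(srx:afterbreak)', $rule),
        ];
    }
}

// Collect the <languagemap> entries in document order.
$maps = [];
foreach ($xpath->query('//srx:maprules/srx:languagemap') as $map) {
    $maps[] = [
        'pattern'  => $map->getAttribute('languagepattern'),
        'ruleName' => $map->getAttribute('languagerulename'),
    ];
}
```

In normal use none of this is needed: `SrxParser::parse()` returns a ready-made `SegmentationRuleSet`, as shown in the Usage section.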

Related packages

opencat/core: provides the SegmentationException thrown on parse failure
opencat/segmentation: consumes SegmentationRuleSet to split Segment objects
