opencat/terminology

Composer install command:

composer require opencat/terminology

Package description

Term recognition and TBX import for the OpenCAT Framework

README

Term recognition and TBX import for the OpenCAT Framework.

Parses TBX v2 (ISO 30042) glossary files and stores terms in SQLite. At translation time, scans source text for known terms and returns their target-language equivalents so the translator sees glossary matches alongside TM matches.

Installation

composer require opencat/terminology

Requires ext-dom, ext-intl, ext-mbstring, ext-pdo, and ext-pdo_sqlite.

Usage

use CatFramework\Terminology\Provider\SqliteTerminologyProvider;

$provider = new SqliteTerminologyProvider('glossary.db');
// SQLite schema is created automatically

// Import a TBX file
$count = $provider->import('legal-terms.tbx');
echo "Imported {$count} term entries";

// Recognise terms in source text
$matches = $provider->recognize(
    text: 'Please review the translation memory for consistency.',
    sourceLanguage: 'en',
    targetLanguage: 'fr',
);

foreach ($matches as $match) {
    echo $match->entry->sourceTerm . ' → ' . $match->entry->targetTerm . PHP_EOL;
    echo "Found at offset {$match->offset}, length {$match->length}" . PHP_EOL;
}

TBX parser

The TbxParser handles both TBX v2 (<martif> root, <langSet>, <tig>) and TBX-Basic (<tbx> root, <langSec>, <termSec>):

use CatFramework\Terminology\Parser\TbxParser;

$parser = new TbxParser();
$entries = $parser->parseFile('glossary.tbx');   // returns TermEntry[]
// or from a string:
$entries = $parser->parseString($xmlString);

Each TermEntry carries:

  • $sourceTerm / $targetTerm — the term text
  • $sourceLanguage / $targetLanguage — BCP 47 codes
  • $definition — extracted from <descrip type="definition">
  • $domain — extracted from <descrip type="subjectField">
  • $forbidden — true when administrativeStatus is deprecatedTerm or supersededTerm

If a concept has multiple terms per language, all source × target combinations are generated as individual TermEntry objects.
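The source × target expansion described above can be pictured as a plain nested loop. This is only a sketch: crossProduct() and the array shape are hypothetical and are not part of opencat/terminology, which returns TermEntry objects.

```php
<?php
// Hypothetical helper, not the package API: pairs every source-language term
// of one concept with every target-language term of the same concept.
function crossProduct(array $sourceTerms, array $targetTerms): array
{
    $pairs = [];
    foreach ($sourceTerms as $source) {
        foreach ($targetTerms as $target) {
            $pairs[] = ['sourceTerm' => $source, 'targetTerm' => $target];
        }
    }
    return $pairs;
}

// Two English terms and one French term yield two pairs.
$pairs = crossProduct(['translation memory', 'TM'], ['mémoire de traduction']);
echo count($pairs) . PHP_EOL;
```

A concept with m source terms and n target terms therefore produces m × n entries, which is worth keeping in mind for glossaries with many synonyms.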

TBX file example

<?xml version="1.0" encoding="UTF-8"?>
<martif type="TBX" xml:lang="en">
  <text>
    <body>
      <termEntry>
        <langSet xml:lang="en">
          <tig>
            <term>translation memory</term>
            <descrip type="definition">A database of previously translated segments.</descrip>
          </tig>
        </langSet>
        <langSet xml:lang="fr">
          <tig>
            <term>mémoire de traduction</term>
          </tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>
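To show how a file shaped like the example above maps to terms per language, here is a minimal sketch using ext-dom directly. termsByLanguage() is a hypothetical helper written for illustration; it is not how TbxParser is implemented, and it only handles the TBX v2 element names (<langSet>, <tig>).

```php
<?php
// Hypothetical helper, not the TbxParser API: collects <term> text per language.
function termsByLanguage(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $result = [];
    foreach ($xpath->query('//termEntry/langSet') as $langSet) {
        // xml:lang belongs to the predefined XML namespace.
        $lang = $langSet->getAttributeNS('http://www.w3.org/XML/1998/namespace', 'lang');
        foreach ($xpath->query('tig/term', $langSet) as $term) {
            $result[$lang][] = trim($term->textContent);
        }
    }
    return $result;
}

$xml = <<<'XML'
<martif type="TBX" xml:lang="en">
  <text><body><termEntry>
    <langSet xml:lang="en"><tig><term>translation memory</term></tig></langSet>
    <langSet xml:lang="fr"><tig><term>mémoire de traduction</term></tig></langSet>
  </termEntry></body></text>
</martif>
XML;

$terms = termsByLanguage($xml);
print_r($terms);
```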

Term recognition

recognize() uses Unicode-aware word-boundary detection rather than regex \b — which is byte-level and breaks for Arabic and Devanagari. Boundaries are detected using space and punctuation characters, making it safe for Hindi, Urdu, and Arabic terms.

Longer terms are matched preferentially over shorter ones when they overlap (greedy left-to-right scan).
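The boundary rule and the longest-match preference can be sketched together as follows. This is an illustration under stated assumptions, not the package's code: isSeparator() and findTerms() are hypothetical names, matching is case-insensitive by choice, and offsets are counted in characters (the unit used by the package's $match->offset is not stated here).

```php
<?php
// Hypothetical sketch, not the implementation used by recognize().

// True when $char is whitespace or punctuation, i.e. a word boundary.
function isSeparator(string $char): bool
{
    return preg_match('/^[\s\p{Z}\p{P}]$/u', $char) === 1;
}

// Returns matches as ['term' => ..., 'offset' => ..., 'length' => ...],
// with offset and length measured in characters.
function findTerms(string $text, array $terms): array
{
    // Longest terms first, so overlapping candidates resolve to the longer one.
    usort($terms, fn (string $a, string $b): int => mb_strlen($b) <=> mb_strlen($a));

    $chars = mb_str_split($text);
    $total = count($chars);
    $matches = [];
    $i = 0;

    while ($i < $total) {
        $found = false;
        // A term may only start at the text start or right after a separator.
        if ($i === 0 || isSeparator($chars[$i - 1])) {
            foreach ($terms as $term) {
                $len = mb_strlen($term);
                $end = $i + $len;
                if ($end > $total) {
                    continue;
                }
                $candidate = implode('', array_slice($chars, $i, $len));
                // A term may only end at the text end or right before a separator.
                $atEnd = $end === $total || isSeparator($chars[$end]);
                if ($atEnd && mb_strtolower($candidate) === mb_strtolower($term)) {
                    $matches[] = ['term' => $term, 'offset' => $i, 'length' => $len];
                    $i = $end; // skip past the match: left-to-right, no overlaps
                    $found = true;
                    break;
                }
            }
        }
        if (!$found) {
            $i++;
        }
    }
    return $matches;
}

$english = findTerms(
    'Please review the translation memory for consistency.',
    ['translation', 'translation memory']
);
$hindi = findTerms('यह अनुवाद स्मृति है।', ['अनुवाद स्मृति']);
print_r($english);
print_r($hindi);
```

Because Devanagari vowel signs are combining marks rather than punctuation or spaces, they never count as separators in this sketch, so a Hindi term is not split in the middle of a word.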

Forbidden terms

Terms imported with administrativeStatus = deprecatedTerm or supersededTerm are stored as forbidden. The opencat/qa TerminologyConsistencyCheck flags target segments that use a forbidden term instead of its approved equivalent.
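As a rough sketch of how a status value might be mapped to the forbidden flag: isForbiddenStatus() is a hypothetical helper, the sample term is invented for illustration, and the "-admn-sts" suffix handling is an assumption about how some TBX files spell the picklist values. The package's own logic may differ.

```php
<?php
// Hypothetical helper, not the package API.
function isForbiddenStatus(string $status): bool
{
    // Assumption: some TBX files write values such as "deprecatedTerm-admn-sts";
    // strip that suffix before comparing.
    $normalized = preg_replace('/-admn-sts$/', '', trim($status));
    return in_array($normalized, ['deprecatedTerm', 'supersededTerm'], true);
}

// Illustrative fragment; the term itself is made up for this example.
$xml = '<tig><term>translation store</term>'
     . '<termNote type="administrativeStatus">deprecatedTerm-admn-sts</termNote></tig>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$status = $xpath->evaluate('string(//termNote[@type="administrativeStatus"])');

var_dump(isForbiddenStatus($status));
```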

Related packages

  • opencat/core — TermEntry, TermMatch, TerminologyProviderInterface, TerminologyException
  • opencat/qa — TerminologyConsistencyCheck uses TerminologyProviderInterface
  • opencat/workflow — wires SqliteTerminologyProvider into the pipeline

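Statistics

  • Total downloads: 0
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 1
  • Views: 8
  • Dependents: 1
  • Recommendations: 0

GitHub information

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2026-05-09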
