thingston/language

Composer install command:

composer require thingston/language

Package description

Pure-PHP language detection library using n-gram profiles

README

Pure-PHP language detection library using n-gram frequency profiles. Given any input text, it returns a ranked list of candidate languages with confidence scores. It needs no external services and no compiled extensions beyond ext-mbstring.

Installation

composer require thingston/language

Usage

use Thingston\Language\LanguageDetector;

$detector = new LanguageDetector();

$results = $detector->detect('Bonjour tout le monde');

$best = $results->best();
echo $best->getCode();       // fr
echo $best->getName();       // French
echo $best->getConfidence(); // ~0.97

foreach ($results->top(5) as $score) {
    echo $score->getCode() . ': ' . round($score->getConfidence() * 100, 1) . '%' . PHP_EOL;
}

Restrict to a subset of languages

$detector = new LanguageDetector(languages: ['en', 'fr', 'de', 'es']);
$best = $detector->detect('The weather is lovely today')->best();
echo $best->getCode(); // en

Custom n-gram sizes

$detector = new LanguageDetector(ngramSizes: [1, 2, 3]);

Custom profile repository

use Thingston\Language\Profile\ProfileRepository;

$detector = new LanguageDetector(profileRepository: new ProfileRepository('/path/to/profiles'));

Supported languages (74)

Code  Language      Code  Language      Code  Language
af    Afrikaans     hi    Hindi         ps    Pashto
am    Amharic       hr    Croatian      pt    Portuguese
ar    Arabic        hu    Hungarian     ro    Romanian
az    Azerbaijani   id    Indonesian    ru    Russian
be    Belarusian    ig    Igbo          si    Sinhala
bg    Bulgarian     it    Italian       sk    Slovak
bn    Bengali       ja    Japanese      sl    Slovenian
bs    Bosnian       kk    Kazakh        so    Somali
ca    Catalan       km    Khmer         sq    Albanian
cs    Czech         kn    Kannada       sr    Serbian
cy    Welsh         ko    Korean        sv    Swedish
da    Danish        lo    Lao           sw    Swahili
de    German        lt    Lithuanian    ta    Tamil
el    Greek         lv    Latvian       te    Telugu
en    English       mk    Macedonian    tg    Tajik
es    Spanish       ml    Malayalam     th    Thai
et    Estonian      mn    Mongolian     tl    Tagalog
eu    Basque        mr    Marathi       tr    Turkish
fa    Persian       ms    Malay         uk    Ukrainian
fi    Finnish       my    Burmese       ur    Urdu
fr    French        ne    Nepali        uz    Uzbek
gl    Galician      nl    Dutch         vi    Vietnamese
gu    Gujarati      no    Norwegian     yo    Yoruba
ha    Hausa         pa    Punjabi       zh    Chinese
he    Hebrew        pl    Polish

How it works

Language profiles are built from Tatoeba sentence corpora (CC-BY 2.0). For each language, the top 1,000 most frequent character n-grams (sizes 1–4) are stored as relative frequencies.
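
The profile-building step can be sketched roughly as follows. This is not the library's own code: the function names `countNgrams` and `buildProfile` are hypothetical, lowercasing the training sentences is an assumption, and so is the choice to divide each count by the sum of the retained counts (the README does not say what the frequencies are relative to).

```php
<?php
declare(strict_types=1);

/**
 * Count every character n-gram of the requested sizes in one string.
 *
 * @param int[] $sizes
 * @return array<string, int>
 */
function countNgrams(string $text, array $sizes): array
{
    $counts = [];
    $length = mb_strlen($text, 'UTF-8');
    foreach ($sizes as $n) {
        // Slide a window of $n characters (not bytes) across the text.
        for ($i = 0; $i + $n <= $length; $i++) {
            $gram = mb_substr($text, $i, $n, 'UTF-8');
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }
    return $counts;
}

/**
 * Build a profile: the $top most frequent n-grams mapped to relative frequencies.
 *
 * @param string[] $sentences
 * @param int[]    $sizes
 * @return array<string, float>
 */
function buildProfile(array $sentences, array $sizes = [1, 2, 3, 4], int $top = 1000): array
{
    $totals = [];
    foreach ($sentences as $sentence) {
        // Assumption: training text is lowercased before counting.
        $lower = mb_strtolower($sentence, 'UTF-8');
        foreach (countNgrams($lower, $sizes) as $gram => $count) {
            $totals[$gram] = ($totals[$gram] ?? 0) + $count;
        }
    }

    arsort($totals);                              // most frequent first
    $kept = array_slice($totals, 0, $top, true);  // keep keys, cut to $top entries
    $sum  = array_sum($kept);

    $profile = [];
    foreach ($kept as $gram => $count) {
        // Assumption: frequencies are relative to the retained n-grams only.
        $profile[$gram] = $count / $sum;
    }
    return $profile;
}

// Toy corpus; a real profile would be built from thousands of sentences.
$profile = buildProfile(['bonjour tout le monde', 'le monde est petit'], [1, 2], 5);
print_r($profile);
```

Capping the profile at a fixed number of entries keeps every language file the same size regardless of corpus size, at the cost of ignoring rare n-grams.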

At detection time the input text is normalized (lowercased, non-letter characters stripped), n-grams are extracted, and each language profile is scored with a log-probability sum. Raw scores are converted to confidences via softmax.
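
A minimal sketch of that detection pipeline, under stated assumptions, might look like this. All function names here are hypothetical and do not come from the library. The sketch replaces runs of non-letters with a single space rather than deleting them outright, keeps combining marks, and uses a floor probability of 1e-6 for n-grams missing from a profile; none of these details are confirmed by the README. The two toy profiles are made-up numbers for illustration only.

```php
<?php
declare(strict_types=1);

/** Lowercase the text and collapse runs of non-letter characters into one space. */
function normalizeText(string $text): string
{
    $lower = mb_strtolower($text, 'UTF-8');
    // \p{L} = letters, \p{M} = combining marks (kept so Indic vowel signs survive).
    return trim(preg_replace('/[^\p{L}\p{M}]+/u', ' ', $lower) ?? '');
}

/** Count character n-grams of the given sizes. */
function extractNgrams(string $text, array $sizes): array
{
    $counts = [];
    $length = mb_strlen($text, 'UTF-8');
    foreach ($sizes as $n) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $gram = mb_substr($text, $i, $n, 'UTF-8');
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }
    return $counts;
}

/** Sum of log-probabilities of the input n-grams under one language profile. */
function scoreProfile(array $grams, array $profile, float $floor = 1.0e-6): float
{
    $score = 0.0;
    foreach ($grams as $gram => $count) {
        // Unseen n-grams get a small floor probability (assumed value).
        $score += $count * log($profile[$gram] ?? $floor);
    }
    return $score;
}

/** Convert raw scores to confidences that sum to 1, highest first. */
function softmax(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $max = max($scores); // subtract the maximum for numerical stability
    $exp = [];
    foreach ($scores as $code => $score) {
        $exp[$code] = exp($score - $max);
    }
    $sum = array_sum($exp);
    $confidences = [];
    foreach ($exp as $code => $value) {
        $confidences[$code] = $value / $sum;
    }
    arsort($confidences);
    return $confidences;
}

/** Run the whole pipeline against a map of language code => profile. */
function detect(string $text, array $profiles, array $sizes = [1, 2, 3, 4]): array
{
    $grams = extractNgrams(normalizeText($text), $sizes);
    $scores = [];
    foreach ($profiles as $code => $profile) {
        $scores[$code] = scoreProfile($grams, $profile);
    }
    return softmax($scores);
}

// Made-up toy profiles; real ones hold about 1,000 entries per language.
$toyProfiles = [
    'en' => ['t' => 0.3, 'h' => 0.2, 'e' => 0.3, 'th' => 0.1, 'he' => 0.1],
    'fr' => ['l' => 0.3, 'e' => 0.3, 'u' => 0.2, 'le' => 0.1, 'eu' => 0.1],
];
print_r(detect('The', $toyProfiles, [1, 2]));
```

Because the raw score is a sum over all n-grams, longer inputs widen the gap between languages, so the softmax output tends to saturate toward a single candidate as the text grows.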

Rebuilding profiles

If you want to regenerate language profiles from fresh training data:

# 1. Download Tatoeba sentence corpora (requires ext-bz2)
php bin/download-corpus.php

# 2. Build PHP profile files
php bin/build-profiles.php

# Optional: restrict to specific languages
php bin/download-corpus.php en fr de
php bin/build-profiles.php en fr de --top=1000 --sizes=1,2,3,4

Development

composer install

# Run unit tests
composer test-unit

# Run accuracy tests (requires built profiles)
composer test-accuracy

# Code style check
composer cs

# Code style fix
composer cs-fix

# Static analysis (PHPStan level 9)
composer stan

License

MIT

Statistics

  • Total downloads: 0
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Views: 2
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2026-06-19
