thingston/language
Pure-PHP language detection library using n-gram frequency profiles. Given any input text, returns a ranked list of candidate languages with confidence scores. No external services or compiled extensions required beyond ext-mbstring.
Installation
composer require thingston/language
Usage
    use Thingston\Language\LanguageDetector;

    $detector = new LanguageDetector();
    $results = $detector->detect('Bonjour tout le monde');
    $best = $results->best();

    echo $best->getCode();       // fr
    echo $best->getName();       // French
    echo $best->getConfidence(); // ~0.97

    foreach ($results->top(5) as $score) {
        echo $score->getCode() . ': ' . round($score->getConfidence() * 100, 1) . '%' . PHP_EOL;
    }
Restrict to a subset of languages
    $detector = new LanguageDetector(languages: ['en', 'fr', 'de', 'es']);
    $best = $detector->detect('The weather is lovely today')->best();

    echo $best->getCode(); // en
Custom n-gram sizes
    $detector = new LanguageDetector(ngramSizes: [1, 2, 3]);
Custom profile repository
    use Thingston\Language\Profile\ProfileRepository;

    $detector = new LanguageDetector(profileRepository: new ProfileRepository('/path/to/profiles'));
Supported languages (74)
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| af | Afrikaans | hi | Hindi | ps | Pashto |
| am | Amharic | hr | Croatian | pt | Portuguese |
| ar | Arabic | hu | Hungarian | ro | Romanian |
| az | Azerbaijani | id | Indonesian | ru | Russian |
| be | Belarusian | ig | Igbo | si | Sinhala |
| bg | Bulgarian | it | Italian | sk | Slovak |
| bn | Bengali | ja | Japanese | sl | Slovenian |
| bs | Bosnian | kk | Kazakh | so | Somali |
| ca | Catalan | km | Khmer | sq | Albanian |
| cs | Czech | kn | Kannada | sr | Serbian |
| cy | Welsh | ko | Korean | sv | Swedish |
| da | Danish | lo | Lao | sw | Swahili |
| de | German | lt | Lithuanian | ta | Tamil |
| el | Greek | lv | Latvian | te | Telugu |
| en | English | mk | Macedonian | tg | Tajik |
| es | Spanish | ml | Malayalam | th | Thai |
| et | Estonian | mn | Mongolian | tl | Tagalog |
| eu | Basque | mr | Marathi | tr | Turkish |
| fa | Persian | ms | Malay | uk | Ukrainian |
| fi | Finnish | my | Burmese | ur | Urdu |
| fr | French | ne | Nepali | uz | Uzbek |
| gl | Galician | nl | Dutch | vi | Vietnamese |
| gu | Gujarati | no | Norwegian | yo | Yoruba |
| ha | Hausa | pa | Punjabi | zh | Chinese |
| he | Hebrew | pl | Polish | | |
How it works
Language profiles are built from Tatoeba sentence corpora (CC-BY 2.0). For each language, the 1,000 most frequent character n-grams (sizes 1–4) are stored with their relative frequencies.
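As a rough illustration of the profile-building step, the sketch below counts character n-grams over a few sentences and keeps the most frequent ones as relative frequencies. `buildProfile()` is a hypothetical helper, not part of the library's API, and it makes two assumptions the text does not confirm: runs of non-letter characters are collapsed to a single space, and frequencies are normalized over all n-gram sizes together.

```php
<?php

declare(strict_types=1);

/**
 * Build a relative-frequency n-gram profile from sample sentences.
 * Hypothetical helper for illustration; not the library's implementation.
 *
 * @param list<string> $sentences Training sentences for one language
 * @param list<int>    $sizes     N-gram sizes to extract
 * @param int          $top       Number of most frequent n-grams to keep
 * @return array<string, float>   n-gram => relative frequency, most frequent first
 */
function buildProfile(array $sentences, array $sizes = [1, 2, 3, 4], int $top = 1000): array
{
    $counts = [];

    foreach ($sentences as $sentence) {
        // Assumption: lowercase, then collapse each run of non-letters to one space.
        $lower = mb_strtolower($sentence, 'UTF-8');
        $text = trim((string) preg_replace('/\P{L}+/u', ' ', $lower));
        $chars = mb_str_split($text, 1, 'UTF-8');
        $length = count($chars);

        foreach ($sizes as $n) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = implode('', array_slice($chars, $i, $n));
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    $total = array_sum($counts);
    if ($total === 0) {
        return [];
    }

    // Sort by count, descending, and keep only the $top most frequent n-grams.
    arsort($counts);

    $profile = [];
    foreach (array_slice($counts, 0, $top, true) as $gram => $count) {
        $profile[(string) $gram] = $count / $total;
    }

    return $profile;
}
```

Calling `buildProfile(['Bonjour tout le monde'])` would return a small map from n-gram to relative frequency; a real profile would be built from many thousands of sentences.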
At detection time the input text is normalized (lowercased, non-letter characters stripped), n-grams are extracted, and each language profile is scored with a log-probability sum. Raw scores are converted to confidences via softmax.
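The detection step described above can be sketched as follows. This is a minimal illustration rather than the library's actual code: `extractNgrams()` and `scoreLanguages()` are hypothetical names, the `$floor` probability assigned to n-grams missing from a profile is an assumed smoothing choice, and non-letter runs are assumed to collapse to a single space.

```php
<?php

declare(strict_types=1);

/**
 * Normalize text and return its character n-grams (hypothetical helper).
 *
 * @param list<int> $sizes
 * @return list<string>
 */
function extractNgrams(string $text, array $sizes): array
{
    // Assumption: lowercase, then collapse each run of non-letters to one space.
    $lower = mb_strtolower($text, 'UTF-8');
    $clean = trim((string) preg_replace('/\P{L}+/u', ' ', $lower));
    $chars = mb_str_split($clean, 1, 'UTF-8');
    $length = count($chars);

    $grams = [];
    foreach ($sizes as $n) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $grams[] = implode('', array_slice($chars, $i, $n));
        }
    }

    return $grams;
}

/**
 * Score each profile with a log-probability sum, then apply softmax.
 *
 * @param array<string, array<string, float>> $profiles code => (n-gram => frequency)
 * @param list<int>                           $sizes
 * @param float                               $floor    Assumed probability for unseen n-grams
 * @return array<string, float> code => confidence, highest first
 */
function scoreLanguages(string $text, array $profiles, array $sizes = [1, 2, 3, 4], float $floor = 1e-6): array
{
    if ($profiles === []) {
        return [];
    }

    $grams = extractNgrams($text, $sizes);

    $logScores = [];
    foreach ($profiles as $code => $profile) {
        $sum = 0.0;
        foreach ($grams as $gram) {
            $sum += log($profile[$gram] ?? $floor);
        }
        $logScores[$code] = $sum;
    }

    // Softmax; subtracting the maximum keeps exp() from underflowing to zero everywhere.
    $max = max($logScores);
    $exp = array_map(static fn (float $s): float => exp($s - $max), $logScores);
    $norm = array_sum($exp);
    $confidences = array_map(static fn (float $e): float => $e / $norm, $exp);

    arsort($confidences);

    return $confidences;
}
```

In this sketch, empty input yields no n-grams, so every language gets the same confidence. Because the log scores grow with text length, longer inputs tend to push the softmax toward a single dominant language.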
Rebuilding profiles
To regenerate language profiles from fresh training data:
    # 1. Download Tatoeba sentence corpora (requires ext-bzip2)
    php bin/download-corpus.php

    # 2. Build PHP profile files
    php bin/build-profiles.php

    # Optional: restrict to specific languages
    php bin/download-corpus.php en fr de
    php bin/build-profiles.php en fr de --top=1000 --sizes=1,2,3,4
Development
    composer install

    # Run unit tests
    composer test-unit

    # Run accuracy tests (requires built profiles)
    composer test-accuracy

    # Code style check
    composer cs

    # Code style fix
    composer cs-fix

    # Static analysis (PHPStan level 9)
    composer stan
License
MIT
Other information
- License: MIT
- Last updated: 2026-06-19