README

A pure-PHP language detection library based on character n-gram frequency profiles. Given an input text, it returns a ranked list of candidate languages with confidence scores. It needs no external services and no compiled extensions other than ext-mbstring.

Installation

composer require thingston/language

Usage

use Thingston\Language\LanguageDetector;

$detector = new LanguageDetector();

$results = $detector->detect('Bonjour tout le monde');

$best = $results->best();
echo $best->getCode();       // fr
echo $best->getName();       // French
echo $best->getConfidence(); // ~0.97

foreach ($results->top(5) as $score) {
    echo $score->getCode() . ': ' . round($score->getConfidence() * 100, 1) . '%' . PHP_EOL;
}

Restrict to a subset of languages

$detector = new LanguageDetector(languages: ['en', 'fr', 'de', 'es']);
$best = $detector->detect('The weather is lovely today')->best();
echo $best->getCode(); // en

Custom n-gram sizes

$detector = new LanguageDetector(ngramSizes: [1, 2, 3]);

Custom profile repository

use Thingston\Language\Profile\ProfileRepository;

$detector = new LanguageDetector(profileRepository: new ProfileRepository('/path/to/profiles'));

Supported languages (74)

Code	Language	Code	Language	Code	Language
af	Afrikaans	hi	Hindi	ps	Pashto
am	Amharic	hr	Croatian	pt	Portuguese
ar	Arabic	hu	Hungarian	ro	Romanian
az	Azerbaijani	id	Indonesian	ru	Russian
be	Belarusian	ig	Igbo	si	Sinhala
bg	Bulgarian	it	Italian	sk	Slovak
bn	Bengali	ja	Japanese	sl	Slovenian
bs	Bosnian	kk	Kazakh	so	Somali
ca	Catalan	km	Khmer	sq	Albanian
cs	Czech	kn	Kannada	sr	Serbian
cy	Welsh	ko	Korean	sv	Swedish
da	Danish	lo	Lao	sw	Swahili
de	German	lt	Lithuanian	ta	Tamil
el	Greek	lv	Latvian	te	Telugu
en	English	mk	Macedonian	tg	Tajik
es	Spanish	ml	Malayalam	th	Thai
et	Estonian	mn	Mongolian	tl	Tagalog
eu	Basque	mr	Marathi	tr	Turkish
fa	Persian	ms	Malay	uk	Ukrainian
fi	Finnish	my	Burmese	ur	Urdu
fr	French	ne	Nepali	uz	Uzbek
gl	Galician	nl	Dutch	vi	Vietnamese
gu	Gujarati	no	Norwegian	yo	Yoruba
ha	Hausa	pa	Punjabi	zh	Chinese
he	Hebrew	pl	Polish

How it works

Language profiles are built from Tatoeba sentence corpora (CC BY 2.0 FR). For each language, the 1,000 most frequent character n-grams (sizes 1–4) are stored as relative frequencies.
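The profile-building step described above can be sketched roughly as follows. This is an illustration only, not the code in bin/build-profiles.php: the function name buildProfile is hypothetical, and two details are assumptions the text does not settle, namely that runs of non-letter characters collapse to a single space and that frequencies are taken relative to the count of all extracted n-grams rather than only the retained ones.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): builds an n-gram
 * frequency profile from a list of training sentences.
 *
 * @param list<string> $sentences
 * @param list<int>    $sizes
 * @return array<string, float> n-gram => relative frequency, most frequent first
 */
function buildProfile(array $sentences, array $sizes = [1, 2, 3, 4], int $top = 1000): array
{
    $counts = [];
    foreach ($sentences as $sentence) {
        // Assumed normalization: lowercase, then collapse non-letter runs to one space.
        $text = mb_strtolower($sentence, 'UTF-8');
        $text = trim(preg_replace('/[^\p{L}\p{M}]+/u', ' ', $text) ?? '');
        $length = mb_strlen($text, 'UTF-8');

        // Slide a window of each size over the text and count every n-gram.
        foreach ($sizes as $n) {
            for ($i = 0; $i + $n <= $length; $i++) {
                $gram = mb_substr($text, $i, $n, 'UTF-8');
                $counts[$gram] = ($counts[$gram] ?? 0) + 1;
            }
        }
    }

    // Assumption: frequencies are relative to all extracted n-grams.
    $total = array_sum($counts);

    // Keep only the $top most frequent n-grams.
    arsort($counts);
    $counts = array_slice($counts, 0, $top, true);

    $profile = [];
    foreach ($counts as $gram => $count) {
        $profile[$gram] = $count / (float) $total;
    }

    return $profile;
}

$profile = buildProfile(['ab ab'], [1, 2], 3);
print_r($profile);
```

With the single sentence "ab ab", the n-grams "a", "b", and "ab" each occur twice out of nine extracted n-grams, so those three are the ones retained when $top is 3.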

At detection time, the input text is normalized (lowercased, with non-letter characters stripped) and its n-grams are extracted. Each language profile is then scored by summing the log-probabilities of those n-grams, and the raw scores are converted via softmax into confidences that sum to 1 across the candidate languages.
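As a rough sketch of the detection steps above, the snippet below extracts n-grams, sums log-probabilities per profile, and applies softmax. The function names, the toy profiles, and the floor value used for n-grams missing from a profile are all assumptions made for illustration. The library's actual smoothing, and whether it normalizes scores by text length, are not stated here.

```php
<?php

declare(strict_types=1);

/**
 * Extracts character n-grams of the given sizes from already-normalized text.
 *
 * @param list<int> $sizes
 * @return list<string>
 */
function extractNgrams(string $text, array $sizes): array
{
    $grams = [];
    $length = mb_strlen($text, 'UTF-8');
    foreach ($sizes as $n) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $grams[] = mb_substr($text, $i, $n, 'UTF-8');
        }
    }

    return $grams;
}

/**
 * Sums the log-probabilities of the input n-grams under one profile.
 * $floor is an assumed stand-in probability for n-grams absent from the profile.
 *
 * @param list<string>         $grams
 * @param array<string, float> $profile
 */
function logScore(array $grams, array $profile, float $floor = 1e-6): float
{
    $score = 0.0;
    foreach ($grams as $gram) {
        $score += log($profile[$gram] ?? $floor);
    }

    return $score;
}

/**
 * Numerically stable softmax: subtracts the maximum before exponentiating.
 *
 * @param array<string, float> $scores
 * @return array<string, float> confidences summing to 1
 */
function softmax(array $scores): array
{
    $max = max($scores);
    $exp = array_map(static fn (float $s): float => exp($s - $max), $scores);
    $sum = array_sum($exp);

    return array_map(static fn (float $e): float => $e / $sum, $exp);
}

// Toy profiles, invented for this example only.
$profiles = [
    'en' => ['t' => 0.5, 'h' => 0.3, 'th' => 0.2],
    'fr' => ['l' => 0.5, 'e' => 0.3, 'le' => 0.2],
];

$grams = extractNgrams('the', [1, 2]);
$scores = array_map(static fn (array $p): float => logScore($grams, $p), $profiles);
$confidences = softmax($scores);
arsort($confidences);

print_r($confidences);
```

Because log-probability sums grow in magnitude with the length of the input, softmax over raw sums tends to push the top confidence toward 1 for longer texts. Dividing each score by the number of n-grams before the softmax is one common way to soften this, if that behavior is undesirable.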

Rebuilding profiles

To regenerate the language profiles from fresh training data:

# 1. Download Tatoeba sentence corpora (requires ext-bz2)
php bin/download-corpus.php

# 2. Build PHP profile files
php bin/build-profiles.php

# Optional: restrict to specific languages
php bin/download-corpus.php en fr de
php bin/build-profiles.php en fr de --top=1000 --sizes=1,2,3,4

Development

composer install

# Run unit tests
composer test-unit

# Run accuracy tests (requires built profiles)
composer test-accuracy

# Code style check
composer cs

# Code style fix
composer cs-fix

# Static analysis (PHPStan level 9)
composer stan

License

MIT
