neuron-core/llm-classifier
Train and run classifiers that score how hard a prompt is for an LLM to answer.
Not every prompt is equally hard. This package builds small, fast classifiers that
map an incoming prompt to a difficulty score (0 = easy, 1 = hard), broken down by
the kind of task — math, writing, and so on. You train the classifier on your own
tasks and your own model lineup, so the score reflects what your models actually
find hard, not a guess.
That score is the missing input for smart routing: once you know how hard a prompt is, you send it to the cheapest model that can handle it — easy requests to a cheap/fast model, only genuinely hard ones to the expensive tier. This is what the classifiers are built for, and they're meant to be plugged into routers like neuron-core/router. Same quality where it matters, far lower bill.
It runs in pure PHP (needs only ext-mbstring) — no Python service, no GPU, no ML
runtime to deploy. Training happens once, offline; scoring runs in microseconds,
before you ever call the LLM.
The idea in one picture
```
                           ┌───────────────┐
incoming prompt ─────────▶ │  classifier   │ ──▶ difficulty score (0 = easy, 1 = hard)
                           └───────────────┘
                                   │
             ┌─────────────────────┼─────────────────────┐
             ▼                     ▼                     ▼
        cheap model          mid-tier model        premium model
       (GPT-4o-mini)            (GPT-4o)         (o1 / Claude Opus)
```
You decide the thresholds. A prompt like "what are your opening hours?" lands on the cheap model; "draft a non-disclosure agreement under Italian law" lands on the premium one — automatically.
Install
composer require neuron-core/llm-classifier
Requires just PHP 8.1 + ext-mbstring. It works with any provider through
neuron-ai (OpenAI, Anthropic, Gemini,
Mistral, Ollama, …).
How to use it — two phases
There are two things to do, and they happen at very different times:
- Calibrate (once, offline): teach the classifier what "easy" and "hard" look like for your tasks and your models. This produces a single `model.bin` file.
- Score & route (on every request, at runtime): load `model.bin` and use the score to pick a model. This is the part that runs in your live app.
Phase 1 — Calibrate (run once, from a script or console command)
You give it three things:
- a panel of your models (they attempt the tasks so we can learn what trips them up),
- a list of sample prompts with the correct answer or a rubric,
- the graders that decide if an answer is correct.
```php
use NeuronAI\Providers\OpenAI\OpenAI;
use NeuronCore\Classifier\Calibration\Calibrator;
use NeuronCore\Classifier\Calibration\Grader\ExactMatchGrader;
use NeuronCore\Classifier\Calibration\Grader\LlmJudgeGrader;
use NeuronCore\Classifier\Calibration\GraderResolver;
use NeuronCore\Classifier\Calibration\SeedCorpus;

// The models we want to route BETWEEN: they take the test.
$panel = [
    new OpenAI(apiKey: $cheapKey, model: 'gpt-4o-mini'),
    new OpenAI(apiKey: $premiumKey, model: 'gpt-4o'),
];

// A separate "judge" model grades the answers. Keep it OUT of the panel.
$judge = new OpenAI(apiKey: $key, model: 'gpt-4o');

$artifact = (new Calibrator(
    panel: $panel,
    corpus: SeedCorpus::fromFile('seed.csv'),
    graders: new GraderResolver([
        // Mechanical check: the answer must match exactly.
        'math' => new ExactMatchGrader(),
        // No single right answer: let the judge compare to a rubric.
        'writing' => new LlmJudgeGrader($judge),
    ]),
    language: 'en',
    fasttext: 'cc.en.300.vec', // download once from https://fasttext.cc/docs/en/crawl-vectors.html#models
))->run();

$artifact->writeTo('storage/model.bin'); // ship this file with your app
```
That's it. The output is one model.bin you commit alongside your code. When your
models improve (or your prices change), re-run this with the new panel of models and replace the binary file.
Do I need to understand the math? No. You provide prompts + answers + graders. The classifier figures out which prompt patterns are hard for your models. You never touch any equations.
What is that `cc.en.300.vec` file? A free, downloadable word-vector dictionary from fastText (embeddings Facebook trained on web crawls): grab `cc.<lang>.300.vec.gz`, `gunzip` it, and point the calibrator at the `.vec`. It maps each word to 300 numbers that capture meaning (`buy` and `purchase` land close together; `king` and `carburetor` far apart). The classifier is arithmetic on numbers, not words, so each prompt is first reduced to numbers: every word is looked up in this table and the vectors are averaged into one 300-number fingerprint (`Embeddings::meanPool`), and that is the model's only input. Only the words your corpus actually uses are kept, so the pruned table gets baked into `model.bin` and the fastText file is not needed at runtime. Don't want fastText? Inject your own `EmbeddingSource` with any vectors you like.
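The averaging step described above can be illustrated with a minimal sketch. This is not the package's `Embeddings::meanPool`; the function name `meanPoolSketch`, the tokenisation rule, and the 3-dimensional toy table are assumptions made for illustration (real fastText vectors have 300 dimensions).

```php
<?php
declare(strict_types=1);

/**
 * Average the vectors of all known words in a prompt (hypothetical helper).
 *
 * @param array<string, list<float>> $table word => vector
 * @return list<float>|null null when no word of the prompt is in the table
 */
function meanPoolSketch(string $prompt, array $table): ?array
{
    // Assumed tokenisation: lowercase, split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $sum = null;
    $known = 0;
    foreach ($words as $word) {
        if (!isset($table[$word])) {
            continue; // unknown words contribute nothing to the average
        }
        $vector = $table[$word];
        $sum ??= array_fill(0, count($vector), 0.0);
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
        $known++;
    }

    if ($known === 0) {
        return null;
    }

    return array_map(static fn (float $total): float => $total / $known, $sum);
}

// Toy table with made-up numbers.
$table = [
    'buy'      => [1.0, 0.0, 2.0],
    'purchase' => [3.0, 0.0, 2.0],
];

$fingerprint = meanPoolSketch('Buy or purchase?', $table);
```

Here `or` is not in the toy table, so the fingerprint is the average of the `buy` and `purchase` vectors only.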
Phase 2 — Score & route (in your live app)
Load the model once, then score each request with overall():
```php
use NeuronCore\Classifier\Classifier;
use NeuronAI\Chat\Messages\UserMessage;
use NeuronAI\Providers\OpenAI\OpenAI;

// Load ONCE, e.g. on app boot, or under Octane/RoadRunner/FrankenPHP workers.
$scorer = Classifier::load('storage/model.bin');

// Use on every request:
// 1) Guard first: how much of this prompt does the classifier actually recognize?
//    Low coverage = out-of-domain → don't trust the score, send to the strong model.
if ($scorer->coverage($userPrompt) < 0.4) {
    $model = 'o1'; // unfamiliar territory → safest, most capable model
} else {
    // 2) In-domain: route by difficulty. overall() returns ONE score in 0..1.
    $score = $scorer->overall($userPrompt);

    // Pick the model that's cheap enough for how easy this prompt is.
    $model = match (true) {
        $score < 0.33 => 'gpt-4o-mini', // easy → cheap & fast
        $score < 0.70 => 'gpt-4o',      // medium → solid all-rounder
        default       => 'o1',          // hard → most capable
    };
}

$provider = new OpenAI(apiKey: $key, model: $model);
$answer = $provider->chat(new UserMessage($userPrompt))->getContent();
```
That's the whole integration. overall() gives you one number to threshold
against — it's the max of the per-capability scores, i.e. "as hard as the
hardest thing this prompt touches". If you'd rather route differently depending on
the task type, use classify() to get the full per-capability map
(['math' => 0.82, 'writing' => 0.05, ...]) instead.
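As a rough sketch of routing by task type, the helper below takes a per-capability map shaped like the one above and applies a separate cut-off to each capability. The function name `pickModelByCapability` and all cut-off values are hypothetical; it works on a plain array, so it makes no assumption about the package beyond the map shape shown above.

```php
<?php
declare(strict_types=1);

/**
 * Pick a model from a per-capability score map (hypothetical helper).
 *
 * @param array<string, float> $scores  e.g. ['math' => 0.82, 'writing' => 0.05]
 * @param array<string, float> $cutoffs per-capability "too hard for the cheap model" limits
 */
function pickModelByCapability(array $scores, array $cutoffs, float $defaultCutoff = 0.5): string
{
    foreach ($scores as $capability => $score) {
        $cutoff = $cutoffs[$capability] ?? $defaultCutoff;
        if ($score >= $cutoff) {
            return 'gpt-4o'; // at least one capability is over its limit
        }
    }

    return 'gpt-4o-mini'; // every capability is under its limit
}

// Illustrative cut-offs: be stricter about math than about writing.
$cutoffs = ['math' => 0.6, 'writing' => 0.8];

$hardMath = pickModelByCapability(['math' => 0.82, 'writing' => 0.05], $cutoffs);
$easyBoth = pickModelByCapability(['math' => 0.30, 'writing' => 0.70], $cutoffs);
```

In a real app the first argument would be the result of `classify()`.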
The score
0 = your panel solved this easily → safe to send to a cheap model.
1 = your panel struggled → send to a capable model.
overall() returns one score — the max across capabilities, not the average.
A mean would let a single hard capability get watered down by all the capabilities
the prompt isn't about; max routes on the hardest thing the prompt actually
touches, which is what you want for cost routing.
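A small worked example of the max-versus-mean argument, using a made-up score map (the `code` capability and its value are invented for illustration) and the `0.33` / `0.70` cut-offs from the router above:

```php
<?php
declare(strict_types=1);

// Made-up per-capability scores: the prompt is really only about math.
$scores = ['math' => 0.82, 'writing' => 0.05, 'code' => 0.03];

$max  = max($scores);                         // hardest capability the prompt touches
$mean = array_sum($scores) / count($scores);  // diluted by the unrelated capabilities

/** Map a score to a tier with the cut-offs used in the router above. */
function tier(float $score): string
{
    return match (true) {
        $score < 0.33 => 'easy',
        $score < 0.70 => 'medium',
        default       => 'hard',
    };
}

printf("max  = %.2f -> %s\n", $max, tier($max));
printf("mean = %.2f -> %s\n", $mean, tier($mean));
```

With these numbers the max lands in the hard tier while the mean (about 0.30) lands in the easy tier, so routing on the mean would send a hard math prompt to the cheap model.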
There are two knobs to tune, and the numbers in the router above are just a starting point:
- Difficulty cut-offs (`0.33`, `0.70`): where easy ends and hard begins.
- Coverage cut-off (`0.4`): below this, a prompt is treated as out-of-domain and sent straight to the premium model regardless of its score.
To tune them: log the difficulty score, the coverage, and the model you would have used for real traffic, then adjust the cut-offs until you like the trade-off between cost and quality. Raise the coverage cut-off if you see out-of-domain prompts leaking through; lower the difficulty cut-offs, so that more prompts go to the stronger models, if cheap-model answers are coming back wrong.
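One way to compare candidate cut-offs is to replay logged scores offline and count how traffic would split across models. The sketch below assumes a hypothetical log format (an array of `coverage` / `score` pairs) and made-up entries; `routeLogged` and `trafficSplit` are illustrative names, not part of the package.

```php
<?php
declare(strict_types=1);

/** Route one logged request with the given cut-offs (same logic as the router above). */
function routeLogged(float $coverage, float $score, float $minCoverage, float $easyBelow, float $hardFrom): string
{
    if ($coverage < $minCoverage) {
        return 'o1'; // out-of-domain: ignore the score
    }

    return match (true) {
        $score < $easyBelow => 'gpt-4o-mini',
        $score < $hardFrom  => 'gpt-4o',
        default             => 'o1',
    };
}

/**
 * @param list<array{coverage: float, score: float}> $log
 * @return array<string, int> model => number of requests
 */
function trafficSplit(array $log, float $minCoverage, float $easyBelow, float $hardFrom): array
{
    $counts = ['gpt-4o-mini' => 0, 'gpt-4o' => 0, 'o1' => 0];
    foreach ($log as $entry) {
        $counts[routeLogged($entry['coverage'], $entry['score'], $minCoverage, $easyBelow, $hardFrom)]++;
    }

    return $counts;
}

// Made-up log entries, for illustration only.
$log = [
    ['coverage' => 0.9, 'score' => 0.25],
    ['coverage' => 0.8, 'score' => 0.40],
    ['coverage' => 0.7, 'score' => 0.75],
    ['coverage' => 0.2, 'score' => 0.05],
];

$current  = trafficSplit($log, 0.4, 0.33, 0.70); // starting-point cut-offs
$stricter = trafficSplit($log, 0.4, 0.20, 0.50); // lower cut-offs: less traffic to the cheap model
```

Comparing the two splits against your per-model prices and observed answer quality gives a concrete basis for choosing the cut-offs.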
The seed file
SeedCorpus::fromFile('seed.csv') reads a CSV with one task per line:
| prompt | capability | reference_type | reference | grader | difficulty |
|---|---|---|---|---|---|
| What is 2+2? | math | gold_answer | 4 | | |
| Write a haiku about autumn | writing | rubric | 5-7-5 syllables, seasonal imagery | | |
| Column | What to put |
|---|---|
| `prompt` | A representative task you actually receive. |
| `capability` | A group to train one scorer for (e.g. `math`, `writing`). |
| `reference_type` | `gold_answer` (one correct answer), `rubric` (criteria), or `none`. |
| `reference` | The expected answer, or the rubric text. |
| `grader` | Optional: override the grader for just this row. |
| `difficulty` | Optional: a precomputed 0..1 (higher = harder). See cold-start below. |
Tip: the more your seed prompts resemble your real traffic, the better the routing. A few hundred diverse examples is a solid pool.
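A minimal sketch of generating such a file from PHP, using the two example rows above. It assumes the file starts with a header row naming the six columns; check that against what `SeedCorpus::fromFile()` actually expects. The output path is arbitrary.

```php
<?php
declare(strict_types=1);

// Column order as documented above; the header row itself is an assumption.
$header = ['prompt', 'capability', 'reference_type', 'reference', 'grader', 'difficulty'];

$rows = [
    ['What is 2+2?', 'math', 'gold_answer', '4', '', ''],
    ['Write a haiku about autumn', 'writing', 'rubric', '5-7-5 syllables, seasonal imagery', '', ''],
];

$path = sys_get_temp_dir() . '/seed-sketch.csv';
$handle = fopen($path, 'w');

foreach ([$header, ...$rows] as $row) {
    // fputcsv quotes fields that contain the separator, such as the rubric text.
    fputcsv($handle, $row, ',', '"', '\\');
}

fclose($handle);
```

Writing through `fputcsv` rather than concatenating strings keeps rubric text with commas in a single column.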
Common questions
Is this an LLM call on every request? No. Scoring is pure PHP, microseconds, no network. The LLM calls only happen during the one-time calibration.
Which models should be in the panel? The ones you actually route between. The classifier learns what they find hard, so it routes correctly for your lineup.
Does it work in every language? Yes — pass the matching fastText
file and set language. Subword vectors handle typos and out-of-vocabulary words.
What if a prompt is nothing like my training data? Its difficulty score is
unreliable, so check coverage() first — the fraction of the prompt's words the
classifier recognizes. Low coverage means out-of-domain: skip the score and send
straight to the premium model. The copy-paste router above already does this.
```php
if ($scorer->coverage($userPrompt) < 0.4) { // too many unknown words
    $model = 'o1'; // don't trust the score → strongest model
} else {
    $score = $scorer->classify($userPrompt)['math'] ?? 0.0;
    // …route by score…
}
```
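To make "fraction of the prompt's words the classifier recognizes" concrete, here is a minimal sketch of such a ratio. It is not the package's `coverage()` implementation: the tokenisation rule, the name `coverageSketch`, and the toy vocabulary are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Fraction of a prompt's words found in a vocabulary (hypothetical helper).
 *
 * @param array<string, true> $vocabulary known words as array keys
 */
function coverageSketch(string $prompt, array $vocabulary): float
{
    // Assumed tokenisation: lowercase, split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    if ($words === []) {
        return 0.0; // nothing to recognize
    }

    $known = 0;
    foreach ($words as $word) {
        if (isset($vocabulary[$word])) {
            $known++;
        }
    }

    return $known / count($words);
}

// Toy vocabulary, as if the seed corpus only contained customer-service prompts.
$vocabulary = array_fill_keys(['what', 'are', 'your', 'opening', 'hours'], true);

$inDomain  = coverageSketch('What are your opening hours?', $vocabulary);
$partial   = coverageSketch('What are quasars?', $vocabulary);
$outDomain = coverageSketch('Explain quantum chromodynamics', $vocabulary);
```

With a `0.4` cut-off, the first two prompts would be scored normally and the third would go straight to the strongest model.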
Datasets
Ready-to-use precomputed-difficulty seed corpora for the classifier's zero-API-call
cold-start path. Each file is a plain CSV in the format SeedCorpus::fromFile() reads:
prompt, capability, reference_type, reference, grader, difficulty
Because every row carries a difficulty label, calibration needs no model panel and
no graders — only a fastText vector file. See
Cold-starting from a routing benchmark
in the project README.
routerbench.csv
A 1,845-row sample derived from the public
RouterBench benchmark
(withmartian/routerbench, 0-shot variant). RouterBench records, for ~36k prompts, whether
each of 11 of the most used LLMs (gpt-4-1106-preview, gpt-3.5-turbo-1106, claude-v2, claude-v1,
claude-instant-v1, mistralai/mixtral-8x7b-chat, mistralai/mistral-7b-chat,
meta/llama-2-70b-chat, meta/code-llama-instruct-34b-chat, zero-one-ai/Yi-34B-Chat,
WizardLM/WizardLM-13B-V1.2) answered correctly. We turn that into one label per prompt.
| Column | Value |
|---|---|
| `prompt` | The benchmark prompt (flattened from its [instruction, input] form). |
| `capability` | `general`: one shared difficulty head; route on `overall()`. |
| `reference_type` | `none`: nothing to grade, the label is precomputed. |
| `reference` | (empty) |
| `grader` | (empty) |
| `difficulty` | 1 − mean(correctness) across all 11 models, i.e. the fraction of models that got the query wrong. 0 = every model solved it (easy); 1 = none did (hard). |
The subset is stratified so it spans the difficulty range (mean 0.58, ~55 % of rows labelled hard) and covers all 86 source benchmarks in RouterBench (MMLU, GSM8K, HellaSwag, ARC, Winogrande, MBPP, MT-Bench, …). It is small enough to calibrate in pure PHP in seconds while staying diverse.
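The `difficulty` label defined in the table above is simple enough to sketch directly. The helper name `difficultyLabel` and the example correctness pattern are hypothetical; the formula is the one stated above, 1 − mean(correctness).

```php
<?php
declare(strict_types=1);

/**
 * Difficulty label for one prompt: the fraction of models that answered wrongly.
 *
 * @param list<bool> $correctness one entry per model, true = answered correctly
 */
function difficultyLabel(array $correctness): float
{
    if ($correctness === []) {
        throw new InvalidArgumentException('Need at least one model result.');
    }

    $correct = array_sum(array_map(static fn (bool $ok): int => $ok ? 1 : 0, $correctness));

    return 1.0 - $correct / count($correctness);
}

// Made-up example: 8 of 11 models answered correctly.
$results = [...array_fill(0, 8, true), ...array_fill(0, 3, false)];

$difficulty = difficultyLabel($results); // 3 of 11 models were wrong
```

A prompt that every model solves gets 0.0 and one that no model solves gets 1.0, matching the scale used everywhere else in this package.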
Build a model from it
```bash
# 1) One-time: download the fastText vectors
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
gunzip cc.en.300.vec.gz
mv cc.en.300.vec storage/

# 2) Run calibration; this generates the model file at storage/model.bin
php script/routerbench.php
```
Load it at runtime:
```php
use NeuronCore\Classifier\Classifier;

$scorer = Classifier::load('storage/model.bin');
$score = $scorer->overall($userPrompt); // 0 = easy, 1 = hard
```
Provenance & license
Derived from withmartian/routerbench on Hugging Face (0-shot pickle). RouterBench is
released by Withmartian under its own terms — please review the dataset card for attribution
and licensing before redistributing. This file is a transformed, down-sampled subset
produced for calibration convenience.
Development
```bash
composer format   # rector + php-cs-fixer
composer analyse  # PHPStan level 5 + 100% type coverage
composer test     # PHPUnit
```
License
MIT
Last updated: 2026-06-16