neuron-core/classifier
Composer install command:
composer require neuron-core/classifier
Package description
Model-Agnostic Difficulty Classifier
README
Spend less on LLMs by sending each request to the cheapest model that can handle it.
If you build AI features, you probably default everything to your most capable (and most expensive) model — because the one time you don't, a user gets a bad answer. This package looks at an incoming prompt and scores how hard it is. You use that score to route: simple requests go to a cheap/fast model, only genuinely hard requests go to the expensive one. Same quality where it matters, far lower bill.
It runs in pure PHP (needs only ext-mbstring) — no Python service, no GPU, no ML
runtime to deploy. It scores in microseconds, before you ever call the LLM.
The idea in one picture
```text
                           ┌───────────────┐
incoming prompt ─────────▶ │  classifier   │ ──▶ difficulty score (0 = easy, 1 = hard)
                           └───────────────┘
                                   │
             ┌─────────────────────┼─────────────────────┐
             ▼                     ▼                     ▼
        cheap model          mid-tier model         premium model
       (GPT-4o-mini)           (GPT-4o)          (o1 / Claude Opus)
```
You decide the thresholds. A prompt like "what are your opening hours?" lands on the cheap model; "draft a non-disclosure agreement under Italian law" lands on the premium one — automatically.
Install
composer require neuron-core/classifier
Requires PHP 8.2+ and ext-mbstring. It works with any provider through
neuron-ai (OpenAI, Anthropic, Gemini,
Mistral, Ollama, …).
How to use it — two phases
There are two things to do, and they happen at very different times:
- Calibrate (once, offline): teach the classifier what "easy" and "hard" look like for your tasks and your models. This produces a single `model.bin` file.
- Score & route (on every request, at runtime): load `model.bin` and use the score to pick a model. This is the part that runs in your live app.
Phase 1 — Calibrate (run once, from a script or console command)
You give it three things:
- a panel of your models (they attempt the tasks so we can learn what trips them up),
- a list of sample prompts with the correct answer or a rubric,
- the graders that decide if an answer is correct.
```php
use NeuronAI\Providers\OpenAI\OpenAIProvider;
use NeuronCore\Classifier\Calibration\Calibrator;
use NeuronCore\Classifier\Calibration\Grader\ExactMatchGrader;
use NeuronCore\Classifier\Calibration\Grader\LlmJudgeGrader;
use NeuronCore\Classifier\Calibration\GraderResolver;
use NeuronCore\Classifier\Calibration\SeedCorpus;

// The models we want to route BETWEEN: they take the test.
$panel = [
    new OpenAIProvider(apiKey: $cheapKey, model: 'gpt-4o-mini'),
    new OpenAIProvider(apiKey: $premiumKey, model: 'gpt-4o'),
];

// A separate "judge" model grades the answers. Keep it OUT of the panel.
$judge = new OpenAIProvider(apiKey: $key, model: 'gpt-4o');

$artifact = (new Calibrator(
    panel: $panel,
    corpus: SeedCorpus::fromFile('seed.csv'),
    graders: new GraderResolver([
        // Mechanical check: the answer must match exactly.
        'math' => new ExactMatchGrader(),
        // No single right answer: let the judge compare to a rubric.
        'writing' => new LlmJudgeGrader($judge),
    ]),
    language: 'en',
    fasttext: 'cc.en.300.vec', // download once from https://fasttext.cc/
))->run();

$artifact->writeTo('storage/model.bin'); // ship this file with your app
```
That's it. The output is one model.bin you commit alongside your code. When your
models improve (or your prices change), re-run this with the new panel of models and replace the binary file.
Do I need to understand the math? No. You provide prompts + answers + graders. The classifier figures out which prompt patterns are hard for your models. You never touch any equations.
What is that `cc.en.300.vec` file? A free, downloadable word-vector dictionary from fastText (embeddings Facebook trained on web crawls): grab `cc.<lang>.300.vec.gz` from the crawl-vectors page, `gunzip` it, and point the calibrator at the `.vec`. It maps each word to 300 numbers that capture meaning (`buy` and `purchase` land close together; `king` and `carburetor` far apart). The classifier is arithmetic on numbers, not words, so each prompt is first reduced to numbers: every word is looked up in this table and the vectors are averaged into one 300-number fingerprint (`Embeddings::meanPool`), and that is the model's only input. Only the words your corpus actually uses are kept, so the pruned table gets baked into `model.bin` and the fastText file is not needed at runtime. Don't want fastText? Inject your own `EmbeddingSource` with any vectors you like.
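The lookup-and-average step described above can be sketched in plain PHP. This is only an illustration of mean pooling, not the package's `Embeddings::meanPool` code: the function name `meanPoolSketch`, the regex tokenizer, and the toy 3-dimensional table are all assumptions made for the example.

```php
<?php
/**
 * Average the vectors of the words that appear in the table.
 * Hypothetical helper for illustration only.
 *
 * @param array<string, list<float>> $table word => vector
 * @return list<float>|null null when no word of the prompt is in the table
 */
function meanPoolSketch(string $prompt, array $table): ?array
{
    // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY);
    $sum = null;
    $hits = 0;
    foreach ($words as $word) {
        if (!isset($table[$word])) {
            continue; // unknown words contribute nothing to the average
        }
        $vector = $table[$word];
        $sum ??= array_fill(0, count($vector), 0.0);
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
        $hits++;
    }
    if ($hits === 0) {
        return null;
    }
    return array_map(static fn (float $total): float => $total / $hits, $sum);
}

// Toy table with 3 dimensions; real fastText vectors have 300.
$table = [
    'buy'      => [0.5, 0.25, 0.0],
    'purchase' => [0.5, 0.75, 0.0],
];

// "or" is not in the table, so only "buy" and "purchase" are averaged.
$fingerprint = meanPoolSketch('Buy or purchase?', $table);
```

Because the fingerprint is an average, its size does not depend on prompt length, which is what lets one small model score any prompt.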
Phase 2 — Score & route (in your live app)
Load the model once, then call coverage() and overall() (or classify()) on each request:
```php
use NeuronCore\Classifier\Classifier;
use NeuronAI\Chat\Messages\UserMessage;
use NeuronAI\Providers\OpenAI\OpenAI;

// Load ONCE, e.g. on app boot, or under Octane/RoadRunner/FrankenPHP workers.
$scorer = Classifier::load('storage/model.bin');

// Use on every request:

// 1) Guard first: how much of this prompt does the classifier actually recognize?
//    Low coverage = out-of-domain → don't trust the score, send to the strong model.
if ($scorer->coverage($userPrompt) < 0.4) {
    $model = 'o1'; // unfamiliar territory → safest, most capable model
} else {
    // 2) In-domain: route by difficulty. overall() returns ONE score in 0..1.
    $score = $scorer->overall($userPrompt);

    // Pick the model that's cheap enough for how easy this prompt is.
    $model = match (true) {
        $score < 0.33 => 'gpt-4o-mini', // easy → cheap & fast
        $score < 0.70 => 'gpt-4o',      // medium → solid all-rounder
        default       => 'o1',          // hard → most capable
    };
}

$provider = new OpenAI(apiKey: $key, model: $model);
$answer = $provider->chat(new UserMessage($userPrompt))->getContent();
```
That's the whole integration. overall() gives you one number to threshold
against — it's the max of the per-capability scores, i.e. "as hard as the
hardest thing this prompt touches". If you'd rather route differently depending on
the task type, use classify() to get the full per-capability map
(['math' => 0.82, 'writing' => 0.05, ...]) instead.
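If you route on the per-capability map, one possible approach is a separate cut-off per capability. The sketch below works on a plain array shaped like the `classify()` result shown above; the function `routeByCapability`, the tier names `cheap` and `premium`, and the cut-off values are hypothetical and not part of the package.

```php
<?php
/**
 * Send the prompt to the premium tier as soon as any capability
 * reaches its own cut-off. Hypothetical helper for illustration.
 *
 * @param array<string, float> $scores     capability => difficulty in 0..1
 * @param array<string, float> $thresholds capability => cut-off for the premium tier
 */
function routeByCapability(array $scores, array $thresholds, float $defaultThreshold = 0.5): string
{
    foreach ($scores as $capability => $score) {
        // Capabilities without an explicit cut-off use the default.
        $cutOff = $thresholds[$capability] ?? $defaultThreshold;
        if ($score >= $cutOff) {
            return 'premium';
        }
    }
    return 'cheap';
}

// Example: be stricter about math than about writing.
$thresholds = ['math' => 0.6, 'writing' => 0.8];
$tier = routeByCapability(['math' => 0.82, 'writing' => 0.05], $thresholds);
```

Per-capability cut-offs are useful when your cheap model is, say, fine at writing but weak at math, so the same score should not mean the same routing decision for both.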
The seed file
SeedCorpus::fromFile('seed.csv') reads a CSV with one task per line:
| prompt | capability | reference_type | reference | grader | difficulty |
|---|---|---|---|---|---|
| What is 2+2? | math | gold_answer | 4 | | |
| Write a haiku about autumn | writing | rubric | 5-7-5 syllables, seasonal imagery | | |

| Column | What to put |
|---|---|
| prompt | A representative task you actually receive. |
| capability | A group to train one scorer for (e.g. math, writing). |
| reference_type | gold_answer (one correct answer), rubric (criteria), or none. |
| reference | The expected answer, or the rubric text. |
| grader | Optional: override the grader for just this row. |
| difficulty | Optional: a precomputed 0..1 (higher = harder). See cold-start below. |
Tip: the more your seed prompts resemble your real traffic, the better the routing. A few hundred diverse examples is a solid pool.
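Rubrics often contain commas, so it is safer to generate the file with a CSV writer than by hand. A minimal sketch using PHP's built-in `fputcsv` follows. Whether `SeedCorpus::fromFile` expects a header row is not stated here, so the header line is an assumption, as is the temporary file path.

```php
<?php
$rows = [
    // prompt, capability, reference_type, reference, grader, difficulty
    ['What is 2+2?', 'math', 'gold_answer', '4', '', ''],
    ['Write a haiku about autumn', 'writing', 'rubric', '5-7-5 syllables, seasonal imagery', '', ''],
];

$path = sys_get_temp_dir() . '/seed.csv';
$handle = fopen($path, 'w');

// Assumption: the first line names the columns.
fputcsv($handle, ['prompt', 'capability', 'reference_type', 'reference', 'grader', 'difficulty'], ',', '"', '');
foreach ($rows as $row) {
    // fputcsv quotes fields that contain commas or spaces.
    fputcsv($handle, $row, ',', '"', '');
}
fclose($handle);

// Read the file back to check that the rubric survived as a single field.
$lines = file($path, FILE_IGNORE_NEW_LINES);
$haikuRow = str_getcsv($lines[2], ',', '"', '');
```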
Cold-starting from a routing benchmark
The Phase 1 calibration above spends API calls because the panel attempts every seed prompt so the classifier can learn what trips it up. If you already hold labelled data — for example a public LLM-routing benchmark like RouterBench, which records, for thousands of queries, which models answered correctly — you can feed those outcomes in directly and skip the panel entirely.
Add a difficulty to each row (0..1, higher = harder; you derive it however you
like from the benchmark — e.g. the fraction of your cheap-tier models that got the
query wrong). For a plain cost router, tag every row with a single general
capability: that trains one difficulty head and you route on overall() —
exactly the single-axis routing RouterBench benchmarks.
| prompt | capability | reference_type | reference | grader | difficulty |
|---|---|---|---|---|---|
| What is the capital of France? | general | gold_answer | Paris | | 0.0 |
| Derive the Black-Scholes PDE | general | rubric | risk-neutral pricing | | 1.0 |
| Summarize this article | general | rubric | faithful, concise | | 0.3 |
(You're not forced to use one capability — split rows into math, writing, … if
you want per-task-type heads. But you don't have to: one shared capability is the
simplest faithful cold-start.)
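As one way to derive the `difficulty` column, the "fraction of cheap-tier models that got the query wrong" idea mentioned above can be computed like this. The function name and the model names are placeholders, and the input shape (model name mapped to a correct/incorrect flag) is an assumption about how you have stored the benchmark outcomes.

```php
<?php
/**
 * Difficulty = share of models that answered the query incorrectly.
 * Hypothetical helper; adapt the input to your benchmark's format.
 *
 * @param array<string, bool> $outcomes model name => answered correctly?
 */
function difficultyFromOutcomes(array $outcomes): float
{
    if ($outcomes === []) {
        throw new InvalidArgumentException('Need at least one model outcome.');
    }
    $wrong = count(array_filter($outcomes, static fn (bool $correct): bool => !$correct));
    return $wrong / count($outcomes);
}

// Three of four placeholder cheap-tier models failed this query.
$difficulty = difficultyFromOutcomes([
    'cheap-model-a' => true,
    'cheap-model-b' => false,
    'cheap-model-c' => false,
    'cheap-model-d' => false,
]);
```

With only a handful of models the label is coarse (here it can only take five values), so more models or more graded attempts per query give smoother labels.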
When every row carries a difficulty, pass an empty panel and empty
graders. No row needs the panel, so calibration makes zero API calls:
```php
use NeuronCore\Classifier\Calibration\Calibrator;
use NeuronCore\Classifier\Calibration\GraderResolver;
use NeuronCore\Classifier\Calibration\SeedCorpus;

$model = (new Calibrator(
    panel: [],                        // no test-takers: labels are precomputed
    corpus: SeedCorpus::fromFile('routerbench.csv'),
    graders: new GraderResolver([]),  // nothing to grade
    language: 'en',
    fasttext: 'cc.en.300.vec',        // still need vectors; or inject EmbeddingSource
))->run();

$model->writeTo('model.bin'); // ships exactly like a panel-calibrated model
```
Labels vs. vectors: two different inputs, don't confuse them. The `difficulty` column is the label (how hard the prompt is); supplying it is what lets you drop the panel. `EmbeddingSource` (or the `fasttext` path) is only the vector table, i.e. where each word's numeric representation comes from. You can swap that too, e.g. by injecting your own embeddings, but swapping vectors alone does not skip the panel; only a precomputed `difficulty` does.
You can also mix the two: leave difficulty blank on rows you want the panel
to grade, and pre-fill it on the rest. A row with a difficulty is always labelled
from that value; a row without one falls back to the panel.
What the score means
0 = your panel solved this easily → safe to send to a cheap model.
1 = your panel struggled → send to a capable model.
overall() returns one score — the max across capabilities, not the average.
A mean would let a single hard capability get watered down by all the capabilities
the prompt isn't about; max routes on the hardest thing the prompt actually
touches, which is what you want for cost routing.
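A small worked example of the max-versus-mean point, using the cut-offs from the router above. The `tierFor` helper and the score values are made up for the illustration.

```php
<?php
/** Map a 0..1 difficulty score to a model, using the README's starting cut-offs. */
function tierFor(float $score): string
{
    return match (true) {
        $score < 0.33 => 'gpt-4o-mini',
        $score < 0.70 => 'gpt-4o',
        default       => 'o1',
    };
}

// A prompt that is hard on one capability and trivial on the others.
$scores = ['math' => 0.82, 'writing' => 0.05, 'coding' => 0.03];

$byMax  = tierFor(max($scores));                        // routes on the hardest capability
$byMean = tierFor(array_sum($scores) / count($scores)); // hard capability is diluted
```

Here the mean is about 0.30, which would send a hard math prompt to the cheapest model, while the max of 0.82 sends it to the most capable one.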
There are two knobs to tune, and the numbers in the router above are just a starting point:
- Difficulty cut-offs (`0.33`, `0.70`): where easy ends and hard begins.
- Coverage cut-off (`0.4`): below this, a prompt is treated as out-of-domain and sent straight to the premium model regardless of its score.

To tune them: log the difficulty score, the coverage, and the model you would have used for real traffic, then adjust the cut-offs until you like the trade-off between cost and quality. Raise the coverage cut-off if you see out-of-domain prompts leaking through; lower the difficulty cut-offs (so fewer prompts reach the cheaper tiers) if cheap-model answers are coming back wrong.
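One way to compare candidate cut-offs on logged traffic is sketched below. It assumes you have, for a sample of requests, the logged score plus a flag saying whether the cheap model's answer was acceptable. That flag has to come from your own review or evaluation, and the `cheapOk` field name and `evaluateCutOff` function are hypothetical.

```php
<?php
/**
 * For a candidate "easy" cut-off, report how much traffic would go to the
 * cheap model and how often the cheap model would have been wrong on it.
 *
 * @param list<array{score: float, cheapOk: bool}> $log
 * @return array{cheapShare: float, cheapErrorRate: float}
 */
function evaluateCutOff(array $log, float $cutOff): array
{
    $cheap = array_filter($log, static fn (array $row): bool => $row['score'] < $cutOff);
    $errors = array_filter($cheap, static fn (array $row): bool => !$row['cheapOk']);

    return [
        'cheapShare'     => $log === [] ? 0.0 : fdiv(count($cheap), count($log)),
        'cheapErrorRate' => $cheap === [] ? 0.0 : fdiv(count($errors), count($cheap)),
    ];
}

// Toy log; real logs would hold hundreds of sampled requests.
$log = [
    ['score' => 0.10, 'cheapOk' => true],
    ['score' => 0.20, 'cheapOk' => true],
    ['score' => 0.30, 'cheapOk' => false],
    ['score' => 0.80, 'cheapOk' => false],
];

foreach ([0.25, 0.33] as $cutOff) {
    $result = evaluateCutOff($log, $cutOff);
    printf(
        "cut-off %.2f: %.0f%% to cheap, %.0f%% of those wrong\n",
        $cutOff,
        100 * $result['cheapShare'],
        100 * $result['cheapErrorRate']
    );
}
```

A higher cut-off saves more money but lets more wrong answers through; the sweep makes that trade-off visible before you change production routing.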
Common questions
Is this an LLM call on every request? No. Scoring is pure PHP, microseconds, no network. The LLM calls only happen during the one-time calibration.
Which models should be in the panel? The ones you actually route between. The classifier learns what they find hard, so it routes correctly for your lineup.
Does it work in every language? Yes, for any language with a matching fastText vector file (fastText publishes pre-trained vectors for 157 languages): pass that file and set language. Subword vectors handle typos and out-of-vocabulary words.
What if a prompt is nothing like my training data? Its difficulty score is
unreliable, so check coverage() first — the fraction of the prompt's words the
classifier recognizes. Low coverage means out-of-domain: skip the score and send
straight to the premium model. The copy-paste router above already does this.
```php
if ($scorer->coverage($userPrompt) < 0.4) {   // too many unknown words
    $model = 'o1';                            // don't trust the score → strongest model
} else {
    $score = $scorer->classify($userPrompt)['math'] ?? 0.0;
    // …route by score…
}
```
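To make "the fraction of the prompt's words the classifier recognizes" concrete, here is a plain-PHP sketch of that ratio. It is not the package's `coverage()` implementation: the function name, the regex tokenizer, and the toy vocabulary are assumptions.

```php
<?php
/**
 * Share of the prompt's words that appear in a known vocabulary.
 * Hypothetical helper for illustration.
 *
 * @param array<string, true> $vocabulary known words as array keys
 */
function coverageSketch(string $prompt, array $vocabulary): float
{
    // Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false || $words === []) {
        return 0.0; // nothing recognizable at all
    }
    $known = count(array_filter($words, static fn (string $word): bool => isset($vocabulary[$word])));
    return fdiv($known, count($words));
}

$vocabulary = array_fill_keys(['what', 'are', 'your', 'opening', 'hours'], true);

$inDomain  = coverageSketch('What are your opening hours?', $vocabulary);
$partial   = coverageSketch('What are your kombucha hours?', $vocabulary);
$outDomain = coverageSketch('Derive the Black-Scholes PDE', $vocabulary);
```

With the `0.4` cut-off from the router, the first two prompts would be scored normally and the third would go straight to the strongest model.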
Going further
- Multiple capabilities: `classify()` returns a score per capability (`['math' => …, 'writing' => …]`). Route on the relevant one, or combine them.
- Custom graders: anything implementing the one-method `Grader` contract works, e.g. a test runner that executes generated code and checks it passes.
- Precomputed labels: supply a `difficulty` per row to cold-start from a routing benchmark like RouterBench with zero API calls; see Cold-starting from a routing benchmark.
- Calibration options: `Calibrator` accepts `hardThreshold` (default `0.5`), an injectable `embeddingSource` (swap the vector source), and a custom `tokenizer`.
Development
```bash
composer format   # rector + php-cs-fixer
composer analyse  # PHPStan level 5 + 100% type coverage
composer test     # PHPUnit
```
Internals (for the curious)
- The classifier is a mean-pooled static word embedding fed into one tiny logistic model per capability. Heavy work (the panel, graders, fitting) happens only at calibration; the runtime is just a lookup + average + a sigmoid.
- The model file (`model.bin`) is a versioned binary blob: meta, the pruned embedding table, and each capability's weights + bias. The runtime reads only this file.
- `Calibrator` is the only class that talks to `AIProviderInterface`; the resulting model is model-agnostic.
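The "weights + bias, then a sigmoid" step can be sketched as follows. This shows the standard logistic-regression scoring formula on a toy 3-dimensional fingerprint; the function name and the numbers are invented for the example, and the package's actual code and file layout may differ.

```php
<?php
/**
 * Logistic score: sigmoid(weights · fingerprint + bias), a value in 0..1.
 * Hypothetical helper for illustration.
 *
 * @param list<float> $fingerprint mean-pooled prompt vector
 * @param list<float> $weights     one weight per dimension
 */
function logisticScore(array $fingerprint, array $weights, float $bias): float
{
    if (count($fingerprint) !== count($weights)) {
        throw new InvalidArgumentException('Dimension mismatch.');
    }
    $z = $bias;
    foreach ($fingerprint as $i => $value) {
        $z += $weights[$i] * $value; // dot product, accumulated term by term
    }
    return 1.0 / (1.0 + exp(-$z));
}

// Toy values: the weighted sum is 0, so the score sits exactly mid-scale.
$score = logisticScore([0.5, 0.5, 0.0], [2.0, -2.0, 1.0], 0.0);
```

Scoring one capability is a single pass over the vector, which is consistent with the microsecond-scale runtime described earlier.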
License
MIT
Other information
- License: MIT
- Last updated: 2026-06-16