README

Spend less on LLMs by sending each request to the cheapest model that can handle it.

If you build AI features, you probably default everything to your most capable (and most expensive) model — because the one time you don't, a user gets a bad answer. This package looks at an incoming prompt and scores how hard it is. You use that score to route: simple requests go to a cheap/fast model, only genuinely hard requests go to the expensive one. Same quality where it matters, far lower bill.

It runs in pure PHP (needs only ext-mbstring) — no Python service, no GPU, no ML runtime to deploy. It scores in microseconds, before you ever call the LLM.

The idea in one picture

                              ┌───────────────┐
   incoming prompt ─────────▶ │  classifier   │ ──▶ difficulty score (0 = easy, 1 = hard)
                              └───────────────┘
                                       │
                 ┌─────────────────────┼─────────────────────┐
                 ▼                     ▼                      ▼
          cheap model           mid-tier model          premium model
            (GPT-4o-mini)        (GPT-4o)               (o1 / Claude Opus)

You decide the thresholds. A prompt like "what are your opening hours?" lands on the cheap model; "draft a non-disclosure agreement under Italian law" lands on the premium one — automatically.

Install

composer require neuron-core/classifier

Requires PHP 8.2+ and ext-mbstring. It works with any provider through neuron-ai (OpenAI, Anthropic, Gemini, Mistral, Ollama, …).

How to use it — two phases

There are two things to do, and they happen at very different times:

Calibrate (once, offline) — teach the classifier what "easy" and "hard" look like for your tasks and your models. This produces a single model.bin file.
Score & route (on every request, at runtime) — load model.bin and use the score to pick a model. This is the part that runs in your live app.

Phase 1 — Calibrate (run once, from a script or console command)

You give it three things:

a panel of your models (they attempt the tasks so we can learn what trips them up),
a list of sample prompts with the correct answer or a rubric,
the graders that decide if an answer is correct.

use NeuronAI\Providers\OpenAI\OpenAI;
use NeuronCore\Classifier\Calibration\Calibrator;
use NeuronCore\Classifier\Calibration\Grader\ExactMatchGrader;
use NeuronCore\Classifier\Calibration\Grader\LlmJudgeGrader;
use NeuronCore\Classifier\Calibration\GraderResolver;
use NeuronCore\Classifier\Calibration\SeedCorpus;

// The models we want to route BETWEEN: they take the test.
$panel = [
    new OpenAI(apiKey: $cheapKey,  model: 'gpt-4o-mini'),
    new OpenAI(apiKey: $premiumKey, model: 'gpt-4o'),
];

// A separate "judge" model grades the answers. Keep it OUT of the panel,
// so no model ends up grading its own answers.
$judge = new OpenAI(apiKey: $key, model: 'o1');

$artifact = (new Calibrator(
    panel:    $panel,
    corpus:   SeedCorpus::fromFile('seed.csv'),
    graders:  new GraderResolver([
        // Mechanical check: the answer must match exactly.
        'math' => new ExactMatchGrader(),
        // No single right answer: let the judge compare to a rubric.
        'writing' => new LlmJudgeGrader($judge),
    ]),
    language: 'en',
    fasttext: 'cc.en.300.vec',   // download once from https://fasttext.cc/
))->run();

$artifact->writeTo('storage/model.bin'); // ship this file with your app

That's it. The output is one model.bin you commit alongside your code. When your models improve (or your prices change), re-run this with the new panel of models and replace the binary file.

Do I need to understand the math? No. You provide prompts + answers + graders. The classifier figures out which prompt patterns are hard for your models. You never touch any equations.

What is that cc.en.300.vec file? A free, downloadable word-vector table from fastText, trained by Facebook on Common Crawl and Wikipedia. Grab cc.<lang>.300.vec.gz from the crawl-vectors page, gunzip it, and point the calibrator at the .vec file. It maps each word to 300 numbers that capture meaning (buy and purchase land close together; king and carburetor far apart). The classifier does arithmetic on numbers, not words, so each prompt is first reduced to numbers: every word is looked up in this table and the vectors are averaged into one 300-number fingerprint (Embeddings::meanPool). That fingerprint is the model's only input. Only the words your corpus actually uses are kept, and that pruned table is baked into model.bin, so the fastText file is not needed at runtime. Don't want fastText? Inject your own EmbeddingSource with any vectors you like.
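
The fingerprint step can be sketched in a few lines. This is only an illustration with a toy 3-dimensional table, not the package's Embeddings::meanPool: the function name meanPoolSketch, the regex tokenizer, and the vectors are all assumptions made up for the example.

```php
<?php
declare(strict_types=1);

/**
 * Average the vectors of all known words in a prompt.
 * Hypothetical helper for illustration; the package's tokenizer and table may differ.
 *
 * @param array<string, list<float>> $table word => vector
 * @return list<float> mean vector (all zeros if no word is known)
 */
function meanPoolSketch(string $prompt, array $table, int $dim): array
{
    // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $sum = array_fill(0, $dim, 0.0);
    $known = 0;
    foreach ($tokens as $token) {
        if (!isset($table[$token])) {
            continue; // unknown words contribute nothing to the average
        }
        foreach ($table[$token] as $i => $value) {
            $sum[$i] += $value;
        }
        $known++;
    }

    if ($known === 0) {
        return $sum; // no known word: return the zero vector
    }

    return array_map(static fn (float $v): float => $v / $known, $sum);
}

// Toy table; real fastText vectors have 300 dimensions.
$table = [
    'buy'      => [1.0, 0.0, 0.0],
    'purchase' => [0.8, 0.2, 0.0],
];

// "or" is not in the table, so only "buy" and "purchase" are averaged.
$fingerprint = meanPoolSketch('Buy or purchase?', $table, 3);
```

Averaging discards word order, so two prompts that use the same words in a different order get the same fingerprint.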

Phase 2 — Score & route (in your live app)

Load the model once, then call coverage() and overall() on each request:

use NeuronCore\Classifier\Classifier;
use NeuronAI\Chat\Messages\UserMessage;
use NeuronAI\Providers\OpenAI\OpenAI;

// Load ONCE — e.g. on app boot, or under Octane/RoadRunner/FrankenPHP workers.
$scorer = Classifier::load('storage/model.bin');

// Use on every request:

// 1) Guard first: how much of this prompt does the classifier actually recognize?
//    Low coverage = out-of-domain → don't trust the score, send to the strong model.
if ($scorer->coverage($userPrompt) < 0.4) {
    $model = 'o1';   // unfamiliar territory → safest, most capable model
} else {
    // 2) In-domain: route by difficulty. overall() returns ONE score in 0..1.
    $score = $scorer->overall($userPrompt);

    // Pick the model that's cheap enough for how easy this prompt is.
    $model = match (true) {
        $score < 0.33 => 'gpt-4o-mini',   // easy  → cheap & fast
        $score < 0.70 => 'gpt-4o',        // medium → solid all-rounder
        default       => 'o1',            // hard  → most capable
    };
}

$provider = new OpenAI(apiKey: $key, model: $model);
$answer = $provider->chat(new UserMessage($userPrompt))->getContent();

That's the whole integration. overall() gives you one number to threshold against — it's the max of the per-capability scores, i.e. "as hard as the hardest thing this prompt touches". If you'd rather route differently depending on the task type, use classify() to get the full per-capability map (['math' => 0.82, 'writing' => 0.05, ...]) instead.
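
If you route per task type, a map shaped like the classify() example can be thresholded capability by capability. The sketch below works on a plain array. The function name routeByCapability, the per-capability cut-offs, and the 0.7 default are assumptions for illustration, not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Pick a model from a per-capability score map.
 * Hypothetical helper: $cutoffs holds, per capability, the score at or above
 * which that capability alone justifies the premium model.
 *
 * @param array<string, float> $scores  e.g. ['math' => 0.82, 'writing' => 0.05]
 * @param array<string, float> $cutoffs e.g. ['math' => 0.6, 'writing' => 0.8]
 */
function routeByCapability(array $scores, array $cutoffs, string $cheap, string $premium): string
{
    foreach ($scores as $capability => $score) {
        // Capabilities without an explicit cut-off fall back to an assumed default.
        $cutoff = $cutoffs[$capability] ?? 0.7;
        if ($score >= $cutoff) {
            return $premium;
        }
    }

    return $cheap;
}

$cutoffs = ['math' => 0.6, 'writing' => 0.8];

$hardMath   = routeByCapability(['math' => 0.82, 'writing' => 0.05], $cutoffs, 'gpt-4o-mini', 'o1');
$easyPrompt = routeByCapability(['math' => 0.10, 'writing' => 0.30], $cutoffs, 'gpt-4o-mini', 'o1');
```

Separate cut-offs let you be stricter on capabilities where a wrong answer costs more.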

The seed file

SeedCorpus::fromFile('seed.csv') reads a CSV with one task per line:

prompt	capability	reference_type	reference	grader	difficulty
What is 2+2?	math	gold_answer	4
Write a haiku about autumn	writing	rubric	5-7-5 syllables, seasonal imagery

Column          What to put
prompt          A representative task you actually receive.
capability      A group to train one scorer for (e.g. math, writing).
reference_type  gold_answer (one correct answer), rubric (criteria), or none.
reference       The expected answer, or the rubric text.
grader          Optional: override the grader for just this row.
difficulty      Optional: a precomputed 0..1 (higher = harder). See cold-start below.

Tip: the more your seed prompts resemble your real traffic, the better the routing. A few hundred diverse examples is a solid pool.

Cold-starting from a routing benchmark

The Phase 1 calibration above spends API calls because the panel attempts every seed prompt so the classifier can learn what trips it up. If you already hold labelled data — for example a public LLM-routing benchmark like RouterBench, which records, for thousands of queries, which models answered correctly — you can feed those outcomes in directly and skip the panel entirely.

Add a difficulty to each row (0..1, higher = harder). You derive it however you like from the benchmark, e.g. the fraction of your cheap-tier models that got the query wrong. For a plain cost router, tag every row with a single general capability: that trains one difficulty head and you route on overall(), which is exactly the single-axis routing RouterBench benchmarks.

prompt	capability	reference_type	reference	difficulty
What is the capital of France?	general	gold_answer	Paris	0.0
Derive the Black-Scholes PDE	general	rubric	risk-neutral pricing	1.0
Summarize this article	general	rubric	faithful, concise	0.3

(You're not forced to use one capability — split rows into math, writing, … if you want per-task-type heads. But you don't have to: one shared capability is the simplest faithful cold-start.)
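
One way to derive the difficulty column is the "fraction of cheap-tier models that got the query wrong" rule mentioned above. This is a sketch under assumptions: the layout of the outcome arrays, the model names, and the function name difficultyFromOutcomes are made up. Adapt them to however your benchmark stores results.

```php
<?php
declare(strict_types=1);

/**
 * Difficulty = share of cheap-tier models that answered incorrectly.
 * Hypothetical helper for illustration.
 *
 * @param array<string, bool> $outcomes model name => answered correctly?
 */
function difficultyFromOutcomes(array $outcomes): float
{
    if ($outcomes === []) {
        throw new InvalidArgumentException('Need at least one model outcome.');
    }

    $wrong = count(array_filter($outcomes, static fn (bool $correct): bool => !$correct));

    return $wrong / count($outcomes);
}

// Made-up benchmark rows: a prompt plus per-model correctness.
$rows = [
    ['prompt' => 'What is the capital of France?', 'outcomes' => ['model-a' => true,  'model-b' => true]],
    ['prompt' => 'Derive the Black-Scholes PDE',   'outcomes' => ['model-a' => false, 'model-b' => false]],
    ['prompt' => 'Summarize this article',         'outcomes' => ['model-a' => true,  'model-b' => false]],
];

$difficulties = [];
foreach ($rows as $row) {
    $difficulties[$row['prompt']] = difficultyFromOutcomes($row['outcomes']);
}
```

With only a few models per query the labels are coarse (here only 0, 0.5, or 1). More cheap-tier models give a finer scale.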

When every row carries a difficulty, pass an empty panel and empty graders. No row needs the panel, so calibration makes zero API calls:

use NeuronCore\Classifier\Calibration\Calibrator;
use NeuronCore\Classifier\Calibration\GraderResolver;
use NeuronCore\Classifier\Calibration\SeedCorpus;

$artifact = (new Calibrator(
    panel:    [],                                  // no test-takers: labels are precomputed
    corpus:   SeedCorpus::fromFile('routerbench.csv'),
    graders:  new GraderResolver([]),              // nothing to grade
    language: 'en',
    fasttext: 'cc.en.300.vec',                     // still need vectors; or inject EmbeddingSource
))->run();

$artifact->writeTo('model.bin');                   // ships exactly like a panel-calibrated model

Labels vs. vectors — two different inputs, don't confuse them. The difficulty column is the label (how hard the prompt is); supplying it is what lets you drop the panel. EmbeddingSource (or the fasttext path) is only the vector table — where each word's numeric representation comes from. You can swap that too — e.g. inject your own embeddings — but swapping vectors alone does not skip the panel; only precomputed difficulty does.

You can also mix the two: leave difficulty blank on rows you want the panel to grade, and pre-fill it on the rest. A row with a difficulty is always labelled from that value; a row without one falls back to the panel.

What the score means

0 = your panel solved this easily → safe to send to a cheap model. 1 = your panel struggled → send to a capable model.

overall() returns one score — the max across capabilities, not the average. A mean would let a single hard capability get watered down by all the capabilities the prompt isn't about; max routes on the hardest thing the prompt actually touches, which is what you want for cost routing.
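
A small numeric illustration of why max is the safer aggregate. The score map below is made up and the support capability is hypothetical. The cut-offs are the 0.33 and 0.70 used in the router example.

```php
<?php
declare(strict_types=1);

// One hard axis, two axes the prompt is not about.
$scores = ['math' => 0.82, 'writing' => 0.05, 'support' => 0.03];

$max  = max($scores);                        // 0.82: stays above the 0.70 "hard" cut-off
$mean = array_sum($scores) / count($scores); // about 0.30: would drop below the 0.33 "easy" cut-off

$routedByMax  = $max >= 0.70 ? 'premium' : 'cheaper tier';
$routedByMean = $mean < 0.33 ? 'cheap' : 'higher tier';
```

With the mean, a hard math prompt would be sent to the cheap model just because it is not also a writing or support task.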

There are two knobs to tune, and the numbers in the router above are just a starting point:

Difficulty cut-offs (0.33, 0.70): where easy ends and hard begins.
Coverage cut-off (0.4): below this, a prompt is treated as out-of-domain and sent straight to the premium model regardless of its score.

To tune them: log the difficulty score, the coverage, and the model you would have used for real traffic, then adjust the cut-offs until you like the trade-off between cost and quality. Raise the coverage cut-off if out-of-domain prompts are leaking through to the cheaper tiers. Lower the difficulty cut-offs if answers from the cheaper tiers are coming back wrong: a lower cut-off moves more prompts up to the next model, at a higher cost.
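
Once you have such a log, you can replay it against candidate cut-offs to see how traffic would shift between tiers before changing anything in production. A minimal sketch, assuming each log entry stores the coverage and the overall score. The function name tierShares, the tier labels, and the log entries are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Count how many logged requests each tier would receive under the given cut-offs.
 * Hypothetical helper; mirrors the coverage guard and thresholds of the router example.
 *
 * @param list<array{coverage: float, score: float}> $log
 * @return array{cheap: int, mid: int, premium: int}
 */
function tierShares(array $log, float $minCoverage, float $easyBelow, float $hardFrom): array
{
    $counts = ['cheap' => 0, 'mid' => 0, 'premium' => 0];

    foreach ($log as $entry) {
        $tier = match (true) {
            $entry['coverage'] < $minCoverage => 'premium', // out-of-domain guard
            $entry['score'] < $easyBelow      => 'cheap',
            $entry['score'] < $hardFrom       => 'mid',
            default                           => 'premium',
        };
        $counts[$tier]++;
    }

    return $counts;
}

// Made-up log entries.
$log = [
    ['coverage' => 0.9, 'score' => 0.10],
    ['coverage' => 0.8, 'score' => 0.50],
    ['coverage' => 0.7, 'score' => 0.90],
    ['coverage' => 0.2, 'score' => 0.05], // low coverage: premium regardless of score
];

$current  = tierShares($log, 0.4, 0.33, 0.70); // the starting cut-offs
$stricter = tierShares($log, 0.4, 0.05, 0.40); // lower cut-offs push traffic upward
```

Multiplying each count by that tier's per-request price gives a rough cost estimate for each candidate setting.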

Common questions

Is this an LLM call on every request? No. Scoring is pure PHP, takes microseconds, and needs no network. LLM calls happen only during calibration, and a cold start from precomputed difficulties makes none.

Which models should be in the panel? The ones you actually route between. The classifier learns what they find hard, so it routes correctly for your lineup.

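Does it work in every language? It works in any language fastText publishes vectors for (157 languages): pass the matching cc.<lang>.300.vec file and set language. Note that .vec files hold whole-word vectors only, so typos and out-of-vocabulary words are not recovered from subwords at runtime. They lower coverage() instead, which is one more reason to check it.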

What if a prompt is nothing like my training data? Its difficulty score is unreliable, so check coverage() first — the fraction of the prompt's words the classifier recognizes. Low coverage means out-of-domain: skip the score and send straight to the premium model. The copy-paste router above already does this.

if ($scorer->coverage($userPrompt) < 0.4) {   // too many unknown words
    $model = 'o1';                              // don't trust the score → strongest model
} else {
    $score = $scorer->classify($userPrompt)['math'] ?? 0.0;
    // …route by score…
}
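
To make "fraction of the prompt's words the classifier recognizes" concrete, here is a sketch of how such a coverage figure can be computed. It is a hypothetical re-implementation, not the package's coverage(): the function name coverageSketch, the regex tokenizer, and the tiny vocabulary are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Fraction of a prompt's words found in the vocabulary.
 * Hypothetical helper for illustration only.
 *
 * @param array<string, true> $vocabulary set of known words
 */
function coverageSketch(string $prompt, array $vocabulary): float
{
    // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($prompt), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    if ($tokens === []) {
        return 0.0; // nothing recognisable: treat as out-of-domain
    }

    $known = count(array_filter($tokens, static fn (string $t): bool => isset($vocabulary[$t])));

    return $known / count($tokens);
}

$vocabulary = ['what' => true, 'are' => true, 'your' => true, 'opening' => true, 'hours' => true];

$inDomain  = coverageSketch('What are your opening hours?', $vocabulary);   // all five words known
$outDomain = coverageSketch('Explain quantum chromodynamics', $vocabulary); // no word known
```

A prompt made entirely of unknown words would otherwise be scored from an empty fingerprint, which is why the guard runs before the score is used.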

Going further

Multiple capabilities: classify() returns a score per capability (['math' => …, 'writing' => …]). Route on the relevant one, or combine them.
Custom graders: anything implementing the one-method Grader contract works — e.g. a test runner that executes generated code and checks it passes.
Precomputed labels: supply a difficulty per row to cold-start from a routing benchmark like RouterBench with zero API calls — see Cold-starting from a routing benchmark.
Calibration options: Calibrator accepts hardThreshold (default 0.5), an injectable embeddingSource (swap the vector source), and a custom tokenizer.

Development

composer format    # rector + php-cs-fixer
composer analyse   # PHPStan level 5 + 100% type coverage
composer test      # PHPUnit

Internals (for the curious)

The classifier is a mean-pooled static word embedding fed into one tiny logistic model per capability. Heavy work (the panel, graders, fitting) happens only at calibration; the runtime is just a lookup + average + a sigmoid.
The model file (model.bin) is a versioned binary blob: meta, the pruned embedding table, and each capability's weights + bias. The runtime reads only this file.
Calibrator is the only class that talks to AIProviderInterface; the resulting model is model-agnostic.
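
The "lookup + average + a sigmoid" runtime can be pictured as one dot product per capability. The sketch below shows a single logistic head applied to an already pooled vector. The function name logisticScore, the weights, and the bias are made up, and real vectors have 300 dimensions rather than 3.

```php
<?php
declare(strict_types=1);

/**
 * Score a pooled vector with one logistic head: sigmoid(w . x + b).
 * Hypothetical sketch; not the package's internal code.
 *
 * @param list<float> $vector  pooled prompt fingerprint
 * @param list<float> $weights one weight per dimension
 */
function logisticScore(array $vector, array $weights, float $bias): float
{
    $z = $bias;
    foreach ($vector as $i => $x) {
        $z += $weights[$i] * $x;
    }

    return 1.0 / (1.0 + exp(-$z));
}

$neutral = logisticScore([0.0, 0.0, 0.0], [2.0, -1.0, 0.5], 0.0); // z = 0, so the score is 0.5
$hard    = logisticScore([0.9, 0.1, 0.0], [4.0, -1.0, 0.5], 0.0); // z = 3.5, score close to 1
```

One such head per capability yields the per-capability map, and the maximum over heads yields the single overall score.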

License

MIT
