carmelosantana/coqui-harbor-external
Latest stable version: v0.1.0
Composer install command:
composer require carmelosantana/coqui-harbor-external
Package description
Harbor benchmarking toolkit for Coqui — task management, eval execution, and result analysis via the Harbor CLI
README
Harbor benchmarking toolkit for Coqui. Manage tasks, run evaluations, and analyze benchmark results via the Harbor CLI.
Requirements
- PHP 8.4+
- Harbor CLI (uv tool install harbor)
- Docker (for local evaluations)
- Coqui
Installation
composer require carmelosantana/coqui-harbor-external
The toolkit is auto-discovered by Coqui — no code changes needed.
Tools Provided
Discovery & Validation
| Tool | Description |
|---|---|
| harbor_check | Verify Harbor CLI, Python, Docker, and uv are installed |
| harbor_task_validate | Validate a task directory has the required structure |
| harbor_dataset_list | List registered datasets from the Harbor registry |
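As a rough illustration of the kind of prerequisite check that harbor_check describes, the sketch below looks up each executable on PATH. The function name `find_missing_tools` and the exact executable names (`harbor`, `python3`, `docker`, `uv`) are assumptions made for this example, not the tool's actual implementation.

```python
import shutil

# Executable names are assumptions for illustration; the real harbor_check
# tool may probe different binaries or also verify versions.
REQUIRED_TOOLS = ("harbor", "python3", "docker", "uv")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be resolved on PATH."""
    return [name for name in tools if which(name) is None]
```

Calling `find_missing_tools()` with no arguments returns an empty list when every prerequisite is found. The `which` parameter exists only so the lookup can be stubbed in tests.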
Task Authoring
| Tool | Description |
|---|---|
| harbor_task_init | Scaffold a new task directory (instruction.md, task.toml, environment/, tests/) |
| harbor_task_list | List all tasks in a local dataset directory |
| harbor_task_delete | Delete a task directory (gated, requires confirmation) |
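The scaffold layout listed for harbor_task_init (instruction.md, task.toml, environment/, tests/) suggests what a structural check could look like. The following is a minimal sketch under that assumption; `missing_task_entries` is a hypothetical helper, and whether harbor_task_validate checks exactly this set of entries is not stated in this README.

```python
from pathlib import Path

# Entries come from the harbor_task_init description. Treating them as the
# full set of required entries is an assumption made for this sketch.
REQUIRED_FILES = ("instruction.md", "task.toml")
REQUIRED_DIRS = ("environment", "tests")


def missing_task_entries(task_dir):
    """Return the required files and directories absent from `task_dir`."""
    root = Path(task_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty return value means the directory has the scaffolded layout. Missing directories are reported with a trailing slash so they are easy to tell apart from files.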
Execution
| Tool | Description |
|---|---|
| harbor_run | Run a Harbor evaluation against a dataset or task path (gated) |
| harbor_run_status | Check job progress (trial completion, overall status) |
| harbor_view | Launch Harbor's web-based results viewer |
Analysis
| Tool | Description |
|---|---|
| harbor_results | Parse job results: pass/fail, reward distribution, durations |
| harbor_trial_inspect | Inspect a trial's trajectory, verifier logs, and reward |
| harbor_compare | Compare two or more jobs for regression detection |
| harbor_failures | Extract failed trials with root cause details |
| harbor_cleanup | Delete old job directories (gated) |
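To make the idea of regression detection between two jobs concrete, here is a small sketch that compares per-trial rewards. The input shape (a mapping from trial name to numeric reward) and the function `find_regressions` are assumptions for illustration only; the actual format of Harbor job results and the logic of harbor_compare are not described in this README.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {trial: (old_reward, new_reward)} for trials whose reward
    dropped by more than `tolerance` between two jobs.

    Both arguments are hypothetical {trial_name: reward} mappings.
    """
    regressions = {}
    for trial, old_reward in baseline.items():
        new_reward = candidate.get(trial)
        if new_reward is None:
            continue  # trial absent from the candidate job; not counted here
        if old_reward - new_reward > tolerance:
            regressions[trial] = (old_reward, new_reward)
    return regressions
```

Trials that improved or stayed the same are ignored, as are trials that exist in only one of the two jobs. A non-zero `tolerance` can absorb small reward fluctuations.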
Python Agent Wrapper
The package includes a Python external agent that bridges Harbor's evaluation framework with Coqui's CLI. This allows Harbor to drive Coqui as the agent under test.
Setup
cd agent
uv pip install -e .
Usage
harbor run \
  -p ./my-tasks \
  --agent-import-path coqui_harbor_agent.agent:CoquiExternalAgent \
  -m anthropic/claude-sonnet-4-20250514
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| COQUI_BIN | coqui | Path to the Coqui binary |
| COQUI_TIMEOUT | 600 | Max seconds per task |
| COQUI_MAX_ITERATIONS | 100 | Agent iteration limit |
| COQUI_MODEL | (from Harbor -m) | Model override |
| COQUI_ROLE | coder | Agent role |
| COQUI_AUTO_APPROVE | true | Auto-approve tool calls |
| COQUI_EXTRA_ARGS | | Additional CLI arguments |
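The variables and defaults in the table can be resolved along these lines. This is a sketch, not the wrapper's actual code: the function `load_coqui_config`, the returned dictionary keys, the set of strings treated as true, and the use of `shlex.split` for COQUI_EXTRA_ARGS are all assumptions.

```python
import os
import shlex


def load_coqui_config(env=None, harbor_model=None):
    """Resolve the documented COQUI_* variables with their documented defaults.

    `harbor_model` stands in for the value passed to Harbor via -m.
    """
    env = os.environ if env is None else env
    return {
        "bin": env.get("COQUI_BIN", "coqui"),
        "timeout": int(env.get("COQUI_TIMEOUT", "600")),
        "max_iterations": int(env.get("COQUI_MAX_ITERATIONS", "100")),
        # COQUI_MODEL overrides the Harbor-provided model when set.
        "model": env.get("COQUI_MODEL") or harbor_model,
        "role": env.get("COQUI_ROLE", "coder"),
        # Accepted truthy spellings are an assumption of this sketch.
        "auto_approve": env.get("COQUI_AUTO_APPROVE", "true").strip().lower()
        in ("1", "true", "yes"),
        # Shell-style splitting of extra arguments is also an assumption.
        "extra_args": shlex.split(env.get("COQUI_EXTRA_ARGS", "")),
    }
```

Passing an explicit `env` mapping keeps the function easy to test. With an empty mapping it returns the defaults from the table, with `model` falling back to `harbor_model`.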
Bundled Skill
The harbor-benchmarking skill provides an operational SOP for running benchmark campaigns — including task creation, evaluation execution, failure triage, regression detection, and reporting. It is auto-discovered when the package is installed.
Bundled Loop
The benchmark loop definition automates a full benchmark cycle:
- Plan — validate tasks, define success criteria, create plan artifact
- Coder — execute benchmark runs, analyze results, create report artifact
- Reviewer — verify completeness, check for regressions, approve or request changes
The loop terminates when the reviewer responds with APPROVED.
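The control flow of such a cycle can be sketched as follows. The three callables, the `max_rounds` cap, and the choice to plan once and feed reviewer feedback back to the coder are all assumptions for illustration; the bundled loop definition itself is not reproduced in this README.

```python
def run_benchmark_loop(plan, code, review, max_rounds=5):
    """Run plan -> coder -> reviewer until the reviewer answers APPROVED.

    `plan`, `code`, and `review` are hypothetical callables standing in for
    the three roles. `max_rounds` is a safety cap added for this sketch.
    """
    plan_artifact = plan()
    feedback = None
    for round_number in range(1, max_rounds + 1):
        report = code(plan_artifact, feedback)
        verdict = review(plan_artifact, report)
        if verdict.strip() == "APPROVED":
            return round_number, report
        feedback = verdict  # requested changes go back to the coder
    raise RuntimeError(f"not approved after {max_rounds} rounds")
```

The function returns the round in which approval happened together with the approved report, and raises if the cap is reached without approval.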
Development
composer install
composer test      # Run Pest tests
composer analyse   # Run PHPStan (level 8)
License
MIT
Other information
- License: MIT
- Last updated: 2026-04-08