Benchmark, evaluate, and compare language model outputs.
Evaluate and benchmark LLM outputs with structured test suites, scoring rubrics, and regression detection, so your AI features maintain quality over time.
claude skill install llm-eval-kit
# Create an eval suite for a classification prompt
/llm-eval-kit create --task "ticket-classification" --samples 20
# Run evaluation and generate a report
/llm-eval-kit run --suite evals/ticket-classification.json --output report.html
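A suite file like the one passed to `--suite` could look something like this. The schema below is illustrative only, not the skill's documented format; field names such as `cases`, `expected`, and `pass_threshold` are assumptions:

```json
{
  "task": "ticket-classification",
  "cases": [
    { "input": "Invoice #4521 was charged twice", "expected": "billing" },
    { "input": "App crashes on login", "expected": "technical" }
  ],
  "rubric": { "metric": "exact_match", "pass_threshold": 0.9 }
}
```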
Builds evaluation datasets from your prompt use cases with expected outputs and scoring criteria. Runs each test case through your prompt pipeline and scores results against the rubric. Tracks scores over time to detect regressions and generates visual reports comparing prompt versions and model performance.
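The run-score-compare loop described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the skill's actual implementation: `run_pipeline` stands in for a real LLM call, the rubric here is plain exact-match scoring, and all names are hypothetical.

```python
# Sketch of an eval loop: run each case through a prompt pipeline,
# score outputs against expected labels, and flag regressions
# against a baseline score. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class TestCase:
    input: str
    expected: str


def run_pipeline(text: str) -> str:
    # Stand-in for the real prompt pipeline (an LLM call in practice).
    return "billing" if "invoice" in text.lower() else "technical"


def score_suite(cases: list[TestCase]) -> float:
    # Exact-match scoring; a real rubric may use graded criteria.
    hits = sum(run_pipeline(c.input) == c.expected for c in cases)
    return hits / len(cases)


def detect_regression(current: float, baseline: float,
                      tolerance: float = 0.05) -> bool:
    # Flag a regression when the score drops by more than the tolerance.
    return baseline - current > tolerance


suite = [
    TestCase("Invoice #4521 was charged twice", "billing"),
    TestCase("App crashes on login", "technical"),
]
score = score_suite(suite)
print(f"accuracy: {score:.2f}")
print("regression:", detect_regression(score, baseline=1.0))
```

Tracking the baseline score per prompt version is what makes regression detection possible: a drop beyond the tolerance signals that a prompt or model change degraded quality.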
Works with Claude Code, Codex, Gemini CLI, Cursor, Windsurf, and Aider.