Self-checks that keep the assistant honest
Run unit-test stubs, rubric scoring, and assertions. Gate outputs on passing checks.
What Evaluation & QA does
Evaluation & QA provides self-checking capabilities that verify outputs before they're delivered. Run unit-test stubs, score against rubrics, and assert conditions. Gate high-risk outputs on passing checks to reduce errors and improve reliability.
Core capabilities
- Run unit-test stubs
- Rubric scoring for quality
- Assertions and conditions
- Gate outputs on pass/fail
Use cases
- Verify code patches build and pass tests
- Score summaries against structure rubrics
- Assert conditions before sending outputs
Who benefits from Evaluation & QA
Individuals
Fewer errors in outputs
Teams & Managers
Quality gates for repeatable runs
Developers & IT
Hook tests into plans
Security & Compliance
Verify before send/write
How it works
Generate tests or rubrics
Define unit-test stubs, rubric criteria, or assertions for the output.
Run evaluation
Use eval.run_unit_tests or eval.score_qa to execute checks.
Collect signals
Gather pass/fail results, scores, and error messages.
Gate or revise
If checks pass, proceed. If they fail, revise or request approval.
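The four steps above can be sketched as a small loop. This is a minimal illustration rather than the product's implementation: CheckResult, gate, the revise callback, and max_revisions are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    """One collected signal: which check ran, whether it passed, and why."""
    name: str
    passed: bool
    message: str = ""


def gate(
    output: str,
    checks: List[Callable[[str], CheckResult]],
    revise: Callable[[str, List[CheckResult]], str],
    max_revisions: int = 2,  # hypothetical limit, not a documented setting
) -> Tuple[str, List[CheckResult], bool]:
    """Run every check; revise on failure; report whether the output passed."""
    results: List[CheckResult] = []
    for attempt in range(max_revisions + 1):
        results = [check(output) for check in checks]
        failures = [r for r in results if not r.passed]
        if not failures:
            return output, results, True
        if attempt < max_revisions:
            output = revise(output, failures)
    # Still failing after all revisions: the caller should request approval.
    return output, results, False


def not_empty(text: str) -> CheckResult:
    return CheckResult("not_empty", bool(text.strip()), "output is empty")


def ends_with_period(text: str) -> CheckResult:
    return CheckResult("ends_with_period", text.rstrip().endswith("."), "missing final period")


def add_period(text: str, failures: List[CheckResult]) -> str:
    return text.rstrip() + "."


final, results, ok = gate("All checks green", [not_empty, ends_with_period], add_period)
```

Returning the results alongside the pass flag keeps the error messages available for the revise step or for a human approver.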
Example workflows
Verify patch builds and tests pass
Quality gate: "Add input validation to module Y and write unit tests"
- code.propose_patch (generate diff)
- eval.run_unit_tests (stub tests): check that the patch builds and the tests pass
- If pass: proceed. If fail: revise or request approval
Patch with passing tests—confidence before merge
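As a rough sketch of what a stub-test gate for this workflow could look like, the example below uses Python's standard unittest module. The signature of eval.run_unit_tests is not documented here, so run_stub_tests stands in for it, and validate_age is a hypothetical patched function.

```python
import io
import unittest


def validate_age(value):
    """Hypothetical patched function: accepts integers from 0 to 150."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("age must be an integer")
    if not 0 <= value <= 150:
        raise ValueError("age out of range")
    return value


class ValidateAgeStubTests(unittest.TestCase):
    """Stub tests generated alongside the patch."""

    def test_accepts_valid_age(self):
        self.assertEqual(validate_age(42), 42)

    def test_rejects_negative(self):
        with self.assertRaises(ValueError):
            validate_age(-1)

    def test_rejects_non_integer(self):
        with self.assertRaises(TypeError):
            validate_age("42")


def run_stub_tests(test_case):
    """Stand-in for eval.run_unit_tests: run one TestCase and collect signals."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    stream = io.StringIO()  # capture runner output instead of printing it
    result = unittest.TextTestRunner(stream=stream, verbosity=0).run(suite)
    return {
        "passed": result.wasSuccessful(),
        "tests_run": result.testsRun,
        "failures": len(result.failures) + len(result.errors),
        "log": stream.getvalue(),
    }


signals = run_stub_tests(ValidateAgeStubTests)
decision = "proceed" if signals["passed"] else "revise or request approval"
```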
Score summary against rubric
Quality gate: "Draft weekly summary; must have intro, bullets, and action items"
- llm.generate (draft summary)
- eval.score_qa (check structure: intro? bullets? actions?)
- If pass: deliver. If fail: revise
Summary that meets quality standards—fewer revisions
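A structure rubric like this one can be approximated with a few deterministic checks. The sketch below is an assumption about how such a rubric might be scored, not the behavior of eval.score_qa. The function score_summary, the criteria heuristics, and the threshold value are all hypothetical.

```python
import re

PASS_THRESHOLD = 1.0  # assumed: every criterion must be met


def score_summary(text):
    """Score a summary on three structural criteria; return (score, per-criterion results)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    checks = {
        # Intro: the first non-empty line is prose, not a bullet.
        "intro": bool(lines) and not lines[0].startswith(("-", "*")),
        # Bullets: at least two bulleted lines.
        "bullets": sum(ln.startswith(("-", "*")) for ln in lines) >= 2,
        # Action items: a line that starts with "Action item(s)".
        "action_items": any(re.match(r"action items?\b", ln, re.IGNORECASE) for ln in lines),
    }
    score = sum(checks.values()) / len(checks)
    return score, checks


draft = """This week the team shipped the validation patch.
- Merged input checks for module Y
- Fixed two flaky tests
Action items: review the rollout plan"""

score, checks = score_summary(draft)
decision = "deliver" if score >= PASS_THRESHOLD else "revise"
```

Keeping the per-criterion results makes it clear which part of the structure to fix when a draft falls short.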
Assert conditions before sending
Quality gate: "Draft email; must be polite and under 200 words"
- llm.generate (draft email)
- eval.score_qa (check: polite tone? word count < 200?)
- If pass: send. If fail: revise
Email that meets criteria—confidence before sending
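The two conditions in this workflow can be expressed as plain assertions. In the sketch below, assert_email is a hypothetical helper, and the politeness check is a crude keyword heuristic standing in for whatever tone check the real scorer applies.

```python
def assert_email(text, max_words=200):
    """Return (ok, failed_conditions) for a drafted email."""
    words = text.split()
    lowered = text.lower()
    conditions = {
        # Word count must be strictly under the limit.
        "under_word_limit": len(words) < max_words,
        # Keyword heuristic only; real tone detection needs more than this.
        "polite_tone": any(p in lowered for p in ("please", "thank you", "thanks")),
    }
    failed = [name for name, ok in conditions.items() if not ok]
    return len(failed) == 0, failed


email = "Hi Sam, could you please review the attached patch? Thank you!"
ok, failed = assert_email(email)
decision = "send" if ok else "revise"
```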
Technical details
Configuration
- TIMEOUTS: test execution timeouts
- ISOLATION_MODE: sandboxed (planned)
- PASS_THRESHOLD: minimum score to pass
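The configuration format is not specified here, so the following is only one way these three settings might be modeled. EvalConfig, the default values, the seconds unit, and the 0 to 1 score scale are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the three settings listed above."""
    timeouts: float = 30.0        # TIMEOUTS: assumed to be seconds per test run
    isolation_mode: str = "none"  # ISOLATION_MODE: "sandboxed" is planned, not yet available
    pass_threshold: float = 0.8   # PASS_THRESHOLD: assumed to be on a 0..1 scale

    def __post_init__(self):
        if self.timeouts <= 0:
            raise ValueError("timeouts must be positive")
        if not 0.0 <= self.pass_threshold <= 1.0:
            raise ValueError("pass_threshold must be between 0 and 1")

    def passes(self, score):
        """A score passes when it meets or exceeds the threshold."""
        return score >= self.pass_threshold


config = EvalConfig(pass_threshold=0.9)
```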
Performance notes
- Unit tests: depends on test complexity
- Rubric scoring: 100-500ms per check
- Assertions: 10-50ms per condition
Observability
- Pass rates and flake metrics
- Score distributions
- Test execution latency
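Pass rates and flake metrics can be derived from raw run records. The sketch below assumes a test counts as flaky when it both passes and fails across repeated runs of the same code. That definition and the summarize_runs helper are illustrative, not taken from the product.

```python
from collections import defaultdict


def summarize_runs(runs):
    """runs: list of (test_name, passed) tuples from repeated runs of the same code."""
    outcomes = defaultdict(list)
    for name, passed in runs:
        outcomes[name].append(passed)

    total = sum(len(v) for v in outcomes.values())
    passed_total = sum(sum(v) for v in outcomes.values())
    # Flaky: the same test produced both outcomes.
    flaky = sorted(name for name, v in outcomes.items() if len(set(v)) > 1)

    return {
        "pass_rate": passed_total / total if total else 0.0,
        "flaky_tests": flaky,
        "flake_rate": len(flaky) / len(outcomes) if outcomes else 0.0,
    }


metrics = summarize_runs([
    ("test_a", True), ("test_a", True),
    ("test_b", True), ("test_b", False),
])
```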
Security posture
Sandboxed execution (planned)
Once available, tests will run in isolated containers to prevent side effects.
Timeouts and limits
Test execution is time-limited to prevent runaway processes.
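One common way to enforce such a limit is to run the test in a child process with a timeout. The sketch below uses Python's standard subprocess module. The run_with_timeout helper is hypothetical and says nothing about how the product enforces its own limits.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_seconds):
    """Run a Python snippet in a child process; report a timeout as a failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "timed_out": True, "stderr": ""}
    return {"passed": proc.returncode == 0, "timed_out": False, "stderr": proc.stderr}


quick = run_with_timeout("assert 1 + 1 == 2", timeout_seconds=30)
runaway = run_with_timeout("import time; time.sleep(5)", timeout_seconds=0.5)
```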
Audit logs
All test runs are logged with pass/fail results and timestamps.
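As an illustration of what one audit entry might contain, the sketch below builds a JSON line with a UTC timestamp. The field names and the audit_record helper are assumptions, since the actual log schema is not described here.

```python
import json
from datetime import datetime, timezone


def audit_record(check_name, passed, detail=""):
    """Build one JSON audit line for a test run (hypothetical schema)."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "check": check_name,
            "result": "pass" if passed else "fail",
            "detail": detail,
        },
        sort_keys=True,
    )


line = audit_record("unit_tests", False, "1 failure in test_rejects_negative")
```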
Local execution
All tests run locally. No network calls unless explicitly configured.
Roadmap & status
Current features
- Unit-test stubs and contracts
- Rubric scoring
- Assertions and conditions
Coming soon
- Containerized test runners
- Richer signal collection (coverage, performance)
- Test result visualization
Ready to add quality gates?
Get started with Evaluation & QA in minutes. Reduce errors with self-checks.