Available now

Self-checks that keep the assistant honest

Run unit-test stubs, score against rubrics, and check assertions. Gate outputs on passing checks.

What Evaluation & QA does

Evaluation & QA provides self-checking capabilities that verify outputs before they're delivered. Run unit-test stubs, score against rubrics, and assert conditions. Gate high-risk outputs on passing checks to reduce errors and improve reliability.

Core capabilities

  • Run unit-test stubs
  • Score outputs against quality rubrics
  • Assert conditions before delivery
  • Gate outputs on pass/fail results

Use cases

  • Verify code patches build and pass tests
  • Score summaries against structure rubrics
  • Assert conditions before sending outputs

Sandboxed execution (planned): Tests run in isolated containers to prevent side effects.

Who benefits from Evaluation & QA

Individuals

Fewer errors in outputs

Example: "Draft summary passes structure rubric; else revise" — quality gates before delivery.

Teams & Managers

Quality gates for repeatable runs

Example: Ensure all team outputs meet quality standards before delivery.

Developers & IT

Hook tests into plans

Example: "Patch builds and tests pass; else request approval or fix" — CI-like workflows.

Security & Compliance

Verify before send/write

Control: Gate high-risk outputs on passing checks. Reduce errors and compliance risks.

How it works

1. Generate tests or rubrics: define unit-test stubs, rubric criteria, or assertions for the output.

2. Run evaluation: use eval.run_unit_tests or eval.score_qa to execute the checks.

3. Collect signals: gather pass/fail results, scores, and error messages.

4. Gate or revise: if the checks pass, proceed; if they fail, revise or request approval.

Isolation: Sandboxed execution (planned) prevents side effects from test runs.
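
A minimal sketch of that loop in Python, assuming a generic run_tool helper that dispatches to the eval.* tools. Only the tool names (eval.run_unit_tests, eval.score_qa, eval.collect_signals) and the PASS_THRESHOLD setting come from this page; the helper, its arguments, and the result fields are illustrative.

    def run_tool(name, **args):
        # Stand-in for the real tool runner; returns a canned result so the
        # sketch runs on its own. Replace with your actual dispatch.
        return {"passed": True, "score": 0.9, "errors": []}

    PASS_THRESHOLD = 0.8  # minimum rubric score to pass (see Configuration)

    def gate_output(draft, rubric, tests):
        # Step 1 happens upstream: the caller defines `tests` and `rubric`.
        # Step 2: run evaluation.
        test_result = run_tool("eval.run_unit_tests", tests=tests)
        qa_result = run_tool("eval.score_qa", output=draft, rubric=rubric)
        # Step 3: collect signals (pass/fail, scores, error messages).
        signals = run_tool("eval.collect_signals", results=[test_result, qa_result])
        # Step 4: gate or revise.
        if test_result["passed"] and qa_result["score"] >= PASS_THRESHOLD:
            return {"status": "pass", "output": draft, "signals": signals}
        return {"status": "fail", "signals": signals}  # revise or request approval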

Example workflows

Verify patch builds and tests pass

Quality gate
Input:

"Add input validation to module Y and write unit tests"

Steps:
  1. code.propose_patch (generate diff)
  2. eval.run_unit_tests (stub tests) to check that the patch builds and the tests pass
  3. If pass: proceed. If fail: revise or request approval
Output:

Patch with passing tests—confidence before merge
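
A sketch of this gate as code, reusing the run_tool stand-in from the sketch above. The code.propose_patch and eval.run_unit_tests names match the steps listed here; the arguments, result fields, and single-revision retry are assumptions.

    def patch_with_quality_gate(task, max_revisions=1):
        patch = run_tool("code.propose_patch", task=task)            # 1. generate diff
        for _ in range(max_revisions + 1):
            result = run_tool("eval.run_unit_tests", patch=patch)    # 2. build and run stub tests
            if result["passed"]:
                return {"status": "ready_to_merge", "patch": patch}  # 3. pass: proceed
            # Fail: revise with the error messages, then escalate if still failing.
            patch = run_tool("code.propose_patch", task=task, feedback=result["errors"])
        return {"status": "needs_approval", "patch": patch, "errors": result["errors"]}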

Score summary against rubric

Quality gate
Input:

"Draft weekly summary; must have intro, bullets, and action items"

Steps:
  1. llm.generate (draft summary)
  2. eval.score_qa (check structure: intro? bullets? actions?)
  3. If pass: deliver. If fail: revise
Output:

Summary that meets quality standards—fewer revisions
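
A sketch of this rubric check, again reusing the run_tool stand-in from above. Only eval.score_qa and the idea of a pass threshold come from this page; the rubric keys and the score/failed fields are illustrative.

    SUMMARY_RUBRIC = {
        "intro": "Opens with a one- or two-sentence overview",
        "bullets": "Key points are listed as bullets",
        "action_items": "Ends with explicit action items",
    }

    def summary_gate(draft, threshold=0.8):
        result = run_tool("eval.score_qa", output=draft, rubric=SUMMARY_RUBRIC)
        if result["score"] >= threshold:
            return {"status": "deliver", "summary": draft}
        # Revise, feeding back whichever criteria were missed.
        return {"status": "revise", "missing": result.get("failed", [])}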

Assert conditions before sending

Quality gate
Input:

"Draft email; must be polite and under 200 words"

Steps:
  1. llm.generate (draft email)
  2. eval.score_qa (check: polite tone? word count < 200?)
  3. If pass: send. If fail: revise
Output:

Email that meets criteria—confidence before sending
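
Checks like these can also be written as plain assertions. The sketch below is ordinary Python with no tool calls; it shows the kind of conditions eval.assert or eval.score_qa might encode, with a deliberately naive phrase list standing in for a real tone check.

    BANNED_PHRASES = ("asap or else", "do it now")  # naive stand-in for a politeness check

    def email_checks(draft: str) -> list[str]:
        failures = []
        if len(draft.split()) >= 200:
            failures.append("must be under 200 words")
        if any(phrase in draft.lower() for phrase in BANNED_PHRASES):
            failures.append("tone check failed")
        return failures

    draft = "Hi team, a quick update on the rollout ..."
    failures = email_checks(draft)
    print("send" if not failures else f"revise: {failures}")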

Technical details

Key tools

  • eval.run_unit_tests
  • eval.score_qa
  • eval.assert
  • eval.collect_signals

Configuration

  • TIMEOUTS — test execution timeouts
  • ISOLATION_MODE — sandboxed (planned)
  • PASS_THRESHOLD — minimum score to pass
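
How these are set depends on your deployment. Below is a sketch assuming they are exposed as environment variables; only the three names above come from this page, while the variable format and defaults are assumptions.

    import os

    TIMEOUTS = int(os.environ.get("TIMEOUTS", "30"))                  # seconds per test run (assumed default)
    ISOLATION_MODE = os.environ.get("ISOLATION_MODE", "none")         # "sandboxed" once containerized runners ship
    PASS_THRESHOLD = float(os.environ.get("PASS_THRESHOLD", "0.8"))   # minimum score to pass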

Performance notes

  • Unit tests: depends on test complexity
  • Rubric scoring: 100-500ms per check
  • Assertions: 10-50ms per condition

Observability

  • Pass rates and flake metrics
  • Score distributions
  • Test execution latency

Security posture

Sandboxed execution (planned)

Tests run in isolated containers to prevent side effects.

Timeouts and limits

Test execution is time-limited to prevent runaway processes.

Audit logs

All test runs logged with pass/fail results and timestamps.

Local execution

All tests run locally. No network calls unless explicitly configured.

Roadmap & status

Available

Current features

  • Unit-test stubs and contracts
  • Rubric scoring
  • Assertions and conditions

Planned

Coming soon

  • Containerized test runners
  • Richer signal collection (coverage, performance)
  • Test result visualization


Ready to add quality gates?

Get started with Evaluation & QA in minutes. Reduce errors with self-checks.