Available now

Self-checks that keep the assistant honest

Run unit-test stubs, rubric scoring, and assertions. Gate outputs on passing checks.

What Evaluation & QA does

Evaluation & QA provides self-checking capabilities that verify outputs before they're delivered. Run unit-test stubs, score against rubrics, and assert conditions. Gate high-risk outputs on passing checks to reduce errors and improve reliability.

Core capabilities

Run unit-test stubs
Rubric scoring for quality
Assertions and conditions
Gate outputs on pass/fail

Use cases

Verify code patches build and pass tests
Score summaries against structure rubrics
Assert conditions before sending outputs

Sandboxed execution (planned): Tests will run in isolated containers to prevent side effects.

Who benefits from Evaluation & QA

Individuals

Fewer errors in outputs

Example: "Draft summary passes structure rubric; else revise" — quality gates before delivery.

Teams & Managers

Quality gates for repeatable runs

Example: Ensure all team outputs meet quality standards before delivery.

Developers & IT

Hook tests into plans

Example: "Patch builds and tests pass; else request approval or fix" — CI-like workflows.

Security & Compliance

Verify before send/write

Control: Gate high-risk outputs on passing checks. Reduce errors and compliance risks.

How it works

Generate tests or rubrics

Define unit-test stubs, rubric criteria, or assertions for the output.

Run evaluation

Use eval.run_unit_tests, eval.score_qa, or eval.assert to execute checks.

Collect signals

Use eval.collect_signals to gather pass/fail results, scores, and error messages.

Gate or revise

If checks pass, proceed. If they fail, revise or request approval.

Isolation: Sandboxed execution (planned) will prevent side effects from test runs.
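
The four steps above can be sketched as a small loop. This is a minimal illustration, not the product's implementation: CheckResult, collect_signals, gate_or_revise, and the max_revisions parameter are hypothetical names chosen for this example, and the checks are plain Python callables rather than the actual eval.* tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    """One signal from one check (hypothetical shape)."""
    name: str
    passed: bool
    message: str = ""


# A check takes the candidate output and returns a CheckResult.
Check = Callable[[str], CheckResult]
# A reviser takes the output plus the failed checks and returns a new output.
Reviser = Callable[[str, List[CheckResult]], str]


def collect_signals(output: str, checks: List[Check]) -> List[CheckResult]:
    """Run every check and gather the results."""
    return [check(output) for check in checks]


def gate_or_revise(output: str, checks: List[Check], revise: Reviser,
                   max_revisions: int = 2) -> Tuple[str, str]:
    """Proceed if all checks pass; otherwise revise, then fall back to approval."""
    for attempt in range(max_revisions + 1):
        failures = [r for r in collect_signals(output, checks) if not r.passed]
        if not failures:
            return output, "passed"
        if attempt < max_revisions:
            output = revise(output, failures)
    return output, "needs_approval"


# Toy check and reviser, used only to exercise the loop.
def ends_with_period(text: str) -> CheckResult:
    ok = text.rstrip().endswith(".")
    return CheckResult("ends_with_period", ok, "" if ok else "missing final period")


def add_period(text: str, failures: List[CheckResult]) -> str:
    return text.rstrip() + "."


final, status = gate_or_revise("Draft summary", [ends_with_period], add_period)
print(final, status)
```

Capping the number of revisions keeps a failing output from looping indefinitely; once the cap is reached, the sketch hands the decision to a human by returning "needs_approval".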

Example workflows

Verify patch builds and tests pass

Quality gate

Input:

"Add input validation to module Y and write unit tests"

Steps:

code.propose_patch (generate diff)
eval.run_unit_tests (stub tests): check that the patch builds and the tests pass
If pass: proceed. If fail: revise or request approval

Output:

Patch with passing tests—confidence before merge
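
As a rough sketch of what the test step in this workflow might look like, the example below runs a small test suite against a patched function using Python's standard unittest module and turns the result into a gate decision. The function validate_age, the test class, and the run_unit_tests helper are all hypothetical stand-ins; the real eval.run_unit_tests schema is not shown on this page.

```python
import io
import unittest


def validate_age(value):
    """Hypothetical patched function: the input validation added to 'module Y'."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("age must be an integer")
    if not 0 <= value <= 150:
        raise ValueError("age must be between 0 and 150")
    return value


class ValidateAgeTests(unittest.TestCase):
    """Unit tests written alongside the patch."""

    def test_accepts_valid_age(self):
        self.assertEqual(validate_age(42), 42)

    def test_rejects_negative(self):
        with self.assertRaises(ValueError):
            validate_age(-1)

    def test_rejects_non_integer(self):
        with self.assertRaises(TypeError):
            validate_age("42")


def run_unit_tests(test_case):
    """Run one TestCase class and return a small pass/fail report."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    # Send runner output to a buffer so only the report is surfaced.
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    result = runner.run(suite)
    return {
        "passed": result.wasSuccessful(),
        "tests_run": result.testsRun,
        "failures": len(result.failures),
        "errors": len(result.errors),
    }


report = run_unit_tests(ValidateAgeTests)
decision = "proceed" if report["passed"] else "revise or request approval"
print(report, decision)
```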

Score summary against rubric

Quality gate

Input:

"Draft weekly summary; must have intro, bullets, and action items"

Steps:

llm.generate (draft summary)
eval.score_qa (check structure: intro? bullets? actions?)
If pass: deliver. If fail: revise

Output:

Summary that meets quality standards—fewer revisions
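
A structure rubric like the one in this workflow can be approximated with a few text checks. The sketch below is an assumption-laden illustration: the three criterion functions, the RUBRIC mapping, and the choice of 1.0 as PASS_THRESHOLD (every criterion must hold) are made up for the example and do not describe how eval.score_qa scores drafts.

```python
import re

# Assumed threshold: all three structural criteria must be met.
PASS_THRESHOLD = 1.0


def has_intro(text):
    """The first non-empty line is prose rather than a bullet."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and not lines[0].startswith(("-", "*"))


def has_bullets(text):
    """At least two bullet lines starting with '-' or '*'."""
    return len(re.findall(r"^\s*[-*]\s+\S", text, flags=re.MULTILINE)) >= 2


def has_action_items(text):
    """A line that begins with 'Action item' or 'Action items'."""
    pattern = r"^\s*action items?\b"
    return re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE) is not None


RUBRIC = {
    "intro": has_intro,
    "bullets": has_bullets,
    "action_items": has_action_items,
}


def score_structure(text):
    """Return (score in [0, 1], per-criterion results)."""
    results = {name: check(text) for name, check in RUBRIC.items()}
    return sum(results.values()) / len(results), results


draft = """This week the team shipped the validation patch.
- Merged input validation for module Y
- Added unit tests
Action items:
- Review the flaky test report
"""

score, results = score_structure(draft)
decision = "deliver" if score >= PASS_THRESHOLD else "revise"
print(score, results, decision)
```

Returning the per-criterion results alongside the score gives the revise step something concrete to act on, for example "action_items: False".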

Assert conditions before sending

Quality gate

Input:

"Draft email; must be polite and under 200 words"

Steps:

llm.generate (draft email)
eval.score_qa (check: polite tone?)
eval.assert (check: word count < 200)
If pass: send. If fail: revise

Output:

Email that meets criteria—confidence before sending
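
The two conditions in this workflow differ in kind: word count is a hard assertion, while tone needs some form of scoring. The sketch below illustrates the gate with a deliberately crude tone check. The POLITE_MARKERS keyword list is a hypothetical stand-in for real tone scoring, and check_email is an illustrative helper, not a product API.

```python
MAX_WORDS = 200

# Crude, hypothetical stand-in for tone scoring: look for a courteous phrase.
POLITE_MARKERS = ("please", "thank you", "thanks", "kind regards", "best regards")


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def check_email(text):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    count = word_count(text)
    if not count < MAX_WORDS:
        failures.append(f"word count {count} is not under {MAX_WORDS}")
    lowered = text.lower()
    if not any(marker in lowered for marker in POLITE_MARKERS):
        failures.append("no polite marker found")
    return failures


email = (
    "Hi Sam,\n\n"
    "Could you please review the attached patch by Friday?\n\n"
    "Thank you,\nAlex"
)

failures = check_email(email)
decision = "send" if not failures else "revise"
print(failures, decision)
```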

Technical details

Key tools

eval.run_unit_tests
eval.score_qa
eval.assert
eval.collect_signals

Configuration

TIMEOUTS — test execution timeouts
ISOLATION_MODE — sandboxed (planned)
PASS_THRESHOLD — minimum score to pass
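
One way to picture these three settings is as a small validated config object. Everything about the shape below is an assumption made for illustration: the page does not specify value formats, so the per-check timeout mapping, the "none" isolation value, and the 0.8 default threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


def _default_timeouts() -> Dict[str, float]:
    # Hypothetical per-check timeouts, in seconds.
    return {"unit_tests": 60.0, "rubric": 5.0, "assertion": 1.0}


@dataclass(frozen=True)
class EvalConfig:
    """Illustrative container for TIMEOUTS, ISOLATION_MODE, and PASS_THRESHOLD."""
    timeouts: Dict[str, float] = field(default_factory=_default_timeouts)
    isolation_mode: str = "none"  # assumed value; "sandboxed" is listed as planned
    pass_threshold: float = 0.8   # assumed scale: scores in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.pass_threshold <= 1.0:
            raise ValueError("pass_threshold must be between 0 and 1")
        if any(seconds <= 0 for seconds in self.timeouts.values()):
            raise ValueError("timeouts must be positive")

    def passes(self, score: float) -> bool:
        """A score passes when it meets or exceeds the threshold."""
        return score >= self.pass_threshold


config = EvalConfig()
print(config.passes(0.85), config.passes(0.5))
```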

Performance notes

Unit tests: latency depends on test complexity
Rubric scoring: 100-500ms per check
Assertions: 10-50ms per condition

Observability

Pass rates and flake metrics
Score distributions
Test execution latency
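
To make the first of these metrics concrete, the sketch below computes a pass rate and a simple flake signal from a list of run records. The record fields (test, revision, passed) and the definition of a flaky test as one that both passed and failed on the same revision are assumptions for this example, not the product's definitions.

```python
from collections import defaultdict


def pass_rate(runs):
    """Fraction of runs that passed; 0.0 when there are no runs."""
    if not runs:
        return 0.0
    return sum(1 for run in runs if run["passed"]) / len(runs)


def flaky_tests(runs):
    """Tests that produced both outcomes on the same revision."""
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run["test"], run["revision"])].add(run["passed"])
    return sorted({test for (test, _rev), seen in outcomes.items() if len(seen) == 2})


# Hypothetical run history.
runs = [
    {"test": "test_parse", "revision": "abc123", "passed": True},
    {"test": "test_parse", "revision": "abc123", "passed": False},
    {"test": "test_render", "revision": "abc123", "passed": True},
    {"test": "test_render", "revision": "abc123", "passed": True},
]

print(pass_rate(runs), flaky_tests(runs))
```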

Security posture

Sandboxed execution (planned)

Once available, tests will run in isolated containers to prevent side effects.

Timeouts and limits

Test execution is time-limited to prevent runaway processes.
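
A time limit of this kind can be illustrated with the timeout parameter of Python's standard subprocess.run, which stops the child process and raises subprocess.TimeoutExpired when the limit is exceeded. The run_with_timeout helper and its report shape are hypothetical; the page does not describe how limits are enforced internally.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run a command; treat a timeout as a failed check rather than hanging."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"passed": False, "timed_out": True, "returncode": None}
    return {
        "passed": proc.returncode == 0,
        "timed_out": False,
        "returncode": proc.returncode,
    }


# A quick command finishes well inside its limit.
fast = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)

# A sleeping command stands in for a runaway test and is cut off.
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)

print(fast, slow)
```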

Audit logs

All test runs are logged with pass/fail results and timestamps.
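
As an illustration of what one audit entry could contain, the sketch below builds a JSON record with a UTC timestamp and a pass/fail result. The field names and the one-JSON-object-per-run format are assumptions; the actual log schema is not documented on this page.

```python
import json
from datetime import datetime, timezone


def audit_record(check, passed, detail="", now=None):
    """Serialize one test run as a JSON line with a UTC ISO 8601 timestamp."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = {
        "timestamp": timestamp,
        "check": check,
        "passed": passed,
        "detail": detail,
    }
    return json.dumps(entry, sort_keys=True)


# A fixed time is passed in here only to keep the example reproducible.
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
line = audit_record("structure_rubric", False, "missing action items", now=fixed_time)
print(line)
```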

Local execution

All tests run locally. No network calls unless explicitly configured.

Roadmap & status

Available

Current features

Unit-test stubs and contracts
Rubric scoring
Assertions and conditions

Planned

Coming soon

Containerized test runners
Richer signal collection (coverage, performance)
Test result visualization

Frequently asked questions

Ready to add quality gates?

Get started with Evaluation & QA in minutes. Reduce errors with self-checks.

Frequently asked questions

How do I define a rubric?

Can I run real unit tests?

What happens if a test fails?

Can I gate outputs on passing checks?

Are test results logged?

Can I use custom test frameworks?

How do I prevent flaky tests?