
Agentic Eval Skill for Extensibility and Maintainability #4519

Open
dennishuo wants to merge 1 commit into apache:main from dennishuo:dhuo-polaris-extensibility-eval-skill

dennishuo commented May 21, 2026

More detailed proposal doc here: https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0

Mailing list discussion here: https://lists.apache.org/thread/518o8q58jnyd70gcok6j5mw9t4nco687

Adds .agents/skills/polaris-extensibility-eval/ — an A/B harness for measuring agentic-development impact of repo changes (AGENTS.md edits, extension-surface refactors, etc.).

The skill spawns fresh, context-free coding-agent subprocesses (claude-cli / codex / cursor) in scrubbed-env worktrees pinned to BEFORE and AFTER refs, runs concrete tasks, captures verifier verdicts and per-cell cost/wall/token usage, and reports A/B deltas across (task × arm × model × cli) cells.
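
A minimal sketch of how the cell matrix and the scrubbed worktree setup described above could be laid out. `CellPlanner` and its methods are hypothetical names rather than the skill's actual code, and passing only `HOME` and `PATH` to the worker is an assumption; `git worktree add --detach` is the standard Git command for checking out a ref into a separate directory.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative, not taken from the skill.
final class CellPlanner {

  // One cell per (task, arm, model) combination, e.g. "T-priv-add/before/haiku".
  static List<String> cellIds(List<String> tasks, List<String> arms, List<String> models) {
    List<String> cells = new ArrayList<>();
    for (String task : tasks) {
      for (String arm : arms) {
        for (String model : models) {
          cells.add(task + "/" + arm + "/" + model);
        }
      }
    }
    return cells;
  }

  // Command that pins a fresh worktree to a fixed ref (BEFORE or AFTER).
  static List<String> worktreeCommand(Path worktree, String ref) {
    return List.of("git", "worktree", "add", "--detach", worktree.toString(), ref);
  }

  // Assumption: the worker sees only a throwaway HOME and a PATH.
  static Map<String, String> scrubbedEnv(Path throwawayHome, String path) {
    Map<String, String> env = new LinkedHashMap<>();
    env.put("HOME", throwawayHome.toString());
    env.put("PATH", path);
    return env;
  }

  // Builds (but does not start) a worker process with the inherited environment wiped.
  static ProcessBuilder workerProcess(
      List<String> cliCommand, Path worktree, Map<String, String> env) {
    ProcessBuilder pb = new ProcessBuilder(cliCommand).directory(worktree.toFile());
    pb.environment().clear();
    pb.environment().putAll(env);
    return pb;
  }
}
```

With one task, two arms, and three models, `cellIds` yields six cells, which matches the size of the proof-of-concept run reported here.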

Ran a minimal meta-eval with Claude Opus 4.7 as a proof of concept. It exercises only the "add privilege" task against a "sample change" to AGENTS.md (main...dennishuo:polaris:dhuo-polaris-eval-test-A2) and produced the following results:

## Task & fixture

- **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
  `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
  ensure compile + `*PolarisAuthorizer*` tests pass without modifying
  any test file. The task is a *probe* of the authorizer SPI: a naive
  one-file edit (enum only) trips the static initializer in
  `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
  change (enum + register call) passes.
- **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
- **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines,
  a "Recipes for Common Extension Tasks" section that explicitly tells
  agents to also add the `RbacOperationSemantics.register(...)` call).
  The fixture only changes `AGENTS.md`; no source code differs between
  BEFORE and AFTER.

The task's deterministic verifier runs out-of-band from the worker
agent (separate `bash` subprocess after the worker's transcript is
captured) so worker self-reports cannot fake a PASS.
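
One way to picture the out-of-band verifier, as a sketch under assumptions: the verdict is taken purely from the exit code of a separate `bash` subprocess run in the worktree, so nothing the worker printed can influence it. `OutOfBandVerifier` is a hypothetical name, and the verifier command itself is left as a parameter because the task file's contents are not shown here.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch; not the skill's actual verifier.
final class OutOfBandVerifier {
  enum Verdict { PASS, FAIL }

  // Runs the verifier command in its own bash subprocess inside the worktree.
  // Only the exit code decides the verdict; the worker's transcript is never read.
  static Verdict verify(File worktree, String verifierCommand) {
    try {
      Process process = new ProcessBuilder(List.of("bash", "-c", verifierCommand))
          .directory(worktree)
          .redirectErrorStream(true)
          .redirectOutput(ProcessBuilder.Redirect.DISCARD)
          .start();
      return process.waitFor() == 0 ? Verdict.PASS : Verdict.FAIL;
    } catch (IOException e) {
      throw new UncheckedIOException("could not launch verifier", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Verdict.FAIL; // an interrupted verification is never a PASS
    }
  }
}
```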

## Headline results

| Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
|------|---------|---------:|-----------:|-----------:|------:|---------------|
| haiku-base   | PASS  | 270 | $0.362 |  9374 | 59 | 2 (enum + Rbac) |
| haiku-after  | PASS  | 157 | $0.226 |  5657 | 36 | 2 (enum + Rbac) |
| opus-base    | PASS  | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
| opus-after   | PASS  | 124 | $0.854 |  5150 | 15 | 2 (enum + Rbac) |
| codex-base   | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
| codex-after  | PASS  |  39 | n/a | n/a | n/a | 2 (enum + Rbac) |

Per-model deltas (BEFORE → AFTER; negative means the AFTER arm was cheaper or faster):

| Model  | Wall Δ | Cost Δ  | Turns Δ | Verdict Δ |
|--------|-------:|--------:|--------:|-----------|
| haiku  | -42%   | -38%    | -39%    | PASS → PASS (soft-improvement) |
| opus   | -39%   | -42%    | -38%    | PASS → PASS (soft-improvement) |
| codex  |  +5%   | n/a     | n/a     | **FAIL → PASS** (hard improvement) |
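
The percentages in this table are plain relative changes, (AFTER − BEFORE) / BEFORE, computed from the headline results. A short sketch of that arithmetic follows; `ArmDelta` is a hypothetical name, and the "regression" and "no change" labels are assumptions added for completeness, since the table only shows the two improvement cases.

```java
// Hypothetical helper reproducing the delta arithmetic from the tables.
final class ArmDelta {

  // Relative change from BEFORE to AFTER, in percent; negative means AFTER was lower.
  static double pctDelta(double before, double after) {
    return (after - before) / before * 100.0;
  }

  // Labels the verdict transition; "regression" and "no change" are assumed names.
  static String verdictDelta(boolean beforePass, boolean afterPass, double costDeltaPct) {
    if (!beforePass && afterPass) return "hard improvement";
    if (beforePass && !afterPass) return "regression";
    if (beforePass && costDeltaPct < 0) return "soft-improvement";
    return "no change";
  }

  public static void main(String[] args) {
    // Inputs are the wall, cost, and turn counts from the headline table.
    System.out.printf("haiku wall  %+.1f%%%n", pctDelta(270, 157));
    System.out.printf("haiku cost  %+.1f%%%n", pctDelta(0.362, 0.226));
    System.out.printf("haiku turns %+.1f%%%n", pctDelta(59, 36));
    System.out.printf("opus  wall  %+.1f%%%n", pctDelta(204, 124));
    System.out.printf("opus  cost  %+.1f%%%n", pctDelta(1.481, 0.854));
    System.out.printf("codex wall  %+.1f%%%n", pctDelta(37, 39));
  }
}
```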

Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
verdict-flip + two consistent ~40% cost reductions on the same
task — clear, replicable signal that the AGENTS.md recipe addition is
agent-load-bearing.

## What's interesting

### 1. Discriminating verdict-flip on codex-base

Codex's sandbox is stricter than claude's about workspace HOME — it
blocked the gradle wrapper from downloading its distribution
(`~/.gradle` was the throwaway `mktemp` HOME, no cache, no network).
**Codex never ran a successful verification step itself.** With no
local feedback signal, codex-base shipped a one-file diff (enum-only)
and reported "DONE", confident it was complete.

Codex-after, given the explicit AGENTS.md recipe, edited both files
without ever running gradle. The harness's out-of-band verifier ran
both diffs against a real Java 21 + populated gradle cache, and:

- codex-base verifier: `BUILD FAILED` — 4 tests failed because
  `RbacOperationSemantics`'s static initializer aborted at class load
  with `Missing RBAC semantics for operations: [LIST_NAMESPACE_TABLES_RECURSIVE]`
- codex-after verifier: `BUILD SUCCESSFUL` — all
  `*PolarisAuthorizer*` tests pass.

This is exactly the verdict-discriminating result the harness is
designed to surface: an AGENTS.md change with no functional code
delta turned a deterministic FAIL into a deterministic PASS for one
of the three agents tested. Reviewable, reproducible, attributable.

### 2. Soft-improvement on claude arms (no leak)

Both claude haiku-base and opus-base PASSed without seeing the
AGENTS.md recipe. They derived the RbacOperationSemantics gotcha
from code:

- Haiku-base ran `./gradlew :polaris-core:test "*PolarisAuthorizer*"`,
  saw the static initializer's `Missing RBAC semantics for operations`
  exception, then edited the second file. 59 turns of search /
  test / iterate.
- Opus-base inspected `RbacOperationSemanticsTest` directly, noticed
  the `allOperationsHaveExplicitRbacSemantics` assertion, and edited
  both files in 24 turns.

Both did so **without any leak from the AFTER recipe**: I grepped each
arm's full transcript for the recipe's distinctive phrases
("recipe", "extension recipes", "view-snapshot", "attribute key")
and found zero hits in the BEFORE arms. Hygiene mitigation (a)
(fresh-process CLI dispatch with scrubbed HOME) held.
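
The leak check amounts to a substring scan over each BEFORE transcript. A minimal sketch follows, assuming the transcript is available as a single string; `LeakCheck` is a hypothetical name, and the case-insensitive match is an assumption chosen to make the check stricter, not necessarily how the grep was run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; mirrors a case-insensitive grep for fixed phrases.
final class LeakCheck {

  // Returns the marker phrases that occur anywhere in the transcript.
  static List<String> leakedPhrases(String transcript, List<String> phrases) {
    String haystack = transcript.toLowerCase(Locale.ROOT);
    List<String> hits = new ArrayList<>();
    for (String phrase : phrases) {
      if (haystack.contains(phrase.toLowerCase(Locale.ROOT))) {
        hits.add(phrase);
      }
    }
    return hits;
  }
}
```

An empty result for every BEFORE transcript corresponds to the "zero hits" outcome reported above.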

The cost reduction on the AFTER arms comes from the recipe shortening
the investigation: the agents go straight to the right answer without
a discovery loop. The ~40% reduction in cost, time, and turns across
both claude models is the value of the documentation change. The
change is reviewable *because* of the cost delta, not in spite of it.

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

Initial seed task bank targets the highest-friction extension surfaces
identified from PR / mailing-list research: authorizer SPI / RBAC,
events listener architecture, federation factory, predefined policy
registry. Includes a discriminating typed-attribute task that exercises
Java type-erasure / TypeToken handling.
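
For readers unfamiliar with the type-erasure issue, here is a generic sketch of the "super type token" technique that TypeToken-style classes rely on. It illustrates the language mechanism only and says nothing about how the seed task or Polaris implements typed attributes; `TypeRef` is a hypothetical name.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

// Hypothetical super-type-token class. An anonymous subclass records its generic
// superclass in class metadata, which survives erasure and can be read reflectively.
abstract class TypeRef<T> {
  private final Type type;

  protected TypeRef() {
    Type superclass = getClass().getGenericSuperclass();
    if (!(superclass instanceof ParameterizedType)) {
      throw new IllegalStateException("create TypeRef as an anonymous subclass with a type argument");
    }
    this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
  }

  Type type() {
    return type;
  }
}

final class ErasureDemo {
  // At runtime both lists share one class object: the element type is erased.
  static boolean sameRuntimeClass() {
    return new ArrayList<String>().getClass() == new ArrayList<Integer>().getClass();
  }

  // The token keeps the full parameterized type, including the element type.
  static Type listOfStringType() {
    return new TypeRef<List<String>>() {}.type();
  }
}
```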

This commit only adds the skill itself. .agents/ remains gitignored
for other tools; only this skill subdir is unignored. .meta-eval/
run artifacts stay gitignored.