Agentic Eval Skill for Extensibility and Maintainability by dennishuo · Pull Request #4519 · apache/polaris

dennishuo · 2026-05-21T09:12:37Z

More detailed proposal doc here: https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0

Mailing list discussion here: https://lists.apache.org/thread/518o8q58jnyd70gcok6j5mw9t4nco687

Adds .agents/skills/polaris-extensibility-eval/ — an A/B harness for measuring agentic-development impact of repo changes (AGENTS.md edits, extension-surface refactors, etc.).

The skill spawns fresh, context-free coding-agent subprocesses (claude-cli / codex / cursor) in scrubbed-env worktrees pinned to BEFORE and AFTER refs, runs concrete tasks, captures verifier verdicts and per-cell cost/wall/token usage, and reports A/B deltas across (task × arm × model × cli) cells.
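The isolation step described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the helper names (`scrubbed_env`, `add_pinned_worktree`, `run_cell`), the environment allow-list, and the timeout are all assumptions made for the example.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def scrubbed_env(home: Path) -> dict[str, str]:
    """Build a minimal environment with a throwaway HOME.

    The allow-list below is hypothetical; a real harness may keep more or less.
    """
    keep = ("PATH", "LANG", "TERM")
    env = {key: os.environ[key] for key in keep if key in os.environ}
    env["HOME"] = str(home)
    return env


def add_pinned_worktree(repo: Path, ref: str, dest: Path) -> None:
    """Check out `ref` into a detached git worktree at `dest`."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), ref],
        check=True,
    )


def run_cell(repo: Path, ref: str, agent_cmd: list[str],
             timeout_s: int = 1800) -> subprocess.CompletedProcess:
    """Run one agent command in a fresh worktree with a fresh HOME."""
    scratch = Path(tempfile.mkdtemp(prefix="eval-cell-"))
    worktree = scratch / "worktree"
    home = scratch / "home"
    home.mkdir()
    add_pinned_worktree(repo, ref, worktree)
    return subprocess.run(
        agent_cmd,
        cwd=worktree,
        env=scrubbed_env(home),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

Building the environment from an allow-list, rather than copying `os.environ` and deleting known-sensitive keys, means that a variable nobody thought about is dropped by default.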

As a proof of concept, I ran a minimal meta-eval with Claude Opus 4.7. It exercises only the "add privilege" task against a sample change to AGENTS.md (main...dennishuo:polaris:dhuo-polaris-eval-test-A2), with the following results:

## Task & fixture

- **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
  `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
  ensure compile + `*PolarisAuthorizer*` tests pass without modifying
  any test file. The task is a *probe* of the authorizer SPI: a naive
  one-file edit (enum only) trips the static initializer in
  `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
  change (enum + register call) passes.
- **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
- **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines —
  "Recipes for Common Extension Tasks" section that explicitly tells
  agents to also edit `RbacOperationSemantics.register(...)`). The
  fixture only changes `AGENTS.md`; no source code differs between
  BEFORE and AFTER. Cells named `-base` below ran at the BEFORE ref.

The task's deterministic verifier runs out-of-band from the worker
agent (separate `bash` subprocess after the worker's transcript is
captured) so worker self-reports cannot fake a PASS.
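A minimal sketch of that separation, assuming the verdict is derived purely from the verifier command's exit status. The function name `verify`, its parameters, and the timeout-means-FAIL rule are assumptions for illustration, not the harness's actual interface.

```python
import subprocess
from pathlib import Path


def verify(worktree: Path, verifier_cmd: list[str], timeout_s: int = 1800) -> str:
    """Return 'PASS' or 'FAIL' based only on the verifier's exit status.

    The worker's transcript and self-reported status are never consulted here.
    """
    try:
        result = subprocess.run(
            verifier_cmd,
            cwd=worktree,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a verifier that hangs counts as a failure.
        return "FAIL"
    return "PASS" if result.returncode == 0 else "FAIL"
```

Because the verdict comes from a process the worker never controls, a worker that prints "DONE" without a passing build still records a FAIL.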

## Headline results

| Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
|------|---------|---------:|-----------:|-----------:|------:|---------------|
| haiku-base   | PASS  | 270 | $0.362 |  9374 | 59 | 2 (enum + Rbac) |
| haiku-after  | PASS  | 157 | $0.226 |  5657 | 36 | 2 (enum + Rbac) |
| opus-base    | PASS  | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
| opus-after   | PASS  | 124 | $0.854 |  5150 | 15 | 2 (enum + Rbac) |
| codex-base   | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
| codex-after  | PASS  |  39 | n/a | n/a | n/a | 2 (enum + Rbac) |

Per-model deltas (BEFORE → AFTER; negative values mean the AFTER arm was faster or cheaper):

| Model  | Wall Δ | Cost Δ  | Turns Δ | Verdict Δ |
|--------|-------:|--------:|--------:|-----------|
| haiku  | -42%   | -38%    | -39%    | PASS → PASS (soft-improvement) |
| opus   | -39%   | -42%    | -38%    | PASS → PASS (soft-improvement) |
| codex  |  +5%   | n/a     | n/a     | **FAIL → PASS** (hard improvement) |
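The percentages above follow from the headline table as relative change, (AFTER − BEFORE) / BEFORE. The sketch below recomputes them. The input numbers are copied from the table; the function names and the "regression" and "no change" labels are assumptions added for the example.

```python
from typing import Optional


def pct_delta(before: float, after: float) -> int:
    """Relative change from BEFORE to AFTER, rounded to a whole percent."""
    return round((after - before) / before * 100)


def classify(before_verdict: str, after_verdict: str,
             cost_delta_pct: Optional[int] = None) -> str:
    """Label a BEFORE -> AFTER pair.

    'hard improvement' and 'soft improvement' follow the table's wording;
    the other labels are hypothetical.
    """
    if before_verdict == "FAIL" and after_verdict == "PASS":
        return "hard improvement"
    if before_verdict == "PASS" and after_verdict == "FAIL":
        return "regression"
    if (before_verdict == after_verdict == "PASS"
            and cost_delta_pct is not None and cost_delta_pct < 0):
        return "soft improvement"
    return "no change"


# (wall seconds, cost USD, turns) per cell, copied from the headline table.
CELLS = {
    "haiku": {"base": (270, 0.362, 59), "after": (157, 0.226, 36)},
    "opus": {"base": (204, 1.481, 24), "after": (124, 0.854, 15)},
}

for model, arms in CELLS.items():
    deltas = [pct_delta(b, a) for b, a in zip(arms["base"], arms["after"])]
    print(model, deltas, classify("PASS", "PASS", cost_delta_pct=deltas[1]))
```

Codex reports no cost or turn data, so only its wall-time delta (37 s → 39 s) and its verdict change can be computed.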

Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
verdict-flip + two consistent ~40% cost reductions on the same
task — clear, replicable signal that the AGENTS.md recipe addition is
agent-load-bearing.

## What's interesting

### 1. Discriminating verdict-flip on codex-base

Codex's sandbox is stricter than claude's about workspace HOME — it
blocked the gradle wrapper from downloading its distribution
(`~/.gradle` was the throwaway `mktemp` HOME, no cache, no network).
**Codex never ran a successful verification step itself.** With no
local feedback signal, codex-base shipped a one-file diff (enum-only)
and reported "DONE", confident it was complete.

Codex-after, given the explicit AGENTS.md recipe, edited both files
without ever running gradle. The harness's out-of-band verifier ran
both diffs against a real Java 21 + populated gradle cache, and:

- codex-base verifier: `BUILD FAILED` — 4 tests failed because
  `RbacOperationSemantics`'s static initializer aborted at class load
  with `Missing RBAC semantics for operations: [LIST_NAMESPACE_TABLES_RECURSIVE]`
- codex-after verifier: `BUILD SUCCESSFUL` — all
  `*PolarisAuthorizer*` tests pass.

This is exactly the verdict-discriminating result the harness is
designed to surface: an AGENTS.md change with no functional code
delta turned a deterministic FAIL into a deterministic PASS for one
of the three agents tested. Reviewable, reproducible, attributable.

### 2. Soft-improvement on claude arms (no leak)

Both claude haiku-base and opus-base PASSed without seeing the
AGENTS.md recipe. They derived the RbacOperationSemantics gotcha
from code:

- Haiku-base ran `./gradlew :polaris-core:test --tests "*PolarisAuthorizer*"`,
  saw the static initializer's `Missing RBAC semantics for operations`
  exception, then edited the second file. 59 turns of search /
  test / iterate.
- Opus-base inspected `RbacOperationSemanticsTest` directly, noticed
  the `allOperationsHaveExplicitRbacSemantics` assertion, and edited
  both files in 24 turns.

Both passed **without any leak from the AFTER recipe**: I grepped each
BEFORE arm's full transcript for the recipe's distinctive phrases
("recipe", "extension recipes", "view-snapshot", "attribute key")
and found zero hits. Hygiene mitigation (a) (fresh-process
CLI dispatch with scrubbed HOME) held.
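The leak check can be approximated with a small phrase scan like the one below. The phrase list comes from the description above; the function names, the case-insensitive matching, and the idea that transcripts are plain-text files are assumptions for illustration.

```python
from pathlib import Path
from typing import Iterable

# Distinctive phrases from the AFTER-only recipe section.
RECIPE_PHRASES = ("recipe", "extension recipes", "view-snapshot", "attribute key")


def find_leaks(transcript: str,
               phrases: Iterable[str] = RECIPE_PHRASES) -> list[str]:
    """Return the phrases that appear in the transcript (case-insensitive)."""
    lowered = transcript.lower()
    return [phrase for phrase in phrases if phrase.lower() in lowered]


def scan_transcripts(paths: Iterable[Path]) -> dict[str, list[str]]:
    """Map each transcript path with at least one hit to its matched phrases."""
    hits: dict[str, list[str]] = {}
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        found = find_leaks(text)
        if found:
            hits[str(path)] = found
    return hits
```

An empty result from `scan_transcripts` over the BEFORE-arm transcripts corresponds to the "zero hits" outcome. A substring scan like this can only detect verbatim leakage, not paraphrased content.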

The cost reduction on the AFTER arms comes from the recipe shortening
the investigation: the agents go straight to the right answer without
a discovery loop. The ~40% reduction in cost, time, and turns across
both claude models is the measurable value of the documentation
change. The change is reviewable *because* of the cost delta, not in
spite of it.

Checklist

🛡️ Don't disclose security issues! (contact security@apache.org)
🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
🧪 Added/updated tests with good coverage, or manually tested (and explained how)
💡 Added comments for complex logic
🧾 Updated CHANGELOG.md (if needed)
📚 Updated documentation in site/content/in-dev/unreleased (if needed)

Initial seed task bank targets the highest-friction extension surfaces identified from PR / mailing-list research: authorizer SPI / RBAC, events listener architecture, federation factory, predefined policy registry. Includes a discriminating typed-attribute task that exercises Java type-erasure / TypeToken handling. This commit only adds the skill itself. .agents/ remains gitignored for other tools; only this skill subdir is unignored. .meta-eval/ run artifacts stay gitignored.

github-project-automation Bot added this to Basic Kanban Board May 21, 2026

github-project-automation Bot moved this to PRs In Progress in Basic Kanban Board May 21, 2026

jbonofre self-requested a review May 21, 2026 09:22

dennishuo wants to merge 1 commit into apache:main from dennishuo:dhuo-polaris-extensibility-eval-skill
