Skip to content

manifest/bazel: nested-workspace + Bazel-native Maven extraction#1342

Open
Simon (simonhj) wants to merge 27 commits into
v1.xfrom
simon/bazel-subworkspace-discovery-v1x
Open

manifest/bazel: nested-workspace + Bazel-native Maven extraction#1342
Simon (simonhj) wants to merge 27 commits into
v1.xfrom
simon/bazel-subworkspace-discovery-v1x

Conversation

@simonhj
Copy link
Copy Markdown

@simonhj Simon (simonhj) commented May 28, 2026

Summary

Rewrites socket manifest bazel's Maven extraction pipeline so it (a)
discovers every workspace under the scan root, not just cwd, and (b)
relies on Bazel-native commands for repo enumeration instead of static
Starlark regex parsing.

The former version was a bit over engineered, it was doing the following things wrong:

  • Parsing StarLark files to find maven repos, turns out there was a programmatic way of doing it.
  • Repeating server side work by parsing maven_install.json.
  • Not handling nested bazel modules correctluy.

After some more iteration I have landed on a better flow that also more easlily extends to other languages:

1. Walk the scan root for every workspace (MODULE.bazel / WORKSPACE /
   WORKSPACE.bazel). Caller-supplied prune policy.
2. Per workspace, discover Maven hubs:
   - Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
   - WORKSPACE: probe conventional names (@maven, @maven_install,
     @maven_dev, @unpinned_maven, @maven_unpinned). Tri-state
     classifier (populated / empty / not-defined).
3. Per populated candidate, run per-repo metadata cquery:
     attr("tags", "\bmaven_coordinates=", @<repo>//...)
     ∪ attr("maven_coordinates", ".+", @<repo>//...)
     ∪ attr("maven_url", ".+", @<repo>//...)
   --output=jsonproto --keep_going
4. Aggregate across all workspaces; dedup by full Maven coordinate.
5. Write synthesized maven_install.json.

The past version was the result my (shrinking) bazel ignorance and maybe trusting the AI a little bit too much :)


Note

Medium Risk
Large change to how SBOMs are produced (many Bazel subprocesses, timeouts, partial results); mitigated by extensive tests and isolated output_user_root, but wrong hub/cquery behavior could miss or skew dependency lists.

Overview
Rewrites socket manifest bazel Maven extraction so it scans every Bazel workspace under the repo (not only cwd) and drives discovery/extraction through Bazel commands instead of parsing Starlark or reusing unsorted_deps.json / cached bazel query probe output.

Discovery: A new workspace walker finds roots with MODULE.bazel / WORKSPACE / WORKSPACE.bazel (caller-supplied ignore rules; CLI wires IGNORED_DIRS plus bazel-*). Per workspace, Maven hubs come from bazel mod show_extension on rules_jvm_external’s maven extension (Bzlmod), with conventional-name cquery probes and a populated / empty / not-defined classifier for legacy WORKSPACE mode and extras from --bazel-maven-repo=.

Extraction: Each hub gets a metadata cquery (jsonproto, union on maven_coordinates tags/attrs/maven_url), parsed in new bazel-cquery.mts. Artifacts aggregate across workspaces and dedupe by full Maven coordinate before writing synthesized maven_install.json. Bazel runs use a per-invocation --output_user_root, with shutdown + tempdir cleanup and fresh roots after per-repo timeouts so hung servers don’t block the scan.

Supporting changes: Maven repo discovery module shrinks to show_extension parsing + probe classification; query runner adds runBazelModShowMavenExtension, shared startup flags, and buildMavenProbeFor; PyPI keeps parseVisibleRepoCandidates locally; ExtractedArtifact.ruleKind is a string for arbitrary rule classes from cquery.

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

@simonhj Simon (simonhj) force-pushed the simon/bazel-subworkspace-discovery-v1x branch from 20957bc to 23e2f96 Compare May 28, 2026 19:31
@simonhj
Copy link
Copy Markdown
Author

Claude (@claude) review once

Comment thread src/commands/manifest/bazel/bazel-workspace-walk.mts Outdated
Comment thread src/commands/manifest/bazel/extract_bazel_to_maven.mts Outdated
Comment thread src/commands/manifest/bazel/extract_bazel_to_maven.mts Outdated
Comment thread src/commands/manifest/bazel/extract_bazel_to_maven.mts Outdated
Comment thread src/commands/manifest/bazel/extract_bazel_to_maven.mts
Comment thread src/commands/manifest/bazel/extract_bazel_to_maven.mts
@simonhj Simon (simonhj) force-pushed the simon/bazel-subworkspace-discovery-v1x branch from cf78175 to bec0ccf Compare May 30, 2026 19:53
…ub-workspace discovery

The existing bazel-query discovery path only inspects MODULE.bazel /
WORKSPACE at the invocation cwd. Ruleset repos with per-example
sub-workspaces (rules_kotlin/examples, rules_js/examples, rules_rust,
rules_python) declare additional Maven artifacts in nested MODULE.bazel
projects with their own maven_install.json lockfiles. Those files were
silently dropped, leaving the CLI's SBOM a strict subset of what the
server-side depscan parser already returns from the same tree.

Add a walker that finds every checked-in maven_install.json under cwd
(pruning .git, node_modules, .socket-auto-manifest, and Bazel's
bazel-* convenience symlinks into <output_base>), parses each via the
existing parseUnsortedDepsJson v2-lockfile path, and merges the
artifacts into the SBOM after the bazel-query extraction step. Merge
is keyed by mavenCoordinates so the root workspace's lockfile (which
bazel-query already extracts) does not double-count; conflicting
group:artifact versions across sub-workspaces continue to surface as
the existing loud-failure error in normalizeToMavenInstallJson.

Verified against bazel-bench/oss/rules_kotlin: walker now surfaces all
10 examples/*/maven_install.json files and merges 393 unique artifacts
into the SBOM beyond what the root @kotlin_rules_maven discovery
returns. No regression on tink-java (0 lockfiles) or protobuf (1 root
lockfile, deduped against bazel-query's @maven extraction).
…er walker already covers it

The CLI was walking the tree for **/maven_install.json and **/*_maven_install.json
lockfiles and merging them into its output. The server-side scan walker matches the
same pattern natively via getReportSupportedFiles, so the CLI re-reading these files
duplicated work and produced output that was a strict subset of what the walker
already saw when the scan was uploaded.

Removes:
- bazel-lockfile-discovery.mts (196 lines)
- bazel-lockfile-discovery.test.mts (241 lines)
- extract_bazel_to_maven step 5b (33 lines): the merge-back-into-allArtifacts loop

The .socket-auto-manifest/maven_install.json the CLI emits is still picked up by
the same walker — that composition stays intact. After this change the CLI emits
only what running bazel produces (the complement of the walker's lockfile coverage).
…very

`findWorkspaceRoots` walks the tree from cwd and returns every directory
containing MODULE.bazel / WORKSPACE / WORKSPACE.bazel. Monorepos host
multiple workspace roots (e.g. examples/<name>/MODULE.bazel, mobile/
MODULE.bazel under an otherwise non-Bazel root); the per-workspace
algorithm in the orchestrator runs once per discovered root.

Pruning matches the previous lockfile walker: skip the usual non-workspace
directories (.git, node_modules, .socket-auto-manifest, etc.), Bazel's
`bazel-*` output_base symlinks (so we never recurse into tens of GiB of
generated state), and `dist*` build-output directories. Caps `MAX_WALK_DEPTH`
and `MAX_WORKSPACE_ROOTS` guard against pathological inputs and symlink
loops.

Pure-function module with no Bazel calls; unit tests use a tmpdir
fixture tree and cover the root-only, nested, prune, symlink, and
sort-determinism cases.
…+ probe primitives

Drop all static parsing of MODULE.bazel / WORKSPACE / *.bzl sources.
Bazel itself sees those files via `mod show_extension` and `cquery`; the
CLI no longer needs to interpret Starlark.

`parseShowExtensionOutput` consumes the text-format report from
  bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
and returns the hub repos (items annotated with `(imported by ...)`).
Generated per-artifact bullets are skipped; `DEBUG:` / `WARNING:` lines
are tolerated; the parser stops at the next `## ` section header so
multi-extension reports don't cross-contaminate.

`classifyProbeResult` turns a raw probe outcome into a tri-state status:
  - populated: code=0 + non-empty stdout
  - empty:     code=1 + "no targets found beneath"
  - not-defined: code=1 + "No repository visible" / "no such package",
                  or code=0 + empty stdout (WORKSPACE-mode silent miss)
The orchestrator treats `empty` and `not-defined` uniformly as skips; the
distinction is preserved for the sidecar status report.

`CONVENTIONAL_MAVEN_REPO_NAMES` exposes the names the legacy WORKSPACE
path probes (`maven`, `maven_install`, `maven_dev`, `unpinned_maven`,
`maven_unpinned`). `--bazel-maven-repo=` extras are appended by the
orchestrator (sibling todo).

Deleted exports: `parseMavenRepoCandidates`, `parseVisibleRepoCandidates`,
`validateMavenRepo`, `discoverMavenRepos`. Their replacements live in the
new primitives above; the orchestrator rewrite that wires them up lands
in a follow-up layer. `extract_bazel_to_maven.mts` does not typecheck
in this intermediate state — fixed in the orchestrator commit.

Tests cover the parser fixture (hub vs generated, separator variants,
multi-section reports), the tri-state classifier (every documented
input), and the verbose-logging contract for `probeCandidate`.
…tate probe

bazel-query-runner now centralises startup-flag construction so every
spawn — query, cquery, mod show_extension, mod dump_repo_mapping —
threads `--bazel-rc`, `--output_user_root`, and `--output_base`
consistently. The new optional `outputUserRoot` field on
`BazelQueryOptions` is the Maven path's hook for per-invocation server
isolation; the orchestrator (next commit) mkdtemp's a fresh path and
will reap the server via `bazel shutdown` + `rm -rf` on success and on
timeout, so timed-out servers no longer leak across CLI invocations.

Add `runBazelModShowMavenExtension`: invokes
  bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
to enumerate Maven hubs directly from the rules_jvm_external extension
report, replacing the over-enumerating `dump_repo_mapping` surface on
the Maven path. `runBazelModShowVisibleRepos` is kept around for the
legacy PyPI extractor, which has not been rescoped yet.

Replace the Maven-side `buildProbeFor` (which emitted a kind-only
`kind("jvm_import rule|aar_import rule", @repo//:*)` query) with
`buildMavenProbeFor`, a lightweight `cquery '@<name>//... --output=label
--keep_going'` presence check whose result feeds the new tri-state
classifier in bazel-repo-discovery. Kind-only filtering missed
POM-only / native / AAR-without-aar_import artefacts and any future
rules_jvm_external rule shape; the metadata filter is now applied by
the per-repo extraction cquery (next layer), not by the probe.

Update `buildPypiProbeFor`'s return shape to include stderr so it
satisfies the new `RepoProbe` type contract. Move
`parseVisibleRepoCandidates` and the `ValidationResult` type into
bazel-pypi-discovery (their only remaining consumer); the Maven module
no longer carries dump_repo_mapping-shaped code.

Tests cover the new argv shapes for every spawn surface, the
outputUserRoot startup-flag placement (before subcommand), the
Maven probe argv (cquery + @repo//... + --output=label + --keep_going),
and the full result-triple propagation (code/stdout/stderr) that the
tri-state classifier needs.
`runMetadataCqueryForRepo` executes the per-repo extraction cquery and
returns a structured outcome (`ok` / `partial` / `timeout` / `empty` /
`error`) so the orchestrator can populate sidecar status without
custom error plumbing per call site. The cquery target expression is
the union of three predicates — `attr("tags", "\bmaven_coordinates=",
...)`, `attr("maven_coordinates", ".+", ...)`, and `attr("maven_url",
".+", ...)`. That matches rules_jvm_external's `jvm_import` /
`aar_import` shapes, Bazel-native `java_library` with direct
`maven_coordinates`, and POM-only / source-jar shapes that carry only
`maven_url`. Word-boundary `\b` in the tags predicate prevents matches
on values like `pre_maven_coordinates=fake`.

`parseCqueryJsonproto` is defensive about the jsonproto encoding:
dispatches on `attribute[].type`, accepts both camelCase
(`stringValue`, `stringListValue`) and snake_case (`string_value`,
`string_list_value`) payload keys, and tolerates both the Bazel 5+
envelope shape (`{ "results": [{ "target": {...} }] }`) and the older
per-line streamed shape. Coordinate extraction prefers the direct
`maven_coordinates` attribute; falls back to scanning `tags` for
`maven_coordinates=G:A:V`. Provenance lands in `sourceRepo` as
`<workspace-rel-path>:<repoName>` (or just `<repoName>` at the root),
so the orchestrator's dedup can attribute artifacts back to their
discovery site.

Timeout handling: spawn rejections with `timedOut` / `killed` /
`SIGTERM` / `SIGKILL` map to `status: 'timeout'`. The runner does NOT
delete the outputUserRoot — server lifecycle (reap via
`bazel shutdown` + `rm -rf`) is the orchestrator's concern so that a
single tempdir can hold multiple per-repo runs.

Also widen `ExtractedArtifact.ruleKind` from the literal
`'jvm_import' | 'aar_import'` union to `string`. The legacy text-format
parsers only ever set those two values, but the metadata cquery
returns whatever `ruleClass` Bazel reports (`java_library`,
`kt_jvm_import`, any future rules_jvm_external rule). Existing
consumers only read the field diagnostically; nothing else changes.

Tests cover the parser (envelope, per-line stream, snake_case
fallback, direct-vs-tag preference, missing-coordinate skip, empty
input), the argv builder (target expression union, startup-flag
placement, `--bazel-flag` placement, invocationFlags order), and the
runner's status classification including the spawn-timeout branch.
…thm in a tree walk

`extractBazelToMaven` now walks the scan root for every workspace
(MODULE.bazel / WORKSPACE / WORKSPACE.bazel) and runs the per-workspace
extraction algorithm in each one. Monorepos like rules_kotlin
(examples/<name>/MODULE.bazel) and projects with mobile sub-workspaces
(mobile/MODULE.bazel under a non-Bazel root) are no longer
silently dropped to the root-only path.

Per workspace:
  1. Detect Bzlmod vs WORKSPACE mode.
  2. Discover candidate Maven hubs:
       - Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven,
         parsed via parseShowExtensionOutput.
       - WORKSPACE (or Bzlmod fallback): probe the conventional names
         (maven, maven_install, maven_dev, unpinned_maven, maven_unpinned)
         plus any customer-supplied extras via the tri-state classifier.
  3. Per populated candidate: run the metadata cquery
     (`attr("tags", "\bmaven_coordinates=", @<repo>//...)` ∪ direct
     `maven_coordinates` / `maven_url` attrs) and accept the parsed
     artefacts.
  4. Aggregate, then dedup across workspaces by full Maven coordinate.

Server isolation is now invariant: every Bazel invocation runs under a
per-CLI-call --output_user_root=<tempdir>. On per-repo cquery timeout
the orchestrator reaps the server (`bazel shutdown`) and `rm -rf`'s the
tempdir, then mints a fresh one for subsequent repos — a single bad
hub no longer cascades into the rest of the run. The finally-block
cleanup reaps every tempdir that was minted, including the last one.

Sidecar `manifest-status.json` lands beside the synthesized
`maven_install.json`. Each entry records the repo's classified status
(ok / partial / timeout / empty / error), artifact count, and duration,
so the server-side can surface partial results to the customer. The
top-level `complete: false` flag fires iff any repo timed out.

Deleted: the unsorted_deps.json fast path (`extractFromOneRepo`,
`bazelExternalDir`, `isForceQueryFallbackEnabled` env knob) — the
metadata cquery returns the same GAVs the fast path used to recover,
without depending on bazel-out symlinks or generated artefacts.
Deleted: the lockfile merge (already done in a previous commit on this
branch); deleted: the kind-only probe and dump_repo_mapping enumeration.

The orchestrator's `ExtractBazelOptions` now accepts
`extraMavenRepoNames` (legacy WORKSPACE non-conventional hub names) and
`perRepoTimeoutMs` (per-repo cquery cap). The CLI flag wiring lands in
a sibling commit; existing call sites continue to pass the same fields
they did before.

Existing `extract_bazel_to_maven.test.mts` is pinned to the old
unsorted_deps fast path and is replaced wholesale in the next commit
(test layer).
…e pipeline

The previous tests pinned the legacy unsorted_deps.json fast path,
kind-only probes, and dump_repo_mapping enumeration. The new tests
mock the orchestrator's three external collaborators —
findWorkspaceRoots, runBazelModShowMavenExtension, runMetadataCqueryForRepo —
and assert on the contract that matters: end-to-end Bzlmod and
WORKSPACE-mode flows, the per-repo cquery loop, cross-workspace
coordinate dedup, the timeout → re-mint loop, sidecar
`manifest-status.json` shape, and `extraMavenRepoNames` threading.

Pure-function `normalizeToMavenInstallJson` keeps a focused trio of
unit tests (dedup, version-conflict, sha256-preservation). The
fixture-driven .socket.facts.json non-emission assertion stays so the
Maven-path-vs-facts-path invariant is exercised.

Also patch the PyPI test mock: parseVisibleRepoCandidates moved from
bazel-repo-discovery to bazel-pypi-discovery in a previous commit, so
the test's vi.mock now mirrors the actual export surface. The probe
fixture grows a `stderr` field to match the new RepoProbe contract.
…GNORED_DIRS

`findWorkspaceRoots` no longer hardcodes the directory-prune set —
callers pass `ignoreDirNames: ReadonlySet<string>` and
`ignoreDirPrefixes: readonly string[]` via options. Neither defaults
to anything; absent means no pruning. This keeps the walker decoupled
from any particular ignore policy and avoids duplicating the
codebase-wide `IGNORED_DIRS` list.

`src/utils/glob.mts` exports `IGNORED_DIRS` so the orchestrator can
compose it with Bazel-specific extras. The orchestrator's composed
set: `IGNORED_DIRS` plus `.hg`, `.idea`, `.pnpm-store`,
`.socket-auto-manifest`, `.svn`, `.vscode`; prefixes `bazel-` and
`dist`.

Also tighten `MAX_WALK_DEPTH` from 16 → 8. Deepest workspace marker
observed across the surveyed OSS corpus is 9 (bazel-self test
fixtures); deepest in realistic application code is 7 (checkmk's
thirdparty layout). The cap gives one level of headroom over the
realistic max while still guarding against pathological symlink loops
that slipped past any prefix prune the caller supplied.

Walker test rewritten against the new injected API: covers the
no-prune-by-default case (`node_modules/MODULE.bazel` surfaces unless
the caller ignores `node_modules`), injected name and prefix prunes,
and the bazel-* symlink case under the prefix injection.
No consumer reads it today. The orchestrator still tracks per-repo
timeouts to decide ExtractBazelResult.ok and to reap+remint the
output_user_root, but no longer serialises the per-workspace /
per-repo status report to disk.
Walker:
- Lower MAX_WORKSPACE_ROOTS from 256 to 16; well above realistic
  monorepo counts, tighter guard against pathological inputs.

Orchestrator:
- Inject the workspace-walker prune policy via ExtractBazelOptions
  (`ignoreDirNames`, `ignoreDirPrefixes`) instead of hardcoding it
  inside the orchestrator. The CLI command now owns the policy and
  supplies the codebase-wide IGNORED_DIRS plus the Bazel-specific
  bits (`.socket-auto-manifest`, VCS/IDE dirs, `bazel-*` prefix).
- Drop the `dist*` prefix from the prune policy — it's a JS/web
  convention, not Bazel-specific, and shouldn't be hardcoded.
- Drop the Bzlmod-mode "defensive fallback" probe over the
  conventional Maven hub names. `bazel mod show_extension` is the
  authoritative source under Bzlmod; customer-supplied extras
  (`--bazel-maven-repo=`) still get probed. WORKSPACE mode keeps
  the probe path unchanged. The fallback had no real-world
  justification.
- Hoist the per-workspace `buildQueryOpts` helper to module scope
  (eslint consistent-function-scoping).
- Verbose-mode logging in `reapBazelServer` and `removeTempdir`
  catch blocks so the operator can see why a cleanup step failed
  instead of having it swallowed silently.

Lint:
- bazel-query-runner.mts: add missing `await` on three
  `return runBazelOneShot(...)` sites (@typescript-eslint/return-await).
- bazel-repo-discovery.test.mts: hoist five inline `probe` closures
  to module-scope named consts.
- extract_bazel_to_maven.test.mts: hoist `readManifest` and an
  inline mock factory to module scope; reorder type imports.
- extract_bazel_to_pypi.mts: merge duplicate imports of
  `bazel-pypi-discovery.mts`.
- extract_bazel_to_maven.mts: reorder type imports.
@simonhj Simon (simonhj) force-pushed the simon/bazel-subworkspace-discovery-v1x branch from bec0ccf to 414a9a6 Compare May 31, 2026 11:54
@simonhj Simon (simonhj) marked this pull request as ready for review May 31, 2026 14:23
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 7 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: WORKSPACE custom hubs not found
    • Added --bazel-maven-repo flag to cmd-manifest-bazel.mts (with socket.json defaults support) and wired it through as extraMavenRepoNames to both the direct CLI and auto-manifest pathways.
  • ✅ Fixed: Empty show_extension skips fallback
    • Changed showExtensionSucceeded to only be set true when parseShowExtensionOutput returns a non-empty hub list, so a zero-exit with no parsed hubs now correctly falls back to conventional name probing.

Create PR

Or push these changes by commenting:

@cursor push 6019185b06
Preview (6019185b06)
diff --git a/src/commands/manifest/bazel/cmd-manifest-bazel.mts b/src/commands/manifest/bazel/cmd-manifest-bazel.mts
--- a/src/commands/manifest/bazel/cmd-manifest-bazel.mts
+++ b/src/commands/manifest/bazel/cmd-manifest-bazel.mts
@@ -51,6 +51,12 @@
       description:
         'Flags forwarded to every bazel invocation (single quoted string)',
     },
+    bazelMavenRepo: {
+      type: 'string',
+      isMultiple: true,
+      description:
+        'Extra Maven hub repo name(s) to probe; repeatable. Use for legacy WORKSPACE projects whose maven_install uses a non-conventional name.',
+    },
     bazelOutputBase: {
       type: 'string',
       description: 'Bazel --output_base for read-only-cache CI environments',
@@ -202,7 +208,7 @@
     sockJson?.defaults?.manifest?.bazel,
   )
 
-  const { ecosystem } = cli.flags
+  const { bazelMavenRepo, ecosystem } = cli.flags
   let { bazel, bazelFlags, bazelOutputBase, bazelRc, out, verbose } = cli.flags
 
   // Set defaults for any flag/arg that is not given. Check socket.json first.
@@ -260,6 +266,12 @@
     }
   }
 
+  // Compose extra Maven repo names from the CLI flag and socket.json defaults.
+  const extraMavenRepoNames: string[] = [
+    ...(Array.isArray(bazelMavenRepo) ? bazelMavenRepo : []),
+    ...(sockJson.defaults?.manifest?.bazel?.bazelMavenRepo ?? []),
+  ].filter(Boolean)
+
   if (verbose) {
     logger.group('- ', parentName, config.commandName, ':')
     logger.group('- flags:', cli.flags)
@@ -318,6 +330,9 @@
         bazelRc: bazelRc as string | undefined,
         bin: bazel as string | undefined,
         cwd,
+        extraMavenRepoNames: extraMavenRepoNames.length
+          ? extraMavenRepoNames
+          : undefined,
         ignoreDirNames: BAZEL_WALKER_IGNORE_DIR_NAMES,
         ignoreDirPrefixes: BAZEL_WALKER_IGNORE_DIR_PREFIXES,
         out: out as string,

diff --git a/src/commands/manifest/bazel/extract_bazel_to_maven.mts b/src/commands/manifest/bazel/extract_bazel_to_maven.mts
--- a/src/commands/manifest/bazel/extract_bazel_to_maven.mts
+++ b/src/commands/manifest/bazel/extract_bazel_to_maven.mts
@@ -260,8 +260,12 @@
   if (mode.bzlmod) {
     const extResult = await runBazelModShowMavenExtension(queryOpts)
     if (extResult.code === 0) {
-      candidates.push(...parseShowExtensionOutput(extResult.stdout))
-      showExtensionSucceeded = true
+      const parsedHubs = parseShowExtensionOutput(extResult.stdout)
+      candidates.push(...parsedHubs)
+      // Only consider successful when hubs were actually parsed; a zero
+      // exit with no recognized hub names (format mismatch, empty report)
+      // should still fall back to conventional probing.
+      showExtensionSucceeded = parsedHubs.length > 0
       if (verbose) {
         logger.log(
           `[VERBOSE] workspace ${workspaceRoot}: show_extension yielded`,

diff --git a/src/commands/manifest/generate_auto_manifest.mts b/src/commands/manifest/generate_auto_manifest.mts
--- a/src/commands/manifest/generate_auto_manifest.mts
+++ b/src/commands/manifest/generate_auto_manifest.mts
@@ -129,6 +129,7 @@
       bazelRc: bazelConfig?.bazelRc,
       bin: bazelConfig?.bazel ?? bazelConfig?.bin,
       cwd,
+      extraMavenRepoNames: bazelConfig?.bazelMavenRepo,
       out: bazelConfig?.out ?? cwd,
       outLayout: 'flat',
       verbose: Boolean(bazelConfig?.verbose) || verbose,

diff --git a/src/utils/socket-json.mts b/src/utils/socket-json.mts
--- a/src/utils/socket-json.mts
+++ b/src/utils/socket-json.mts
@@ -42,6 +42,7 @@
       bazel?: {
         bazel?: string | undefined
         bazelFlags?: string | undefined
+        bazelMavenRepo?: string[] | undefined
         bazelOutputBase?: string | undefined
         bazelRc?: string | undefined
         bin?: string | undefined

You can send follow-ups to the cloud agent here.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

showExtensionSucceeded
? extras
: [...CONVENTIONAL_MAVEN_REPO_NAMES, ...extras]
).filter(name => !seen.has(name))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WORKSPACE custom hubs not found

High Severity

Legacy WORKSPACE projects whose maven_install uses a non-conventional name are no longer discovered: candidate discovery only probes fixed hub names plus extraMavenRepoNames, and the CLI never passes that option. Repos that the old Starlark scan found (e.g. maven_legacy_app) are skipped, so extraction can report no Maven ecosystem despite a configured hub.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

const ruleKind = rule.ruleClass ?? rule.rule_class ?? 'unknown'
out.push({
deps: [],
mavenCoordinates: extracted.coord,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lockfile dependency graph always empty

Medium Severity

Every artifact from metadata cquery is emitted with an empty deps list, and cquery does not request rule deps attributes. normalizeToMavenInstallJson therefore writes an empty dependencies map even when Bazel rules declare transitive Maven edges, unlike the prior probe/unsorted_deps.json path.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

buildMetadataCqueryExpr(repoName),
'--output=jsonproto',
'--proto:output_rule_attrs=tags,maven_coordinates,maven_url',
'--keep_going',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jar shasums no longer extracted

Medium Severity

Metadata cquery never surfaces maven_sha256 (typically in tags), so synthesized artifacts lack checksums and normalizeToMavenInstallJson writes empty shasums.jar entries. The old build-output parser populated jar digests from rule attributes and tags.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

showExtensionSucceeded
? extras
: [...CONVENTIONAL_MAVEN_REPO_NAMES, ...extras]
).filter(name => !seen.has(name))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty show_extension skips fallback

High Severity

On Bzlmod workspaces, when bazel mod show_extension exits 0 but parseShowExtensionOutput returns no hub names, discovery still treats the command as successful and only probes extraMavenRepoNames. The conventional @maven hub probe list is skipped, so Maven can be missed despite a healthy Bazel exit.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

artifactCount: deduped.length,
manifestPath,
ok: true,
ok: !anyTimeout,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Success logged despite incomplete run

Medium Severity

After a per-repo cquery timeout, the orchestrator returns ok: false but still calls logger.success whenever any artifacts were written. Callers see a success message while the result indicates an incomplete extraction.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

`attr("tags", "\\bmaven_coordinates=", ${r})`,
`attr("maven_coordinates", ".+", ${r})`,
`attr("maven_url", ".+", ${r})`,
].join(' union ')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maven_url-only rules dropped

Medium Severity

The metadata cquery union includes rules that only set maven_url, but extractMavenCoordinate returns nothing without a coordinate. Those targets are analyzed and then discarded, so POM-only or coordinate-less Maven shapes never reach maven_install.json.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

`[VERBOSE] workspace walker: hit MAX_WORKSPACE_ROOTS cap (${MAX_WORKSPACE_ROOTS}); truncating walk`,
)
}
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workspace cap truncates silently

Medium Severity

findWorkspaceRoots stops after 16 workspace markers with only a verbose log. Nested Bazel modules beyond that cap are never scanned, so Maven hubs in omitted trees are invisible with no failure outcome.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.

…in the cquery

The per-repo metadata cquery only requested coordinate attrs, so every
artifact shipped with an empty deps list and the synthesized
maven_install.json carried no dependency edges. Request deps/exports/
runtime_deps and resolve those label edges into versionless Maven
coordinates at parse time, while the repo name is still in scope.

- Read LABEL_LIST attributes (Bazel serializes label lists into the
  string-list payload); a STRING_LIST-only reader returned nothing and
  left the graph empty.
- Resolve each edge against this repo's own targets by full label, with a
  unique-suffix fallback. Edges to hub-prefixed targets that resolve to
  nothing (missing or ambiguous) are reported as unresolvedLabels and flip
  the repo status to partial — a scan must not silently drop edges it knows
  it lost. Resolution happens with the repo name in scope so no malformed
  provenance label is ever built.
- Normalize coordinates by stripping only the trailing version, preserving
  packaging/classifier (g:a, g:a:aar, g:a:jar:classifier) to match the
  server parser's coordinate keying.
- Drop the maven_url-only selector: those rules carry no coordinate, so
  selecting them only to discard them downstream was wasted analysis.
- normalizeToMavenInstallJson now consumes resolved coordinate deps directly
  instead of re-resolving labels.
Two correctness follow-ups from review of the graph-extraction change:

- dedupArtifactsByCoord kept only the first artifact per full coordinate,
  which silently dropped the resolved deps of later occurrences. Now that
  edges are populated, the same coordinate emitted by a hub in two
  workspaces can carry different resolved edge sets — union them instead of
  discarding all but the first.
- The already-a-coordinate dep fallback treated any ':'-bearing non-@ label
  as a Maven coordinate; exclude '//'-prefixed package-relative Bazel labels
  so a //pkg:thing edge isn't mis-recorded as a dependency.
…dead parser

normalizeToMavenInstallJson now builds the manifest in two phases so the
emitted graph survives the server parser, which rejects malformed
coordinates and edges referencing unlisted artifacts (and can abort after
enough errors):
- Phase 1 validates each versionless key (2-4 non-empty segments) and a
  non-empty version before accepting an artifact; a coordinate like 'g:a:'
  strips to a valid-shaped key but an empty version, so require both.
- Phase 2 emits only edges whose source and target are both emitted keys.
- Dropped artifacts and pruned edges are returned (and warned) rather than
  silently lost post-upload.

Also drop the dead shasums machinery (the server hydrates checksums from
precrawl and never reads them) and delete the now-unreachable
bazel-build-parser module: the legacy bazel-query/unsorted_deps text
parsers were superseded by the metadata cquery. The ExtractedArtifact type
moves into bazel-cquery.mts (its mavenSha256/mavenUrl fields go with the
dead code).
…arser

The query-output text/JSON fixtures were consumed only by the now-deleted
bazel-build-parser test; nothing references them anymore.
…status

Replace the single aggregated maven_install.json with one manifest per
(workspace, hub), written best-effort so a single wedged hub never discards
the manifests every other hub produced. Paths mirror the workspace tree:
<manifestDir>/<relPath>/<name>.json, where <name> is maven_install.json for
a hub named 'maven' else <hub>_maven_install.json (the server walker's glob).
Per-hub output also dissolves the cross-workspace version-conflict hard-fail
structurally — each hub is coursier-resolved to one version per g:a, so the
same g:a at different versions in two workspaces lands in separate files.

Replace the overloaded ok boolean with a discriminated run status —
complete / partial / noEcosystem / hardFailure — because best-effort-per-hub
produces outcomes a single bool conflates. A hub that times out, errors,
drops unresolved edges, or has malformed coordinates marks the run partial
(manifests still written and worth uploading); zero manifests with an
ecosystem present is a hardFailure; absent ecosystem is noEcosystem.
logger.success only fires on complete; partial warns honestly.

Thread the status through both callers: generate_auto_manifest pushes
manifests on complete and partial (warning on partial) and only throws on
hardFailure; cmd-manifest-bazel's evaluateEcosystemOutcomes gates on
manifestPaths length, counts partial as produced output, preserves the
per-mode noEcosystem branching, and normalizes the single-manifestPath PyPI
result into the shared status vocabulary.
Two follow-ups from review of the per-hub change:

- Wrap the per-hub manifest write in try/catch. A filesystem error
  (EACCES/ENOSPC, a path collision) thrown from writeHubManifest escaped
  the loop to the outer catch and returned hardFailure with zero
  manifestPaths, discarding the manifests other hubs had already written —
  the exact best-effort invariant the per-hub design promises. Now a write
  failure warns, counts the hub failed, and continues.
- Treat a cquery 'partial' status with no unresolved labels (a non-zero
  --keep_going exit that still yielded a usable subset) as a partial hub.
  Previously only unresolved labels flipped the hub partial, so this case
  could report the whole run complete on a known-incomplete graph.
The maven module extension generates a hub for every module that uses it,
so show_extension returned the root's own maven.install hub(s) plus the
rulesets' internal hubs (rules_jvm_external_deps, stardoc_maven, …) — which
got folded into the user's SBOM and, with per-hub output, each became its
own spurious manifest.

parseShowExtensionOutput now returns structured { name, importers } entries
(merging importers when a repo appears on multiple lines) instead of bare
names. The orchestrator keeps only hubs imported by <root> and drops the
rest, naming the dropped hub and its importing module under --verbose.

Gate the conventional-name probe fallback on the KEPT hub count, not the raw
parse count: a report listing only transitive ruleset hubs (all filtered
out) now correctly falls through to probing so a root @maven isn't missed.
Also pass --extension_usages=<root> to the maven show_extension call as a
belt-and-suspenders output reducer (the importers-filter is what guarantees
correctness).
…le comments

The extraMavenRepoNames option documented a --bazel-maven-repo flag that was
never built; the field was dead plumbing. Remove it from ExtractBazelOptions,
drop the extras read and its pass-through into discoverCandidatesForWorkspace
(collapsing the probe ternary to conventional-names-or-nothing), and delete
the comments claiming a flag wires it in. Custom-bazel-flag passthrough is a
separate feature that will re-introduce the param when it actually wires a
flag.

Also fix stale walker comments that claimed a 'dist' prefix is pruned — the
real Bazel prune prefix list is just ['bazel-'].
…covery header

Follow-up: the discovery module docstring still described the never-built
--bazel-maven-repo flag feeding the WORKSPACE probe list. No such pass-through
exists; the probe list is the fixed conventional names.
…orchestrator

The --auto-manifest caller never passed ignoreDirNames/ignoreDirPrefixes, so
its walk defaulted to empty prune sets and descended node_modules/.git and
vendored trees — a stray WORKSPACE fixture under node_modules would be picked
up as a workspace (finding A). Move the default Bazel prune policy into a
shared constant the orchestrator applies unconditionally; callers now EXTEND
it via ignoreDirNames/ignoreDirPrefixes rather than being the sole source of
it. Both entry points (explicit command and auto-manifest) now prune
identically, and the policy can't be forgotten again. Drop the now-redundant
per-command constants from cmd-manifest-bazel.
The walk had two silent-drop bounds: a 16-root early-stop and a
MAX_WALK_DEPTH=8 cap that continued past anything deeper with no log. The
depth cap was the more dangerous — its own comment admitted the deepest
corpus marker is 9, i.e. the cap sat below an observed depth and silently
dropped a first-party module that exists today, taking its Maven hub with
it.

Replace both with a single visited-directory budget (MAX_WALK_DIRS) as the
only pathological-input / symlink-loop guard; remove the depth cap entirely.
Walk children in deterministic sorted order, collect all roots found within
the budget, sort, then cap to MAX_WORKSPACE_ROOTS. Both the cap and a budget
exhaustion now logger.warn UNCONDITIONALLY (exact dropped count when the full
tree was walked, '≥' when the budget cut it short), and verbose lists the
dropped paths — truncation is never silent.
The fixture comment claimed Bazel callers pass a 'dist' prune prefix; the
real production default is just ['bazel-']. Reword to make clear 'dist' is an
arbitrary extra prefix used only to exercise the walker's generic multi-prefix
pruning, not a production policy.
Round out the verbose diagnostics so every per-hub step is observable: log
the start of each metadata cquery (repo + timeout) before it runs, complementing
the existing kept/dropped-hub, per-hub status/path, unresolved-edge,
best-effort-failure, truncation, and final-summary traces. A user diagnosing a
thin or partial SBOM can now follow the full discovery-to-write sequence.
…est-free

pypiOutcome carried the detected-but-empty PyPI stub path into manifestPaths
on a hardFailure outcome, violating the invariant that a non-success outcome
produced no usable output. Only a complete PyPI run now carries its manifest
path; hardFailure and noEcosystem carry none.
…; document hub limitation

Cross-agent review round 1:
- The registry spawn rejects on a non-zero exit, so a real --keep_going
  cquery that exits non-zero while still emitting a usable subset landed in
  the catch block, which hardcoded status 'error' and discarded the subset
  (the hub was skipped). Classify the catch outcome via classifyCqueryOutcome
  so a parsed subset becomes 'partial' and is written best-effort; only a
  genuinely empty failure stays 'error'. Timeout remains distinct.
- Document the accepted custom-Maven-hub limitation in the command help: a
  non-conventional hub name show_extension doesn't enumerate isn't discovered
  yet (the naming flag is a planned follow-up) — a known, not silent, gap.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants