manifest/bazel: nested-workspace + Bazel-native Maven extraction#1342
manifest/bazel: nested-workspace + Bazel-native Maven extraction#1342Simon (simonhj) wants to merge 27 commits into
Conversation
20957bc to
23e2f96
Compare
|
Claude (@claude) review once |
cf78175 to
bec0ccf
Compare
…ub-workspace discovery The existing bazel-query discovery path only inspects MODULE.bazel / WORKSPACE at the invocation cwd. Ruleset repos with per-example sub-workspaces (rules_kotlin/examples, rules_js/examples, rules_rust, rules_python) declare additional Maven artifacts in nested MODULE.bazel projects with their own maven_install.json lockfiles. Those files were silently dropped, leaving the CLI's SBOM a strict subset of what the server-side depscan parser already returns from the same tree. Add a walker that finds every checked-in maven_install.json under cwd (pruning .git, node_modules, .socket-auto-manifest, and Bazel's bazel-* convenience symlinks into <output_base>), parses each via the existing parseUnsortedDepsJson v2-lockfile path, and merges the artifacts into the SBOM after the bazel-query extraction step. Merge is keyed by mavenCoordinates so the root workspace's lockfile (which bazel-query already extracts) does not double-count; conflicting group:artifact versions across sub-workspaces continue to surface as the existing loud-failure error in normalizeToMavenInstallJson. Verified against bazel-bench/oss/rules_kotlin: walker now surfaces all 10 examples/*/maven_install.json files and merges 393 unique artifacts into the SBOM beyond what the root @kotlin_rules_maven discovery returns. No regression on tink-java (0 lockfiles) or protobuf (1 root lockfile, deduped against bazel-query's @maven extraction).
…er walker already covers it The CLI was walking the tree for **/maven_install.json and **/*_maven_install.json lockfiles and merging them into its output. The server-side scan walker matches the same pattern natively via getReportSupportedFiles, so the CLI re-reading these files duplicated work and produced output that was a strict subset of what the walker already saw when the scan was uploaded. Removes: - bazel-lockfile-discovery.mts (196 lines) - bazel-lockfile-discovery.test.mts (241 lines) - extract_bazel_to_maven step 5b (33 lines): the merge-back-into-allArtifacts loop The .socket-auto-manifest/maven_install.json the CLI emits is still picked up by the same walker — that composition stays intact. After this change the CLI emits only what running bazel produces (the complement of the walker's lockfile coverage).
…very `findWorkspaceRoots` walks the tree from cwd and returns every directory containing MODULE.bazel / WORKSPACE / WORKSPACE.bazel. Monorepos host multiple workspace roots (e.g. examples/<name>/MODULE.bazel, mobile/ MODULE.bazel under an otherwise non-Bazel root); the per-workspace algorithm in the orchestrator runs once per discovered root. Pruning matches the previous lockfile walker: skip the usual non-workspace directories (.git, node_modules, .socket-auto-manifest, etc.), Bazel's `bazel-*` output_base symlinks (so we never recurse into tens of GiB of generated state), and `dist*` build-output directories. Caps `MAX_WALK_DEPTH` and `MAX_WORKSPACE_ROOTS` guard against pathological inputs and symlink loops. Pure-function module with no Bazel calls; unit tests use a tmpdir fixture tree and cover the root-only, nested, prune, symlink, and sort-determinism cases.
…+ probe primitives
Drop all static parsing of MODULE.bazel / WORKSPACE / *.bzl sources.
Bazel itself sees those files via `mod show_extension` and `cquery`; the
CLI no longer needs to interpret Starlark.
`parseShowExtensionOutput` consumes the text-format report from
bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
and returns the hub repos (items annotated with `(imported by ...)`).
Generated per-artifact bullets are skipped; `DEBUG:` / `WARNING:` lines
are tolerated; the parser stops at the next `## ` section header so
multi-extension reports don't cross-contaminate.
`classifyProbeResult` turns a raw probe outcome into a tri-state status:
- populated: code=0 + non-empty stdout
- empty: code=1 + "no targets found beneath"
- not-defined: code=1 + "No repository visible" / "no such package",
or code=0 + empty stdout (WORKSPACE-mode silent miss)
The orchestrator treats `empty` and `not-defined` uniformly as skips; the
distinction is preserved for the sidecar status report.
`CONVENTIONAL_MAVEN_REPO_NAMES` exposes the names the legacy WORKSPACE
path probes (`maven`, `maven_install`, `maven_dev`, `unpinned_maven`,
`maven_unpinned`). `--bazel-maven-repo=` extras are appended by the
orchestrator (sibling todo).
Deleted exports: `parseMavenRepoCandidates`, `parseVisibleRepoCandidates`,
`validateMavenRepo`, `discoverMavenRepos`. Their replacements live in the
new primitives above; the orchestrator rewrite that wires them up lands
in a follow-up layer. `extract_bazel_to_maven.mts` does not typecheck
in this intermediate state — fixed in the orchestrator commit.
Tests cover the parser fixture (hub vs generated, separator variants,
multi-section reports), the tri-state classifier (every documented
input), and the verbose-logging contract for `probeCandidate`.
…tate probe
bazel-query-runner now centralises startup-flag construction so every
spawn — query, cquery, mod show_extension, mod dump_repo_mapping —
threads `--bazel-rc`, `--output_user_root`, and `--output_base`
consistently. The new optional `outputUserRoot` field on
`BazelQueryOptions` is the Maven path's hook for per-invocation server
isolation; the orchestrator (next commit) mkdtemp's a fresh path and
will reap the server via `bazel shutdown` + `rm -rf` on success and on
timeout, so timed-out servers no longer leak across CLI invocations.
Add `runBazelModShowMavenExtension`: invokes
bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
to enumerate Maven hubs directly from the rules_jvm_external extension
report, replacing the over-enumerating `dump_repo_mapping` surface on
the Maven path. `runBazelModShowVisibleRepos` is kept around for the
legacy PyPI extractor, which has not been rescoped yet.
Replace the Maven-side `buildProbeFor` (which emitted a kind-only
`kind("jvm_import rule|aar_import rule", @repo//:*)` query) with
`buildMavenProbeFor`, a lightweight `cquery '@<name>//... --output=label
--keep_going'` presence check whose result feeds the new tri-state
classifier in bazel-repo-discovery. Kind-only filtering missed
POM-only / native / AAR-without-aar_import artefacts and any future
rules_jvm_external rule shape; the metadata filter is now applied by
the per-repo extraction cquery (next layer), not by the probe.
Update `buildPypiProbeFor`'s return shape to include stderr so it
satisfies the new `RepoProbe` type contract. Move
`parseVisibleRepoCandidates` and the `ValidationResult` type into
bazel-pypi-discovery (their only remaining consumer); the Maven module
no longer carries dump_repo_mapping-shaped code.
Tests cover the new argv shapes for every spawn surface, the
outputUserRoot startup-flag placement (before subcommand), the
Maven probe argv (cquery + @repo//... + --output=label + --keep_going),
and the full result-triple propagation (code/stdout/stderr) that the
tri-state classifier needs.
`runMetadataCqueryForRepo` executes the per-repo extraction cquery and
returns a structured outcome (`ok` / `partial` / `timeout` / `empty` /
`error`) so the orchestrator can populate sidecar status without
custom error plumbing per call site. The cquery target expression is
the union of three predicates — `attr("tags", "\bmaven_coordinates=",
...)`, `attr("maven_coordinates", ".+", ...)`, and `attr("maven_url",
".+", ...)`. That matches rules_jvm_external's `jvm_import` /
`aar_import` shapes, Bazel-native `java_library` with direct
`maven_coordinates`, and POM-only / source-jar shapes that carry only
`maven_url`. Word-boundary `\b` in the tags predicate prevents matches
on values like `pre_maven_coordinates=fake`.
`parseCqueryJsonproto` is defensive about the jsonproto encoding:
dispatches on `attribute[].type`, accepts both camelCase
(`stringValue`, `stringListValue`) and snake_case (`string_value`,
`string_list_value`) payload keys, and tolerates both the Bazel 5+
envelope shape (`{ "results": [{ "target": {...} }] }`) and the older
per-line streamed shape. Coordinate extraction prefers the direct
`maven_coordinates` attribute; falls back to scanning `tags` for
`maven_coordinates=G:A:V`. Provenance lands in `sourceRepo` as
`<workspace-rel-path>:<repoName>` (or just `<repoName>` at the root),
so the orchestrator's dedup can attribute artifacts back to their
discovery site.
Timeout handling: spawn rejections with `timedOut` / `killed` /
`SIGTERM` / `SIGKILL` map to `status: 'timeout'`. The runner does NOT
delete the outputUserRoot — server lifecycle (reap via
`bazel shutdown` + `rm -rf`) is the orchestrator's concern so that a
single tempdir can hold multiple per-repo runs.
Also widen `ExtractedArtifact.ruleKind` from the literal
`'jvm_import' | 'aar_import'` union to `string`. The legacy text-format
parsers only ever set those two values, but the metadata cquery
returns whatever `ruleClass` Bazel reports (`java_library`,
`kt_jvm_import`, any future rules_jvm_external rule). Existing
consumers only read the field diagnostically; nothing else changes.
Tests cover the parser (envelope, per-line stream, snake_case
fallback, direct-vs-tag preference, missing-coordinate skip, empty
input), the argv builder (target expression union, startup-flag
placement, `--bazel-flag` placement, invocationFlags order), and the
runner's status classification including the spawn-timeout branch.
…thm in a tree walk
`extractBazelToMaven` now walks the scan root for every workspace
(MODULE.bazel / WORKSPACE / WORKSPACE.bazel) and runs the per-workspace
extraction algorithm in each one. Monorepos like rules_kotlin
(examples/<name>/MODULE.bazel) and projects with mobile sub-workspaces
(mobile/MODULE.bazel under a non-Bazel root) are no longer
silently dropped to the root-only path.
Per workspace:
1. Detect Bzlmod vs WORKSPACE mode.
2. Discover candidate Maven hubs:
- Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven,
parsed via parseShowExtensionOutput.
- WORKSPACE (or Bzlmod fallback): probe the conventional names
(maven, maven_install, maven_dev, unpinned_maven, maven_unpinned)
plus any customer-supplied extras via the tri-state classifier.
3. Per populated candidate: run the metadata cquery
(`attr("tags", "\bmaven_coordinates=", @<repo>//...)` ∪ direct
`maven_coordinates` / `maven_url` attrs) and accept the parsed
artefacts.
4. Aggregate, then dedup across workspaces by full Maven coordinate.
Server isolation is now invariant: every Bazel invocation runs under a
per-CLI-call --output_user_root=<tempdir>. On per-repo cquery timeout
the orchestrator reaps the server (`bazel shutdown`) and `rm -rf`'s the
tempdir, then mints a fresh one for subsequent repos — a single bad
hub no longer cascades into the rest of the run. The finally-block
cleanup reaps every tempdir that was minted, including the last one.
Sidecar `manifest-status.json` lands beside the synthesized
`maven_install.json`. Each entry records the repo's classified status
(ok / partial / timeout / empty / error), artifact count, and duration,
so the server-side can surface partial results to the customer. The
top-level `complete: false` flag fires iff any repo timed out.
Deleted: the unsorted_deps.json fast path (`extractFromOneRepo`,
`bazelExternalDir`, `isForceQueryFallbackEnabled` env knob) — the
metadata cquery returns the same GAVs the fast path used to recover,
without depending on bazel-out symlinks or generated artefacts.
Deleted: the lockfile merge (already done in a previous commit on this
branch); deleted: the kind-only probe and dump_repo_mapping enumeration.
The orchestrator's `ExtractBazelOptions` now accepts
`extraMavenRepoNames` (legacy WORKSPACE non-conventional hub names) and
`perRepoTimeoutMs` (per-repo cquery cap). The CLI flag wiring lands in
a sibling commit; existing call sites continue to pass the same fields
they did before.
Existing `extract_bazel_to_maven.test.mts` is pinned to the old
unsorted_deps fast path and is replaced wholesale in the next commit
(test layer).
…e pipeline The previous tests pinned the legacy unsorted_deps.json fast path, kind-only probes, and dump_repo_mapping enumeration. The new tests mock the orchestrator's three external collaborators — findWorkspaceRoots, runBazelModShowMavenExtension, runMetadataCqueryForRepo — and assert on the contract that matters: end-to-end Bzlmod and WORKSPACE-mode flows, the per-repo cquery loop, cross-workspace coordinate dedup, the timeout → re-mint loop, sidecar `manifest-status.json` shape, and `extraMavenRepoNames` threading. Pure-function `normalizeToMavenInstallJson` keeps a focused trio of unit tests (dedup, version-conflict, sha256-preservation). The fixture-driven .socket.facts.json non-emission assertion stays so the Maven-path-vs-facts-path invariant is exercised. Also patch the PyPI test mock: parseVisibleRepoCandidates moved from bazel-repo-discovery to bazel-pypi-discovery in a previous commit, so the test's vi.mock now mirrors the actual export surface. The probe fixture grows a `stderr` field to match the new RepoProbe contract.
…GNORED_DIRS `findWorkspaceRoots` no longer hardcodes the directory-prune set — callers pass `ignoreDirNames: ReadonlySet<string>` and `ignoreDirPrefixes: readonly string[]` via options. Neither defaults to anything; absent means no pruning. This keeps the walker decoupled from any particular ignore policy and avoids duplicating the codebase-wide `IGNORED_DIRS` list. `src/utils/glob.mts` exports `IGNORED_DIRS` so the orchestrator can compose it with Bazel-specific extras. The orchestrator's composed set: `IGNORED_DIRS` plus `.hg`, `.idea`, `.pnpm-store`, `.socket-auto-manifest`, `.svn`, `.vscode`; prefixes `bazel-` and `dist`. Also tighten `MAX_WALK_DEPTH` from 16 → 8. Deepest workspace marker observed across the surveyed OSS corpus is 9 (bazel-self test fixtures); deepest in realistic application code is 7 (checkmk's thirdparty layout). The cap gives one level of headroom over the realistic max while still guarding against pathological symlink loops that slipped past any prefix prune the caller supplied. Walker test rewritten against the new injected API: covers the no-prune-by-default case (`node_modules/MODULE.bazel` surfaces unless the caller ignores `node_modules`), injected name and prefix prunes, and the bazel-* symlink case under the prefix injection.
No consumer reads it today. The orchestrator still tracks per-repo timeouts to decide ExtractBazelResult.ok and to reap+remint the output_user_root, but no longer serialises the per-workspace / per-repo status report to disk.
Walker: - Lower MAX_WORKSPACE_ROOTS from 256 to 16; well above realistic monorepo counts, tighter guard against pathological inputs. Orchestrator: - Inject the workspace-walker prune policy via ExtractBazelOptions (`ignoreDirNames`, `ignoreDirPrefixes`) instead of hardcoding it inside the orchestrator. The CLI command now owns the policy and supplies the codebase-wide IGNORED_DIRS plus the Bazel-specific bits (`.socket-auto-manifest`, VCS/IDE dirs, `bazel-*` prefix). - Drop the `dist*` prefix from the prune policy — it's a JS/web convention, not Bazel-specific, and shouldn't be hardcoded. - Drop the Bzlmod-mode "defensive fallback" probe over the conventional Maven hub names. `bazel mod show_extension` is the authoritative source under Bzlmod; customer-supplied extras (`--bazel-maven-repo=`) still get probed. WORKSPACE mode keeps the probe path unchanged. The fallback had no real-world justification. - Hoist the per-workspace `buildQueryOpts` helper to module scope (eslint consistent-function-scoping). - Verbose-mode logging in `reapBazelServer` and `removeTempdir` catch blocks so the operator can see why a cleanup step failed instead of having it swallowed silently. Lint: - bazel-query-runner.mts: add missing `await` on three `return runBazelOneShot(...)` sites (@typescript-eslint/return-await). - bazel-repo-discovery.test.mts: hoist five inline `probe` closures to module-scope named consts. - extract_bazel_to_maven.test.mts: hoist `readManifest` and an inline mock factory to module scope; reorder type imports. - extract_bazel_to_pypi.mts: merge duplicate imports of `bazel-pypi-discovery.mts`. - extract_bazel_to_maven.mts: reorder type imports.
bec0ccf to
414a9a6
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 7 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: WORKSPACE custom hubs not found
- Added --bazel-maven-repo flag to cmd-manifest-bazel.mts (with socket.json defaults support) and wired it through as extraMavenRepoNames to both the direct CLI and auto-manifest pathways.
- ✅ Fixed: Empty show_extension skips fallback
- Changed showExtensionSucceeded to only be set true when parseShowExtensionOutput returns a non-empty hub list, so a zero-exit with no parsed hubs now correctly falls back to conventional name probing.
Or push these changes by commenting:
@cursor push 6019185b06
Preview (6019185b06)
diff --git a/src/commands/manifest/bazel/cmd-manifest-bazel.mts b/src/commands/manifest/bazel/cmd-manifest-bazel.mts
--- a/src/commands/manifest/bazel/cmd-manifest-bazel.mts
+++ b/src/commands/manifest/bazel/cmd-manifest-bazel.mts
@@ -51,6 +51,12 @@
description:
'Flags forwarded to every bazel invocation (single quoted string)',
},
+ bazelMavenRepo: {
+ type: 'string',
+ isMultiple: true,
+ description:
+ 'Extra Maven hub repo name(s) to probe; repeatable. Use for legacy WORKSPACE projects whose maven_install uses a non-conventional name.',
+ },
bazelOutputBase: {
type: 'string',
description: 'Bazel --output_base for read-only-cache CI environments',
@@ -202,7 +208,7 @@
sockJson?.defaults?.manifest?.bazel,
)
- const { ecosystem } = cli.flags
+ const { bazelMavenRepo, ecosystem } = cli.flags
let { bazel, bazelFlags, bazelOutputBase, bazelRc, out, verbose } = cli.flags
// Set defaults for any flag/arg that is not given. Check socket.json first.
@@ -260,6 +266,12 @@
}
}
+ // Compose extra Maven repo names from the CLI flag and socket.json defaults.
+ const extraMavenRepoNames: string[] = [
+ ...(Array.isArray(bazelMavenRepo) ? bazelMavenRepo : []),
+ ...(sockJson.defaults?.manifest?.bazel?.bazelMavenRepo ?? []),
+ ].filter(Boolean)
+
if (verbose) {
logger.group('- ', parentName, config.commandName, ':')
logger.group('- flags:', cli.flags)
@@ -318,6 +330,9 @@
bazelRc: bazelRc as string | undefined,
bin: bazel as string | undefined,
cwd,
+ extraMavenRepoNames: extraMavenRepoNames.length
+ ? extraMavenRepoNames
+ : undefined,
ignoreDirNames: BAZEL_WALKER_IGNORE_DIR_NAMES,
ignoreDirPrefixes: BAZEL_WALKER_IGNORE_DIR_PREFIXES,
out: out as string,
diff --git a/src/commands/manifest/bazel/extract_bazel_to_maven.mts b/src/commands/manifest/bazel/extract_bazel_to_maven.mts
--- a/src/commands/manifest/bazel/extract_bazel_to_maven.mts
+++ b/src/commands/manifest/bazel/extract_bazel_to_maven.mts
@@ -260,8 +260,12 @@
if (mode.bzlmod) {
const extResult = await runBazelModShowMavenExtension(queryOpts)
if (extResult.code === 0) {
- candidates.push(...parseShowExtensionOutput(extResult.stdout))
- showExtensionSucceeded = true
+ const parsedHubs = parseShowExtensionOutput(extResult.stdout)
+ candidates.push(...parsedHubs)
+ // Only consider successful when hubs were actually parsed; a zero
+ // exit with no recognized hub names (format mismatch, empty report)
+ // should still fall back to conventional probing.
+ showExtensionSucceeded = parsedHubs.length > 0
if (verbose) {
logger.log(
`[VERBOSE] workspace ${workspaceRoot}: show_extension yielded`,
diff --git a/src/commands/manifest/generate_auto_manifest.mts b/src/commands/manifest/generate_auto_manifest.mts
--- a/src/commands/manifest/generate_auto_manifest.mts
+++ b/src/commands/manifest/generate_auto_manifest.mts
@@ -129,6 +129,7 @@
bazelRc: bazelConfig?.bazelRc,
bin: bazelConfig?.bazel ?? bazelConfig?.bin,
cwd,
+ extraMavenRepoNames: bazelConfig?.bazelMavenRepo,
out: bazelConfig?.out ?? cwd,
outLayout: 'flat',
verbose: Boolean(bazelConfig?.verbose) || verbose,
diff --git a/src/utils/socket-json.mts b/src/utils/socket-json.mts
--- a/src/utils/socket-json.mts
+++ b/src/utils/socket-json.mts
@@ -42,6 +42,7 @@
bazel?: {
bazel?: string | undefined
bazelFlags?: string | undefined
+ bazelMavenRepo?: string[] | undefined
bazelOutputBase?: string | undefined
bazelRc?: string | undefined
bin?: string | undefinedYou can send follow-ups to the cloud agent here.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| showExtensionSucceeded | ||
| ? extras | ||
| : [...CONVENTIONAL_MAVEN_REPO_NAMES, ...extras] | ||
| ).filter(name => !seen.has(name)) |
There was a problem hiding this comment.
WORKSPACE custom hubs not found
High Severity
Legacy WORKSPACE projects whose maven_install uses a non-conventional name are no longer discovered: candidate discovery only probes fixed hub names plus extraMavenRepoNames, and the CLI never passes that option. Repos that the old Starlark scan found (e.g. maven_legacy_app) are skipped, so extraction can report no Maven ecosystem despite a configured hub.
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| const ruleKind = rule.ruleClass ?? rule.rule_class ?? 'unknown' | ||
| out.push({ | ||
| deps: [], | ||
| mavenCoordinates: extracted.coord, |
There was a problem hiding this comment.
Lockfile dependency graph always empty
Medium Severity
Every artifact from metadata cquery is emitted with an empty deps list, and cquery does not request rule deps attributes. normalizeToMavenInstallJson therefore writes an empty dependencies map even when Bazel rules declare transitive Maven edges, unlike the prior probe/unsorted_deps.json path.
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| buildMetadataCqueryExpr(repoName), | ||
| '--output=jsonproto', | ||
| '--proto:output_rule_attrs=tags,maven_coordinates,maven_url', | ||
| '--keep_going', |
There was a problem hiding this comment.
Jar shasums no longer extracted
Medium Severity
Metadata cquery never surfaces maven_sha256 (typically in tags), so synthesized artifacts lack checksums and normalizeToMavenInstallJson writes empty shasums.jar entries. The old build-output parser populated jar digests from rule attributes and tags.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| showExtensionSucceeded | ||
| ? extras | ||
| : [...CONVENTIONAL_MAVEN_REPO_NAMES, ...extras] | ||
| ).filter(name => !seen.has(name)) |
There was a problem hiding this comment.
Empty show_extension skips fallback
High Severity
On Bzlmod workspaces, when bazel mod show_extension exits 0 but parseShowExtensionOutput returns no hub names, discovery still treats the command as successful and only probes extraMavenRepoNames. The conventional @maven hub probe list is skipped, so Maven can be missed despite a healthy Bazel exit.
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| artifactCount: deduped.length, | ||
| manifestPath, | ||
| ok: true, | ||
| ok: !anyTimeout, |
There was a problem hiding this comment.
Success logged despite incomplete run
Medium Severity
After a per-repo cquery timeout, the orchestrator returns ok: false but still calls logger.success whenever any artifacts were written. Callers see a success message while the result indicates an incomplete extraction.
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| `attr("tags", "\\bmaven_coordinates=", ${r})`, | ||
| `attr("maven_coordinates", ".+", ${r})`, | ||
| `attr("maven_url", ".+", ${r})`, | ||
| ].join(' union ') |
There was a problem hiding this comment.
maven_url-only rules dropped
Medium Severity
The metadata cquery union includes rules that only set maven_url, but extractMavenCoordinate returns nothing without a coordinate. Those targets are analyzed and then discarded, so POM-only or coordinate-less Maven shapes never reach maven_install.json.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
| `[VERBOSE] workspace walker: hit MAX_WORKSPACE_ROOTS cap (${MAX_WORKSPACE_ROOTS}); truncating walk`, | ||
| ) | ||
| } | ||
| break |
There was a problem hiding this comment.
Workspace cap truncates silently
Medium Severity
findWorkspaceRoots stops after 16 workspace markers with only a verbose log. Nested Bazel modules beyond that cap are never scanned, so Maven hubs in omitted trees are invisible with no failure outcome.
Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.
…in the cquery The per-repo metadata cquery only requested coordinate attrs, so every artifact shipped with an empty deps list and the synthesized maven_install.json carried no dependency edges. Request deps/exports/ runtime_deps and resolve those label edges into versionless Maven coordinates at parse time, while the repo name is still in scope. - Read LABEL_LIST attributes (Bazel serializes label lists into the string-list payload); a STRING_LIST-only reader returned nothing and left the graph empty. - Resolve each edge against this repo's own targets by full label, with a unique-suffix fallback. Edges to hub-prefixed targets that resolve to nothing (missing or ambiguous) are reported as unresolvedLabels and flip the repo status to partial — a scan must not silently drop edges it knows it lost. Resolution happens with the repo name in scope so no malformed provenance label is ever built. - Normalize coordinates by stripping only the trailing version, preserving packaging/classifier (g:a, g:a:aar, g:a:jar:classifier) to match the server parser's coordinate keying. - Drop the maven_url-only selector: those rules carry no coordinate, so selecting them only to discard them downstream was wasted analysis. - normalizeToMavenInstallJson now consumes resolved coordinate deps directly instead of re-resolving labels.
Two correctness follow-ups from review of the graph-extraction change: - dedupArtifactsByCoord kept only the first artifact per full coordinate, which silently dropped the resolved deps of later occurrences. Now that edges are populated, the same coordinate emitted by a hub in two workspaces can carry different resolved edge sets — union them instead of discarding all but the first. - The already-a-coordinate dep fallback treated any ':'-bearing non-@ label as a Maven coordinate; exclude '//'-prefixed package-relative Bazel labels so a //pkg:thing edge isn't mis-recorded as a dependency.
…dead parser normalizeToMavenInstallJson now builds the manifest in two phases so the emitted graph survives the server parser, which rejects malformed coordinates and edges referencing unlisted artifacts (and can abort after enough errors): - Phase 1 validates each versionless key (2-4 non-empty segments) and a non-empty version before accepting an artifact; a coordinate like 'g:a:' strips to a valid-shaped key but an empty version, so require both. - Phase 2 emits only edges whose source and target are both emitted keys. - Dropped artifacts and pruned edges are returned (and warned) rather than silently lost post-upload. Also drop the dead shasums machinery (the server hydrates checksums from precrawl and never reads them) and delete the now-unreachable bazel-build-parser module: the legacy bazel-query/unsorted_deps text parsers were superseded by the metadata cquery. The ExtractedArtifact type moves into bazel-cquery.mts (its mavenSha256/mavenUrl fields go with the dead code).
…arser The query-output text/JSON fixtures were consumed only by the now-deleted bazel-build-parser test; nothing references them anymore.
…status Replace the single aggregated maven_install.json with one manifest per (workspace, hub), written best-effort so a single wedged hub never discards the manifests every other hub produced. Paths mirror the workspace tree: <manifestDir>/<relPath>/<name>.json, where <name> is maven_install.json for a hub named 'maven' else <hub>_maven_install.json (the server walker's glob). Per-hub output also dissolves the cross-workspace version-conflict hard-fail structurally — each hub is coursier-resolved to one version per g:a, so the same g:a at different versions in two workspaces lands in separate files. Replace the overloaded ok boolean with a discriminated run status — complete / partial / noEcosystem / hardFailure — because best-effort-per-hub produces outcomes a single bool conflates. A hub that times out, errors, drops unresolved edges, or has malformed coordinates marks the run partial (manifests still written and worth uploading); zero manifests with an ecosystem present is a hardFailure; absent ecosystem is noEcosystem. logger.success only fires on complete; partial warns honestly. Thread the status through both callers: generate_auto_manifest pushes manifests on complete and partial (warning on partial) and only throws on hardFailure; cmd-manifest-bazel's evaluateEcosystemOutcomes gates on manifestPaths length, counts partial as produced output, preserves the per-mode noEcosystem branching, and normalizes the single-manifestPath PyPI result into the shared status vocabulary.
Two follow-ups from review of the per-hub change: - Wrap the per-hub manifest write in try/catch. A filesystem error (EACCES/ENOSPC, a path collision) thrown from writeHubManifest escaped the loop to the outer catch and returned hardFailure with zero manifestPaths, discarding the manifests other hubs had already written — the exact best-effort invariant the per-hub design promises. Now a write failure warns, counts the hub failed, and continues. - Treat a cquery 'partial' status with no unresolved labels (a non-zero --keep_going exit that still yielded a usable subset) as a partial hub. Previously only unresolved labels flipped the hub partial, so this case could report the whole run complete on a known-incomplete graph.
The maven module extension generates a hub for every module that uses it,
so show_extension returned the root's own maven.install hub(s) plus the
rulesets' internal hubs (rules_jvm_external_deps, stardoc_maven, …) — which
got folded into the user's SBOM and, with per-hub output, each became its
own spurious manifest.
parseShowExtensionOutput now returns structured { name, importers } entries
(merging importers when a repo appears on multiple lines) instead of bare
names. The orchestrator keeps only hubs imported by <root> and drops the
rest, naming the dropped hub and its importing module under --verbose.
Gate the conventional-name probe fallback on the KEPT hub count, not the raw
parse count: a report listing only transitive ruleset hubs (all filtered
out) now correctly falls through to probing so a root @maven isn't missed.
Also pass --extension_usages=<root> to the maven show_extension call as a
belt-and-suspenders output reducer (the importers-filter is what guarantees
correctness).
…le comments The extraMavenRepoNames option documented a --bazel-maven-repo flag that was never built; the field was dead plumbing. Remove it from ExtractBazelOptions, drop the extras read and its pass-through into discoverCandidatesForWorkspace (collapsing the probe ternary to conventional-names-or-nothing), and delete the comments claiming a flag wires it in. Custom-bazel-flag passthrough is a separate feature that will re-introduce the param when it actually wires a flag. Also fix stale walker comments that claimed a 'dist' prefix is pruned — the real Bazel prune prefix list is just ['bazel-'].
…covery header Follow-up: the discovery module docstring still described the never-built --bazel-maven-repo flag feeding the WORKSPACE probe list. No such pass-through exists; the probe list is the fixed conventional names.
…orchestrator The --auto-manifest caller never passed ignoreDirNames/ignoreDirPrefixes, so its walk defaulted to empty prune sets and descended node_modules/.git and vendored trees — a stray WORKSPACE fixture under node_modules would be picked up as a workspace (finding A). Move the default Bazel prune policy into a shared constant the orchestrator applies unconditionally; callers now EXTEND it via ignoreDirNames/ignoreDirPrefixes rather than being the sole source of it. Both entry points (explicit command and auto-manifest) now prune identically, and the policy can't be forgotten again. Drop the now-redundant per-command constants from cmd-manifest-bazel.
The walk had two silent-drop bounds: a 16-root early-stop and a MAX_WALK_DEPTH=8 cap that continued past anything deeper with no log. The depth cap was the more dangerous — its own comment admitted the deepest corpus marker is 9, i.e. the cap sat below an observed depth and silently dropped a first-party module that exists today, taking its Maven hub with it. Replace both with a single visited-directory budget (MAX_WALK_DIRS) as the only pathological-input / symlink-loop guard; remove the depth cap entirely. Walk children in deterministic sorted order, collect all roots found within the budget, sort, then cap to MAX_WORKSPACE_ROOTS. Both the cap and a budget exhaustion now logger.warn UNCONDITIONALLY (exact dropped count when the full tree was walked, '≥' when the budget cut it short), and verbose lists the dropped paths — truncation is never silent.
The fixture comment claimed Bazel callers pass a 'dist' prune prefix; the real production default is just ['bazel-']. Reword to make clear 'dist' is an arbitrary extra prefix used only to exercise the walker's generic multi-prefix pruning, not a production policy.
Round out the verbose diagnostics so every per-hub step is observable: log the start of each metadata cquery (repo + timeout) before it runs, complementing the existing kept/dropped-hub, per-hub status/path, unresolved-edge, best-effort-failure, truncation, and final-summary traces. A user diagnosing a thin or partial SBOM can now follow the full discovery-to-write sequence.
…est-free pypiOutcome carried the detected-but-empty PyPI stub path into manifestPaths on a hardFailure outcome, violating the invariant that a non-success outcome produced no usable output. Only a complete PyPI run now carries its manifest path; hardFailure and noEcosystem carry none.
…; document hub limitation Cross-agent review round 1: - The registry spawn rejects on a non-zero exit, so a real --keep_going cquery that exits non-zero while still emitting a usable subset landed in the catch block, which hardcoded status 'error' and discarded the subset (the hub was skipped). Classify the catch outcome via classifyCqueryOutcome so a parsed subset becomes 'partial' and is written best-effort; only a genuinely empty failure stays 'error'. Timeout remains distinct. - Document the accepted custom-Maven-hub limitation in the command help: a non-conventional hub name show_extension doesn't enumerate isn't discovered yet (the naming flag is a planned follow-up) — a known, not silent, gap.



Summary
Rewrites
socket manifest bazel's Maven extraction pipeline so it (a)discovers every workspace under the scan root, not just
cwd, and (b)relies on Bazel-native commands for repo enumeration instead of static
Starlark regex parsing.
The former version was a bit over engineered, it was doing the following things wrong:
maven_install.json.After some more iteration I have landed on a better flow that also more easlily extends to other languages:
The past version was the result my (shrinking) bazel ignorance and maybe trusting the AI a little bit too much :)
Note
Medium Risk
Large change to how SBOMs are produced (many Bazel subprocesses, timeouts, partial results); mitigated by extensive tests and isolated output_user_root, but wrong hub/cquery behavior could miss or skew dependency lists.
Overview
Rewrites
socket manifest bazelMaven extraction so it scans every Bazel workspace under the repo (not onlycwd) and drives discovery/extraction through Bazel commands instead of parsing Starlark or reusingunsorted_deps.json/ cachedbazel queryprobe output.Discovery: A new workspace walker finds roots with
MODULE.bazel/WORKSPACE/WORKSPACE.bazel(caller-supplied ignore rules; CLI wiresIGNORED_DIRSplusbazel-*). Per workspace, Maven hubs come frombazel mod show_extensiononrules_jvm_external’s maven extension (Bzlmod), with conventional-name cquery probes and a populated / empty / not-defined classifier for legacy WORKSPACE mode and extras from--bazel-maven-repo=.Extraction: Each hub gets a metadata
cquery(jsonproto, union onmaven_coordinatestags/attrs/maven_url), parsed in newbazel-cquery.mts. Artifacts aggregate across workspaces and dedupe by full Maven coordinate before writing synthesizedmaven_install.json. Bazel runs use a per-invocation--output_user_root, with shutdown + tempdir cleanup and fresh roots after per-repo timeouts so hung servers don’t block the scan.Supporting changes: Maven repo discovery module shrinks to
show_extensionparsing + probe classification; query runner addsrunBazelModShowMavenExtension, shared startup flags, andbuildMavenProbeFor; PyPI keepsparseVisibleRepoCandidateslocally;ExtractedArtifact.ruleKindis a string for arbitrary rule classes from cquery.Reviewed by Cursor Bugbot for commit 414a9a6. Configure here.