Skip to content

Support tracing RAG retrieval in Spring AI 1.x plugin#808

Open
peachisai wants to merge 24 commits into
apache:mainfrom
peachisai:uat
Open

Support tracing RAG retrieval in Spring AI 1.x plugin#808
peachisai wants to merge 24 commits into
apache:mainfrom
peachisai:uat

Conversation

@peachisai
Copy link
Copy Markdown
Member

@peachisai peachisai commented May 31, 2026

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.
image

peachisai added 16 commits May 30, 2026 08:59
# Conflicts:
#	apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts:
#	apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts:
#	apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts:
#	apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreConstructorInterceptor.java
#	apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/define/AbstractObservationVectorStoreInstrumentation.java
@wu-sheng wu-sheng added this to the 9.7.0 milestone May 31, 2026
@wu-sheng wu-sheng added the enhancement New feature or request label May 31, 2026
Copy link
Copy Markdown
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, and nice work — CI is green and the vector-store retrieval tracing works end-to-end (I confirmed the embedding model name is captured through the enhanced abstract-base constructor, so gen_ai.request.model is populated as expected). A few things I'd like addressed before merge:

1. Avoid reflection on private fields

AbstractObservationVectorStoreConstructorInterceptor#resolveModelFromOptionsField() reads the options/defaultOptions fields via getDeclaredField(...) + setAccessible(true). We avoid reflective access to private fields of third-party classes in plugins — it silently breaks across Spring AI versions if a field is renamed (the catch (Throwable) swallows it and the model tag just disappears, with no signal). Please resolve the model name via public API instead, e.g. EmbeddingRequest.getOptions().getModel() captured at the embed call, or via the builder's public getEmbeddingModel(). If neither is clean for a given impl, prefer leaving gen_ai.request.model unset over private-field reflection.

2. ErrorTypeResolver value scheme is inconsistent

resolve() mixes a fully-qualified class name (UnknownHostException.class.getName()) with custom low-cardinality tokens ("timeout", "server_certificate_invalid") and an "_OTHER" catch-all. The _OTHER default discards classification for everything that isn't one of the three special cases (connection refused, 4xx/5xx from the provider, rate-limit, auth failures, …). Please pick one scheme: keep the tokens for the cases you explicitly handle, but for the fallback use throwable.getClass().getName() instead of _OTHER so unrecognized errors stay diagnosable. (Also, UnknownHostException is currently the odd one out emitting an FQN while the rest are tokens.)

3. Span lifecycle robustness for non-SimpleVectorStore stores

The instrumentation matches the whole AbstractObservationVectorStore hierarchy, so the interceptor runs for every subclass. In beforeMethod, createObservationContextBuilder(...).build() and resolveDataSourceId() run before createExitSpan(...). If some store's builder throws, the agent wrapper swallows it but afterMethod still executes — and because the parent span is active, it would stopSpan() / tag the parent span instead of the (never-created) retrieval span, corrupting the parent segment. SimpleVectorStore is safe (its builder is a pure factory), but please make it robust for other stores: create the exit span first (compute dataSourceId afterward), or wrap the pre-span work in try/catch and fall back to objInst.getClass().getSimpleName().

4. Update CHANGES.md and docs

Please add a CHANGES.md entry for the new RAG / vector-store retrieval tracing, the three new GenAI tags (gen_ai.data_source.id, gen_ai.retrieval.documents, gen_ai.retrieval.query.text), and the error.type resolution, plus a short note in the spring-ai plugin docs. (No new component id or UI logo is needed — reusing the spring-ai component / id 178 is correct.)

@peachisai peachisai requested a review from wu-sheng June 2, 2026 12:33
Copy link
Copy Markdown
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes on the new vector-store retrieval tracing (inline below).

The main one is the retrieval query text being captured by default — it bypasses the plugin's existing default-off privacy posture for prompt/completion content, so a user who relies on the defaults to avoid logging user input would still get the RAG query recorded. The rest are small robustness/consistency cleanups.

Otherwise the change looks good — the abstract-base-class constructor interception for resolving the embedding model name is a nice touch.

}
try {
if (ret instanceof List<?>) {
Tags.GEN_AI_RETRIEVAL_DOCUMENTS.set(ContextManager.activeSpan(), toDocumentsJson((List<?>) ret));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gen_ai.retrieval.documents is likewise emitted unconditionally, unlike COLLECT_OUTPUT_MESSAGES which gates analogous response content (default false).

Lower risk than the query text since only id + score are serialized (not document bodies), but for consistency consider a COLLECT_RETRIEVAL_DOCUMENTS = false flag. Note the JSON grows with topK and has no size cap.


@Override
public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, Object ret) {
EmbeddingResponseMetadata metadata = ((EmbeddingResponse) ret).getMetadata();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

((EmbeddingResponse) ret).getMetadata() runs with no guard on ret. The agent calls afterMethod from a finally, so it also runs when call(EmbeddingRequest) threw — and then ret is null, NPE-ing on every embedding failure (caught + logged by the framework, but it spams ERROR logs).

ChatModelCallInterceptor.afterMethod guards with if (!(ret instanceof ChatResponse)) { return ret; } — suggest the same pattern here. The metadata == null check below then becomes dead, since getMetadata() never returns null.

@wu-sheng
Copy link
Copy Markdown
Member

wu-sheng commented Jun 2, 2026

A few notes on the latest fix (commit 2219fe7e8d):

1. apm-sniffer/config/agent.config:371 — the query-length limit can't actually be configured.
ConfigInitializer derives the property key from the field name (Plugin.SpringAi.RETRIEVAL_QUERY_LENGTH_LIMITplugin.springai.retrieval_query_length_limit) and only binds when that exact key is present. The key here is plugin.springai.retrieval_query, which matches no field, so RETRIEVAL_QUERY_LENGTH_LIMIT stays at the hard-coded 1024 and can't be changed via agent.config or the SW_... env var. Suggest renaming to:

plugin.springai.retrieval_query_length_limit=${SW_PLUGIN_SPRINGAI_RETRIEVAL_QUERY_LENGTH_LIMIT:1024}

(The sibling collect_retrieval_query / collect_retrieval_documents keys are fine.)

2. SpringAiPluginConfig.java:88 — nit: double space before = in COLLECT_RETRIEVAL_DOCUMENTS = false.

3. AbstractObservationVectorStoreInterceptor.java:82-86 — nit: the truncation guard limit > 0 means a value of 0 also collects the full query, i.e. 0 behaves the same as the documented negative = no limit. Fine if intentional (it's consistent with the chat interceptors), but 0 reads like it should mean "collect nothing".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants