Support tracing RAG retrieval in Spring AI 1.x plugin#808
Conversation
# Conflicts: # apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts: # apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts: # apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreInterceptor.java
# Conflicts: # apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/AbstractObservationVectorStoreConstructorInterceptor.java # apm-sniffer/apm-sdk-plugin/spring-plugins/spring-ai-1.x-plugin/src/main/java/org/apache/skywalking/apm/plugin/spring/ai/v1/define/AbstractObservationVectorStoreInstrumentation.java
wu-sheng
left a comment
There was a problem hiding this comment.
Thanks for the contribution, and nice work — CI is green and the vector-store retrieval tracing works end-to-end (I confirmed the embedding model name is captured through the enhanced abstract-base constructor, so gen_ai.request.model is populated as expected). A few things I'd like addressed before merge:
1. Avoid reflection on private fields
AbstractObservationVectorStoreConstructorInterceptor#resolveModelFromOptionsField() reads the options/defaultOptions fields via getDeclaredField(...) + setAccessible(true). We avoid reflective access to private fields of third-party classes in plugins — it silently breaks across Spring AI versions if a field is renamed (the catch (Throwable) swallows it and the model tag just disappears, with no signal). Please resolve the model name via public API instead, e.g. EmbeddingRequest.getOptions().getModel() captured at the embed call, or via the builder's public getEmbeddingModel(). If neither is clean for a given impl, prefer leaving gen_ai.request.model unset over private-field reflection.
2. ErrorTypeResolver value scheme is inconsistent
resolve() mixes a fully-qualified class name (UnknownHostException.class.getName()) with custom low-cardinality tokens ("timeout", "server_certificate_invalid") and an "_OTHER" catch-all. The _OTHER default discards classification for everything that isn't one of the three special cases (connection refused, 4xx/5xx from the provider, rate-limit, auth failures, …). Please pick one scheme: keep the tokens for the cases you explicitly handle, but for the fallback use throwable.getClass().getName() instead of _OTHER so unrecognized errors stay diagnosable. (Also, UnknownHostException is currently the odd one out emitting an FQN while the rest are tokens.)
3. Span lifecycle robustness for non-SimpleVectorStore stores
The instrumentation matches the whole AbstractObservationVectorStore hierarchy, so the interceptor runs for every subclass. In beforeMethod, createObservationContextBuilder(...).build() and resolveDataSourceId() run before createExitSpan(...). If some store's builder throws, the agent wrapper swallows it but afterMethod still executes — and because the parent span is active, it would stopSpan() / tag the parent span instead of the (never-created) retrieval span, corrupting the parent segment. SimpleVectorStore is safe (its builder is a pure factory), but please make it robust for other stores: create the exit span first (compute dataSourceId afterward), or wrap the pre-span work in try/catch and fall back to objInst.getClass().getSimpleName().
4. Update CHANGES.md and docs
Please add a CHANGES.md entry for the new RAG / vector-store retrieval tracing, the three new GenAI tags (gen_ai.data_source.id, gen_ai.retrieval.documents, gen_ai.retrieval.query.text), and the error.type resolution, plus a short note in the spring-ai plugin docs. (No new component id or UI logo is needed — reusing the spring-ai component / id 178 is correct.)
wu-sheng
left a comment
There was a problem hiding this comment.
A few notes on the new vector-store retrieval tracing (inline below).
The main one is the retrieval query text being captured by default — it bypasses the plugin's existing default-off privacy posture for prompt/completion content, so a user who relies on the defaults to avoid logging user input would still get the RAG query recorded. The rest are small robustness/consistency cleanups.
Otherwise the change looks good — the abstract-base-class constructor interception for resolving the embedding model name is a nice touch.
| } | ||
| try { | ||
| if (ret instanceof List<?>) { | ||
| Tags.GEN_AI_RETRIEVAL_DOCUMENTS.set(ContextManager.activeSpan(), toDocumentsJson((List<?>) ret)); |
There was a problem hiding this comment.
gen_ai.retrieval.documents is likewise emitted unconditionally, unlike COLLECT_OUTPUT_MESSAGES which gates analogous response content (default false).
Lower risk than the query text since only id + score are serialized (not document bodies), but for consistency consider a COLLECT_RETRIEVAL_DOCUMENTS = false flag. Note the JSON grows with topK and has no size cap.
|
|
||
| @Override | ||
| public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments, Class<?>[] argumentsTypes, Object ret) { | ||
| EmbeddingResponseMetadata metadata = ((EmbeddingResponse) ret).getMetadata(); |
There was a problem hiding this comment.
((EmbeddingResponse) ret).getMetadata() runs with no guard on ret. The agent calls afterMethod from a finally, so it also runs when call(EmbeddingRequest) threw — and then ret is null, NPE-ing on every embedding failure (caught + logged by the framework, but it spams ERROR logs).
ChatModelCallInterceptor.afterMethod guards with if (!(ret instanceof ChatResponse)) { return ret; } — suggest the same pattern here. The metadata == null check below then becomes dead, since getMetadata() never returns null.
|
A few notes on the latest fix (commit 1. (The sibling 2. 3. |
CHANGESlog.