feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
delthas wants to merge 4 commits into development/9.3
Codecov Report
❌ Patch coverage is
Additional details and impacted files
... and 3 files with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.3    #6140     +/-   ##
===================================================
- Coverage            84.62%   83.81%   -0.82%
===================================================
  Files                  206      209      +3
  Lines                13322    13462    +140
===================================================
+ Hits                 11274    11283      +9
- Misses                2048     2179    +131
Force-pushed: 9c93a1d to 97e63fc
Force-pushed: 97e63fc to f978280
Force-pushed: f978280 to 06eea4e
Force-pushed: 06eea4e to 75c3afb
Force-pushed: 28598a8 to ba626f8
Force-pushed: 29b84e2 to 2aa4eef
…ring
Behind ENABLE_OTEL=true:
- NodeSDK with OTLP/HTTP exporter and ParentBased + ratio sampler
so we honor upstream NGINX/Beyla sampling decisions instead of
re-sampling them away.
- HTTP instrumentation with two hooks:
- ignoreIncomingRequestHook drops spans on probe / scrape paths
(/live, /ready, /_/healthcheck, /_/healthcheck/deep, /metrics)
and OPTIONS preflight.
- requestHook strips traceparent/tracestate on outbound requests
to hosts not in TRUSTED_HOSTS, so trace IDs do not leak to
external destinations. The client span is preserved (we still
observe the call) and tagged scality.trace.suppressed=true.
- buildTrustedHosts derives the allowlist from cloudserver Config
(vaultd, dataClient, metadataClient, bucketd, KMIP, KMS, scuba,
utapi, mongodb, hdclient/sproxyd connectors from locationConfig,
PUSH/MANAGEMENT_ENDPOINT env vars, plus loopback). A unit test
asserts every Config host shape is covered so the derivation
stays honest as new backends land.
- shutdownOtel() helper for the server's cleanUp() to await the
exporter flush before process.exit so in-flight traces are not
lost on SIGTERM.
- mongodb auto-instrumentation tuned for low cardinality; ioredis
enabled with requireParentSpan; fs / redis (v2/v3/v4) / aws-sdk
disabled.
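The two HTTP hooks described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the contents of `TRUSTED_HOSTS` are hypothetical (the PR derives the real list from cloudserver Config), and the helper names `isTrustedHost`, `shouldIgnoreIncoming` and `stripTraceHeadersIfUntrusted` are made up for the example. The hooks only rely on `method`/`url` of the incoming request and on `getHeader()`/`removeHeader()` of the outgoing one.

```javascript
'use strict';

// Probe / scrape paths listed in the commit message.
const IGNORED_PATHS = new Set([
    '/live', '/ready', '/_/healthcheck', '/_/healthcheck/deep', '/metrics',
]);

// Hypothetical allowlist; the real one is built by buildTrustedHosts.
const TRUSTED_HOSTS = new Set(['localhost', '127.0.0.1', 'vaultd.internal']);

// Candidate for ignoreIncomingRequestHook: true means "do not create a span".
function shouldIgnoreIncoming(req) {
    if (req.method === 'OPTIONS') {
        return true;
    }
    // Compare the path only, without the query string.
    const path = (req.url || '').split('?')[0];
    return IGNORED_PATHS.has(path);
}

// True when W3C trace headers may be sent to `host` ("name" or "name:port").
function isTrustedHost(host, trusted = TRUSTED_HOSTS) {
    if (typeof host !== 'string' || host.length === 0) {
        return false;
    }
    const hostname = host.replace(/:\d+$/, '').toLowerCase();
    return trusted.has(hostname);
}

// Candidate for requestHook on outbound requests: keeps the client span,
// removes the propagation headers, and tags the span as suppressed.
// Returns true when the headers were stripped.
function stripTraceHeadersIfUntrusted(span, request, trusted = TRUSTED_HOSTS) {
    if (isTrustedHost(request.getHeader('host'), trusted)) {
        return false;
    }
    request.removeHeader('traceparent');
    request.removeHeader('tracestate');
    span.setAttribute('scality.trace.suppressed', true);
    return true;
}
```

Note that the port-stripping regex assumes bracketed IPv6 literals in the Host header; a production allowlist would likely need more careful host parsing.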
When ENABLE_OTEL is unset the SDK and @opentelemetry/* packages are
not loaded at all - zero overhead off the OTEL path.
Issue: CLDSRV-884
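The "zero overhead when unset" behaviour amounts to never reaching the `require` of the OTEL module unless the flag is set. A minimal sketch, assuming a hypothetical `initTracing` helper and a caller-supplied `loader` (neither name comes from this PR):

```javascript
'use strict';

// Calls `loader` only when ENABLE_OTEL is exactly "true". When the flag is
// unset, the loader (and therefore any @opentelemetry/* require inside it)
// is never evaluated.
function initTracing(env, loader) {
    if (env.ENABLE_OTEL !== 'true') {
        return null;
    }
    return loader();
}
```

A caller might use it as `initTracing(process.env, () => require('./lib/otel'))`, where the relative path is an assumption about the layout.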
Force-pushed: 2aa4eef to 6e0d08c
    "@opentelemetry/instrumentation-http": "~0.216.0",
    "@opentelemetry/instrumentation-ioredis": "~0.64.0",
    "@opentelemetry/instrumentation-mongodb": "~0.69.0",
    "@opentelemetry/resources": "^2.7.0",
Arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a tag. Git-based deps must pin to a tag to prevent unpredictable breakage when the branch moves. This should be updated to a tagged release before merging.
— Claude Code
Wire shutdownOtel() into the server's cleanUp() chain between closing HTTP servers and process.exit(0). Without this, async sdk.shutdown() fired by signal handlers can race the exit and lose buffered spans for whatever was still in flight at SIGTERM time.

Inbound traceparent extraction is intentionally NOT done here: @opentelemetry/instrumentation-http already calls propagation.extract on every incoming request, creates a server span as a child of the remote parent, and sets that server span as the active context. A manual extract on top of that would replace the active context with the (non-recording) remote parent and demote downstream api spans to siblings - rather than children - of the HTTP server span, breaking the trace hierarchy in exactly the distributed-tracing scenarios the manual block was meant to support.

Issue: CLDSRV-884
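The ordering described in this commit (close servers, await the flush, then exit) could look roughly like the sketch below. The `sdk` and `exit` parameters and the timeout guard are assumptions added for illustration; the PR's actual shutdownOtel() signature is not shown on this page. The only SDK call relied on is `sdk.shutdown()` returning a promise.

```javascript
'use strict';

// Awaits sdk.shutdown(), but gives up after timeoutMs so that a stuck
// exporter cannot block process exit. The timeout is an assumption of
// this sketch, not something the commit message mentions.
async function shutdownOtel(sdk, timeoutMs = 5000) {
    if (!sdk) {
        return 'disabled';
    }
    let timer;
    const timeout = new Promise(resolve => {
        timer = setTimeout(() => resolve('timeout'), timeoutMs);
    });
    try {
        return await Promise.race([
            sdk.shutdown().then(() => 'flushed'),
            timeout,
        ]);
    } catch (err) {
        // A failed flush should not prevent the process from exiting.
        return 'error';
    } finally {
        clearTimeout(timer);
    }
}

// Close every HTTP server, flush traces, and only then exit.
async function cleanUp(servers, sdk, exit = process.exit) {
    await Promise.all(servers.map(
        server => new Promise(resolve => server.close(() => resolve()))));
    await shutdownOtel(sdk);
    exit(0);
}
```

Awaiting the flush inside cleanUp(), rather than firing sdk.shutdown() from a separate signal handler, is what removes the race with process.exit.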
Add lib/instrumentation/simple.js exporting instrumentApiMethod, a wrapper that surrounds an S3 handler invocation with an OTEL span named api.<methodName>. The span owns the entire handler execution (auth, body parsing, metadata I/O, data path, finalizers) and becomes the parent for any auto-instrumentation spans (HTTP, MongoDB, ioredis) that fire underneath.

Span name is the handler name verbatim - objectGetACL stays distinct from objectGet, objectPutTagging stays distinct from objectPut. ~70 handlers means ~70 distinct span names total, well within trace backend limits, and operators can tell variants apart without reading attributes.

api.js applies the wrapper to every function-valued key in the api object except for the dispatcher (callApiMethod) and pure helpers (checkAuthResults, handleAuthorizationResults). New handlers added to the literal are automatically instrumented - no per-handler boilerplate to remember.

The wrapper handles callback / promise / sync return paths, sets SpanStatusCode.OK or ERROR + recordException as appropriate, and sets cloudserver.error_code on the error path. When ENABLE_OTEL is unset @opentelemetry/api is not loaded and the wrapper returns the original function unchanged.

Issue: CLDSRV-884
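A wrapper with the behaviour described in this commit might be structured as in the sketch below. It is an illustration under assumptions, not the contents of lib/instrumentation/simple.js: the tracer is passed in explicitly (a null tracer stands in for "ENABLE_OTEL unset"), `instrumentAll` is a hypothetical name for the loop in api.js, and the value written to cloudserver.error_code (`err.code`, then `err.name`) is a placeholder. The span methods used (`setStatus`, `recordException`, `setAttribute`, `end`) and `tracer.startActiveSpan(name, fn)` do exist in @opentelemetry/api.

```javascript
'use strict';

// Same numeric values as SpanStatusCode in @opentelemetry/api.
const SpanStatusCode = { OK: 1, ERROR: 2 };

function instrumentApiMethod(tracer, methodName, handler) {
    // Tracing disabled: return the original function unchanged.
    if (!tracer) {
        return handler;
    }
    return function instrumented(...args) {
        return tracer.startActiveSpan(`api.${methodName}`, span => {
            let ended = false;
            const finish = err => {
                if (ended) {
                    return;
                }
                ended = true;
                if (err) {
                    span.recordException(err);
                    span.setAttribute('cloudserver.error_code',
                        err.code || err.name || 'UnknownError');
                    span.setStatus({ code: SpanStatusCode.ERROR });
                } else {
                    span.setStatus({ code: SpanStatusCode.OK });
                }
                span.end();
            };
            const last = args[args.length - 1];
            const hasCallback = typeof last === 'function';
            if (hasCallback) {
                // Callback path: end the span when the handler calls back.
                args[args.length - 1] = (err, ...rest) => {
                    finish(err);
                    return last(err, ...rest);
                };
            }
            try {
                const result = handler.apply(this, args);
                if (result && typeof result.then === 'function') {
                    // Promise path: end the span on settle.
                    return result.then(
                        value => { finish(); return value; },
                        err => { finish(err); throw err; });
                }
                // Sync path: only when no callback will report completion.
                if (!hasCallback) {
                    finish();
                }
                return result;
            } catch (err) {
                finish(err);
                throw err;
            }
        });
    };
}

// Dispatcher and pure helpers that the commit message excludes.
const NOT_WRAPPED = new Set([
    'callApiMethod', 'checkAuthResults', 'handleAuthorizationResults',
]);

// Wraps every function-valued key of `api` in place, minus the exclusions.
function instrumentAll(tracer, api) {
    for (const [name, value] of Object.entries(api)) {
        if (typeof value === 'function' && !NOT_WRAPPED.has(name)) {
            api[name] = instrumentApiMethod(tracer, name, value);
        }
    }
    return api;
}
```

The `ended` guard keeps the span from being closed twice when a handler both invokes its callback and returns a promise; whichever completes first decides the status.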
Temporarily point the arsenal dep at scality/Arsenal#improvement/ARSN-572/trace-context so we can validate end-to-end trace context propagation from cloudserver HTTP spans through to the MongoDB oplog on a test cluster. ARSN-572 adds traceContext plumbing on metadata writes; cloudserver needs no code change thanks to OTEL async context hooks.

Yarn resolves the branch ref and pins the resolved commit hash in yarn.lock so installs are reproducible. Revert to a clean #8.x release tag once ARSN-572 ships.

Issue: CLDSRV-884
Force-pushed: 6e0d08c to 5b3590b
      "@opentelemetry/sdk-trace-base": "^2.7.0",
      "@smithy/node-http-handler": "^3.0.0",
-     "arsenal": "git+https://github.com/scality/Arsenal#8.4.1",
+     "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",
Arsenal is pinned to a branch (improvement/ARSN-572/trace-context) instead of a tag. Per project convention, git-based deps must pin to a tag. This should be updated to a released arsenal tag that includes the trace-context changes before merging.
— Claude Code
Not human-reviewed yet. Not asking for reviews at the moment!
Summary
Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true.

- lib/otel.js: OTLP/HTTP trace exporter, 1% default sampling ratio, auto-instrumentation for HTTP, Express, and MongoDB
- lib/instrumentation/simple.js: auto-wrapping of vault, storage, metadata, api, and services components with low-cardinality span names
- lib/server.js: extracts inbound traceparent/tracestate headers and scopes request processing under the remote context
- lib/api/api.js: all 70+ S3 API handlers wrapped with instrumentApiMethod() for per-operation spans

When ENABLE_OTEL is not set, there is zero overhead: the OTEL SDK is not loaded, @opentelemetry/api is not required in server.js, and instrumentApiMethod() returns the original function unchanged.

Origin of changes
Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (based on development/9.0, July–Sep 2025). The original branch mixed OTEL instrumentation with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto development/9.3.

What was removed vs the source branch:
- lib/otelContextPropagation.js (manual global.currentTraceHeaders hack, never imported)
- console.log() statements, commented-out span code
- MOCK_DOAUTH / lib/api/apiUtils/mock/backendMocks.js (caches first real auth result; dangerous in production, unrelated to OTEL)
- GC_INTERVAL_MS / manual GC, monitorLatency(), releaseRequestContexts() pooling, arsenal perf pin; these will go in a separate PR
- 8.3.8 (9.3's version) instead of William's perf-pinned commit 20a5fdc2

What was fixed during review:
- @opentelemetry/sdk-trace-base version aligned to ^1.28.0 (compatible with sdk-node ^0.55.0)
- OTEL_SAMPLING_RATIO=0 now correctly disables sampling (was falling back to 1% due to || vs ??)
- ENABLE_OTEL check (was running unconditionally)
- context.with() for proper parent-child span linking
- SemanticResourceAttributes replaced with string literals
- service.version reads from package.json instead of hardcoded value

Context
@opentelemetry/instrumentation-http auto-instrumentation should automatically inject traceparent into outgoing hdclient HTTP calls, connecting cloudserver traces to hdcontroller/hyperiod. This has never been tested end-to-end yet.

Verification
- ENABLE_OTEL=true and an OTEL collector endpoint
- -t flag) and hyperiod (enable_tracer: true); verify traces flow end-to-end as a single distributed trace
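The OTEL_SAMPLING_RATIO=0 fix listed under "What was fixed during review" comes down to how `||` and `??` treat a numeric zero. The sketch below contrasts the two; the function names are hypothetical and the parsing details are an assumption, since the actual code in lib/otel.js is not shown on this page.

```javascript
'use strict';

const DEFAULT_RATIO = 0.01; // the 1% default mentioned in the summary

// Buggy variant: 0 is falsy, so "0" silently becomes the default.
function ratioWithOr(raw) {
    return Number.parseFloat(raw) || DEFAULT_RATIO;
}

// Fixed variant: only an unset or unparsable value falls back.
function ratioWithNullish(raw) {
    const parsed = Number.parseFloat(raw);
    // NaN is mapped to null so that `??` can supply the default,
    // while a legitimate 0 passes through untouched.
    return (Number.isNaN(parsed) ? null : parsed) ?? DEFAULT_RATIO;
}

console.log(ratioWithOr('0'));            // 0.01
console.log(ratioWithNullish('0'));       // 0
console.log(ratioWithNullish(undefined)); // 0.01
```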