feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation #6140

Draft
delthas wants to merge 4 commits into development/9.3 from improvement/CLDSRV-884/otel-instrumentation

delthas commented Apr 2, 2026

Not human-reviewed yet. Not asking for reviews at the moment!

Summary

Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true.

  • NodeSDK setup (lib/otel.js): OTLP/HTTP trace exporter, 1% default sampling ratio, auto-instrumentation for HTTP, Express, and MongoDB
  • Proxy-based instrumentation (lib/instrumentation/simple.js): auto-wrapping of vault, storage, metadata, api, and services components with low-cardinality span names
  • W3C trace context propagation (lib/server.js): extracts inbound traceparent/tracestate headers and scopes request processing under the remote context
  • API handler instrumentation (lib/api/api.js): all 70+ S3 API handlers wrapped with instrumentApiMethod() for per-operation spans

When ENABLE_OTEL is not set, there is zero overhead — the OTEL SDK is not loaded, @opentelemetry/api is not required in server.js, and instrumentApiMethod() returns the original function unchanged.
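
A minimal sketch of this gating, assuming a factory that receives the environment and a loader function. The names `createInstrumentApiMethod`, `loadTracer` and `createFakeTracer` are hypothetical and not taken from this PR; a fake tracer stands in for `@opentelemetry/api` so the snippet runs on its own.

```javascript
// Hypothetical factory; names are illustrative, not taken from the PR.
function createInstrumentApiMethod(env, loadTracer) {
    if (env.ENABLE_OTEL !== 'true') {
        // Disabled: never call the loader, hand back the handler untouched.
        return handler => handler;
    }
    // Enabled: load the tracing API only now. In real code the loader could be
    // `() => require('@opentelemetry/api').trace.getTracer('cloudserver')`
    // (the tracer name is an assumption).
    const tracer = loadTracer();
    return (handler, methodName) => function instrumented(...args) {
        const span = tracer.startSpan(`api.${methodName}`);
        try {
            return handler.apply(this, args);
        } finally {
            // Simplification: only the synchronous path is covered here.
            span.end();
        }
    };
}

// Stand-in tracer that records span events instead of exporting them.
function createFakeTracer(log) {
    return {
        startSpan(name) {
            log.push(`start ${name}`);
            return { end: () => log.push(`end ${name}`) };
        },
    };
}

const log = [];
const objectGet = () => 'body';

const disabled = createInstrumentApiMethod({}, () => {
    throw new Error('loader must not run when OTEL is disabled');
});
console.log(disabled(objectGet, 'objectGet') === objectGet);

const enabled = createInstrumentApiMethod(
    { ENABLE_OTEL: 'true' }, () => createFakeTracer(log));
console.log(enabled(objectGet, 'objectGet')(), log);
```

Injecting the loader keeps the disabled path free of any module loading, which is the property the summary calls zero overhead.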

Origin of changes

Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (based on development/9.0, July–Sep 2025). The original branch mixed OTEL instrumentation with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto development/9.3.

What was removed vs the source branch:

  • Dead code: lib/otelContextPropagation.js (manual global.currentTraceHeaders hack, never imported)
  • Debug artifacts: ~15 console.log() statements, commented-out span code
  • Mock feature: MOCK_DOAUTH / lib/api/apiUtils/mock/backendMocks.js (caches first real auth result — dangerous in production, unrelated to OTEL)
  • Performance optimizations: GC_INTERVAL_MS / manual GC, monitorLatency(), releaseRequestContexts() pooling, arsenal perf pin — these will go in a separate PR
  • Arsenal dependency reverted to standard 8.3.8 (9.3's version) instead of William's perf-pinned commit 20a5fdc2

What was fixed during review:

  • @opentelemetry/sdk-trace-base version aligned to ^1.28.0 (compatible with sdk-node ^0.55.0)
  • OTEL_SAMPLING_RATIO=0 now correctly disables sampling (was falling back to 1% due to || vs ??)
  • Trace context extraction gated behind ENABLE_OTEL check (was running unconditionally)
  • No-callback code paths now wrapped in context.with() for proper parent-child span linking
  • Deprecated SemanticResourceAttributes replaced with string literals
  • service.version reads from package.json instead of hardcoded value
  • Redis-4 auto-instrumentation explicitly disabled
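
The sampling-ratio fix can be illustrated with a small parser. This is a sketch under assumptions, not the code in lib/otel.js: `parseSamplingRatio` and `buggyParse` are hypothetical names, and clamping to [0, 1] is an added assumption.

```javascript
// Hypothetical helper, not the PR's code: parse OTEL_SAMPLING_RATIO safely.
const DEFAULT_SAMPLING_RATIO = 0.01; // the 1% default mentioned in the summary

function parseSamplingRatio(raw) {
    if (raw === undefined || raw === '') {
        return DEFAULT_SAMPLING_RATIO;
    }
    const parsed = Number.parseFloat(raw);
    // Non-numeric input parses to NaN, which must not reach the sampler.
    if (!Number.isFinite(parsed)) {
        return DEFAULT_SAMPLING_RATIO;
    }
    // Assumption: out-of-range values are clamped to the [0, 1] interval.
    return Math.min(1, Math.max(0, parsed));
}

// One plausible form of the original bug: 0 is falsy, so `||` replaces it.
const buggyParse = raw => Number.parseFloat(raw) || DEFAULT_SAMPLING_RATIO;

console.log(buggyParse('0'));           // 0.01 (sampling not disabled)
console.log(parseSamplingRatio('0'));   // 0
console.log(parseSamplingRatio('abc')); // 0.01
```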

Context

  • Jira: CLDSRV-884
  • Parent investigation: OS-1072
  • Scality ADR mandates OpenTelemetry across all products
  • The storage layer (hdcontroller 1.12+ / hyperiod) already has full OTEL
  • The @opentelemetry/instrumentation-http auto-instrumentation should automatically inject traceparent into outgoing hdclient HTTP calls, connecting cloudserver traces to hdcontroller/hyperiod. This has not yet been tested end-to-end.

Verification

  1. Deploy with ENABLE_OTEL=true and an OTEL collector endpoint
  2. Send S3 requests → check traces appear in Jaeger/Tempo
  3. Critical: deploy alongside hdcontroller 1.12+ (-t flag) and hyperiod (enable_tracer: true) — verify traces flow end-to-end as a single distributed trace
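
For step 2, it can help to send a request with a known trace ID so the trace is easy to find in the backend. The sketch below only builds a W3C `traceparent` header value (format `version-traceid-parentid-flags`); `buildTraceparent` is a hypothetical helper, and how the header is attached to the S3 request depends on the client in use.

```javascript
const crypto = require('node:crypto');

// Build a W3C `traceparent` value: version-traceid-parentid-flags.
function buildTraceparent(sampled = true) {
    const traceId = crypto.randomBytes(16).toString('hex'); // 32 hex chars
    const parentId = crypto.randomBytes(8).toString('hex'); // 16 hex chars
    const flags = sampled ? '01' : '00';
    return `00-${traceId}-${parentId}-${flags}`;
}

const header = buildTraceparent();
console.log(`traceparent: ${header}`);
// The trace ID to search for in Jaeger/Tempo is the second field.
console.log(`trace id: ${header.split('-')[1]}`);
```

With a parent-based sampler, a request carrying the sampled flag (`01`) should be recorded regardless of the local sampling ratio, which makes manual verification less dependent on the 1% default.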

bert-e commented Apr 2, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e commented Apr 2, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

codecov Bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 14.08451% with 122 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.81%. Comparing base (cebfbe5) to head (5b3590b).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/otel.js | 8.33% | 77 Missing ⚠️ |
| lib/instrumentation/simple.js | 12.19% | 36 Missing ⚠️ |
| lib/tracing/healthPaths.js | 0.00% | 7 Missing ⚠️ |
| lib/server.js | 60.00% | 2 Missing ⚠️ |
Additional details and impacted files

| Files with missing lines | Coverage Δ | |
|---|---|---|
| lib/api/api.js | 91.48% <100.00%> (+0.15%) | ⬆️ |
| lib/server.js | 79.42% <60.00%> (-0.19%) | ⬇️ |
| lib/tracing/healthPaths.js | 0.00% <0.00%> (ø) | |
| lib/instrumentation/simple.js | 12.19% <12.19%> (ø) | |
| lib/otel.js | 8.33% <8.33%> (ø) | |

... and 3 files with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.3    #6140      +/-   ##
===================================================
- Coverage            84.62%   83.81%   -0.82%     
===================================================
  Files                  206      209       +3     
  Lines                13322    13462     +140     
===================================================
+ Hits                 11274    11283       +9     
- Misses                2048     2179     +131     
| Flag | Coverage Δ | |
|---|---|---|
| file-ft-tests | 67.91% <14.08%> (-0.63%) | ⬇️ |
| kmip-ft-tests | 28.22% <14.08%> (-0.16%) | ⬇️ |
| mongo-v0-ft-tests | 69.11% <14.08%> (-0.60%) | ⬇️ |
| mongo-v1-ft-tests | 69.12% <14.08%> (-0.66%) | ⬇️ |
| multiple-backend | 36.49% <14.08%> (-0.25%) | ⬇️ |
| sur-tests | 36.52% <14.08%> (-0.25%) | ⬇️ |
| sur-tests-inflights | 37.51% <14.08%> (-0.31%) | ⬇️ |
| utapi-v2-tests | 34.43% <14.08%> (-0.22%) | ⬇️ |


Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated

claude Bot commented Apr 2, 2026

  • parseFloat(OTEL_SAMPLING_RATIO) can produce NaN for non-numeric input, which may cause TraceIdRatioBasedSampler to misbehave
    • Add a Number.isNaN guard with fallback to 0.01
  • args.findIndex picks the first function argument as the callback, but callbacks are always the last argument in this codebase
    • Use findLastIndex in both createInstrumentedProxy and instrumentApiMethod
  • In the callback-wrapping branch of both createInstrumentedProxy and instrumentApiMethod, a synchronous throw from the wrapped function leaks the span (never ended)
    • Wrap the apply call in try/catch to end the span
  • instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices are exported but never used anywhere in this PR
    • Remove them or wire them in

Review by Claude Code
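
The two callback-related suggestions in this review can be sketched together. This is an illustration under assumptions rather than the PR's implementation: `wrapLastCallback` and `createFakeSpan` are hypothetical names, and the fake span only counts `end()` calls.

```javascript
// Minimal stand-in for an OTEL span so the sketch runs without the SDK.
function createFakeSpan() {
    return { endCount: 0, end() { this.endCount += 1; } };
}

// Hypothetical wrapper: treats the LAST function argument as the callback.
function wrapLastCallback(fn, span) {
    return function wrapped(...args) {
        const cbIndex = args.findLastIndex(arg => typeof arg === 'function');
        if (cbIndex !== -1) {
            const originalCb = args[cbIndex];
            args[cbIndex] = function wrappedCb(...cbArgs) {
                span.end();
                return originalCb.apply(this, cbArgs);
            };
        }
        try {
            const result = fn.apply(this, args);
            // No callback to wait for: end the span on return.
            if (cbIndex === -1) {
                span.end();
            }
            return result;
        } catch (err) {
            // A synchronous throw would otherwise leave the span open.
            span.end();
            throw err;
        }
    };
}

// With findIndex, `transform` (the first function) would be mistaken
// for the callback; findLastIndex picks `done`.
const span = createFakeSpan();
const handler = (transform, done) => done(null, transform(2));
wrapLastCallback(handler, span)(x => x * 2, (err, value) => {
    console.log(value, span.endCount);
});
```

This sketch does not guard against ending the span twice when a handler calls its callback and then throws synchronously.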

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 9c93a1d to 97e63fc on April 2, 2026 16:32
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js
Comment thread lib/otel.js Outdated

claude Bot commented Apr 2, 2026

  • No graceful OTEL SDK shutdown — buffered traces lost on process exit (lib/otel.js:53)
    • Register SIGTERM/SIGINT handlers calling sdk.shutdown()
  • OTEL_SAMPLING_RATIO parsed with parseFloat but NaN not guarded (lib/otel.js:35)
    • Add Number.isFinite() check with fallback to 0.01
  • ~300 lines of dead code: instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices exported but never imported (lib/instrumentation/simple.js)
    • Remove until wired in, or add consumers in this PR
  • Version mismatch: auto-instrumentations-node@^0.50.2 bundles older versions of instrumentation-http/express/mongodb than the direct deps — risk of duplicate instrumentation (package.json)
    • Either bump auto-instrumentations-node or remove the individual direct deps

Review by Claude Code

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 97e63fc to f978280 on April 2, 2026 16:35
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/otel.js
Comment thread lib/instrumentation/simple.js
Comment thread lib/instrumentation/simple.js Outdated

claude Bot commented Apr 2, 2026

  • lib/otel.js:51 - No graceful shutdown hook for the OTEL SDK; pending traces will be lost on SIGTERM/SIGINT
    • Add process.on('SIGTERM'/'SIGINT', () => sdk.shutdown())
  • lib/instrumentation/simple.js:396 - instrumentApiMethod no-callback path ends spans synchronously, unlike createInstrumentedProxy which handles promises
    • Add the same .then() check for promise-returning methods
  • lib/instrumentation/simple.js:23 - Proxy get trap does not guard against Symbol-typed prop, which would crash getSpanName()
    • Add: if (typeof prop === 'symbol') return originalValue
  • lib/instrumentation/simple.js:470 - instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices are exported but never used (~300 lines of dead code)
    • Remove and add when actually wired in

Review by Claude Code
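
The Symbol guard suggested for the Proxy get trap could look roughly like this. It is a sketch, not the code in lib/instrumentation/simple.js: `createProxySketch` and the `onCall` hook are hypothetical, and the span name is built with a template literal, which throws a TypeError if given a Symbol.

```javascript
// Hypothetical proxy factory; `onCall` stands in for span creation.
function createProxySketch(target, componentName, onCall) {
    return new Proxy(target, {
        get(obj, prop, receiver) {
            const originalValue = Reflect.get(obj, prop, receiver);
            // Symbol keys (Symbol.toPrimitive, Symbol.iterator, ...) cannot be
            // interpolated into a span name; pass them through untouched.
            if (typeof prop === 'symbol' || typeof originalValue !== 'function') {
                return originalValue;
            }
            return function instrumented(...args) {
                onCall(`${componentName}.${prop}`);
                return originalValue.apply(obj, args);
            };
        },
    });
}

const calls = [];
const storage = createProxySketch({
    getObject() { return 'data'; },
    [Symbol.toPrimitive]() { return 'storage-client'; },
}, 'storage', name => calls.push(name));

console.log(storage.getObject());  // goes through the instrumented wrapper
console.log(`${storage}`);         // Symbol lookup passes through the guard
console.log(calls);
```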

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from f978280 to 06eea4e on April 2, 2026 16:49
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated

claude Bot commented Apr 2, 2026

  • Dead code: instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported from lib/instrumentation/simple.js but never imported anywhere. Only instrumentApiMethod is used. Remove them or wire them up.
    • Remove unused exports and the ~300 lines of proxy-wrapping code for vault/storage/metadata/api/services, or add the call sites that use them
  • Unconditional require: @opentelemetry/api is required at the top of lib/instrumentation/simple.js even when ENABLE_OTEL=false. This adds module loading overhead when OTEL is disabled.
    • Lazy-require it inside the enableOtel guard, consistent with lib/otel.js and lib/server.js
  • Loose version ranges on 0.x OTEL packages: ^0.50.2 and ^0.55.0 allow semver-minor bumps which can be breaking in 0.x.
    • Pin to exact versions or use tilde ranges (~0.55.0)
  • Double span.end() possible in callback path: If a method throws synchronously after having already called its callback, both the wrapped callback and the catch block will call span.end(). Benign but worth a comment.


Review by Claude Code

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 06eea4e to 75c3afb on April 2, 2026 17:02
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated

claude Bot commented May 5, 2026

  • arsenal pinned to branch improvement/ARSN-572/trace-context instead of a tag — violates project dependency pinning policy
    • Pin to an existing tag (e.g. 8.4.1) or cut a new arsenal tag that includes the trace-context changes before merging
  • Manual propagation.extract + context.with in lib/server.js:217 conflicts with @opentelemetry/instrumentation-http auto-instrumentation — breaks trace hierarchy for distributed traces by making API spans siblings of the HTTP server span instead of children
    • Remove the manual extraction block and rely on auto-instrumentation for incoming W3C context propagation
  • Double span.end() possible in lib/instrumentation/simple.js:96 callback path if the original callback throws synchronously
    • Track span-ended state with a boolean flag to skip the catch-block cleanup when the span was already ended in the wrapped callback

Review by Claude Code
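
The boolean-flag suggestion for the double `span.end()` case can be sketched as an "end once" helper. The names `createEndOnce` and `callWithSpan` are hypothetical, and the span is a stand-in that counts `end()` calls.

```javascript
// Returns a function that ends the span at most once.
function createEndOnce(span) {
    let ended = false;
    return function endOnce() {
        if (ended) {
            return false;
        }
        ended = true;
        span.end();
        return true;
    };
}

// Hypothetical wrapper: both the callback path and the catch block share
// the same guard, so whichever runs first ends the span.
function callWithSpan(fn, span, callback) {
    const endOnce = createEndOnce(span);
    try {
        return fn((...cbArgs) => {
            endOnce();
            return callback(...cbArgs);
        });
    } catch (err) {
        endOnce();
        throw err;
    }
}

// A handler that calls back and then throws synchronously.
const span = { endCount: 0, end() { this.endCount += 1; } };
try {
    callWithSpan(cb => { cb(null); throw new Error('late failure'); }, span, () => {});
} catch (err) {
    console.log(err.message, span.endCount);
}
```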

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch 2 times, most recently from 28598a8 to ba626f8 on May 6, 2026 09:34
Comment thread package.json
Comment thread package.json Outdated
Comment thread lib/server.js Outdated

claude Bot commented May 6, 2026

  • Arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a tag; this violates the project's dependency pinning policy for git-based deps. Must be updated to a tagged release before merge.
  • @types/pg resolution pinned to 8.6.1 appears unrelated to OTEL. Should be documented or removed to keep the PR focused.
  • Added return in routeRequest changes method contract without a corresponding consumer in this PR. Consider removing or documenting.

Review by Claude Code

Comment thread package.json
Comment thread package.json Outdated

claude Bot commented May 6, 2026

  • Arsenal dependency is pinned to a branch (improvement/ARSN-572/trace-context), not a tag. Must be pinned to a tagged release before merge for reproducible builds.
    • Tag the Arsenal changes and update the pin here.
  • @opentelemetry/auto-instrumentations-node drags in ~40 unused instrumentations (pg, mysql, kafkajs, cucumber, etc.), bloating yarn.lock by ~1100 lines and widening the vulnerability surface.
    • Replace with the four individual instrumentation packages actually enabled (http, express, mongodb, ioredis).

Review by Claude Code

delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch 2 times, most recently from 29b84e2 to 2aa4eef on May 6, 2026 11:14
Comment thread package.json
Comment thread lib/server.js Outdated

claude Bot commented May 6, 2026

  • Arsenal dependency is pinned to a branch (improvement/ARSN-572/trace-context), not a tag — builds are not reproducible if the branch is rebased or force-pushed.
    • Pin to a release tag before merging.
  • cleanUp() chains shutdownOtel() with no .catch() — if the shutdown promise rejects past its internal guard, process.exit(0) never fires and the process hangs.
    • Add .catch() or use .finally(() => process.exit(0)).

Review by Claude Code
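
A sketch of the `.finally()` pattern suggested here, assuming the shutdown steps are passed in as functions. `closeServers`, `shutdownOtel` and `exit` are injected placeholders so the snippet runs without a real server; in the server, `exit` would presumably be `process.exit`.

```javascript
// Hypothetical shutdown chain; the parameter names are assumptions.
function cleanUp({ closeServers, shutdownOtel, exit }) {
    return Promise.resolve()
        .then(() => closeServers())
        .then(() => shutdownOtel())
        .catch(err => {
            // Log and continue: a failed flush must not block the exit.
            console.error(`shutdown error: ${err.message}`);
        })
        // Runs whether the chain resolved or rejected.
        .finally(() => exit(0));
}

// Demo with a flush that rejects: exit is still called.
cleanUp({
    closeServers: async () => {},
    shutdownOtel: async () => { throw new Error('exporter unreachable'); },
    exit: code => console.log(`exit(${code})`),
});
```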

Comment thread package.json
Comment thread lib/instrumentation/simple.js

claude Bot commented May 6, 2026

  • Arsenal dependency is pinned to a branch (improvement/ARSN-572/trace-context), not a tag. Must be updated to a tag before merge to ensure reproducible builds.
    • Pin to the Arsenal release tag that includes the trace-context changes.
  • instrumentApiMethod (the core instrumentation wrapper) has no unit tests despite complex span lifecycle logic (callback wrapping, double-end guard, async path).
    • Add tests for callback-based and promise-based handler wrapping, error propagation, and the OTEL-disabled passthrough.

Review by Claude Code
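
The promise-based path that this review asks to test could be exercised along these lines. The wrapper below is a hypothetical stand-in for instrumentApiMethod, and the fake span records a plain status string where the real OTEL span takes a status object.

```javascript
// Stand-in span: records status and end() calls.
function createFakeSpan() {
    return {
        status: 'UNSET',
        endCount: 0,
        setStatus(status) { this.status = status; },
        end() { this.endCount += 1; },
    };
}

// Hypothetical wrapper covering sync return, sync throw and promise paths.
function wrapHandler(handler, span) {
    const finish = status => { span.setStatus(status); span.end(); };
    return function wrapped(...args) {
        let result;
        try {
            result = handler.apply(this, args);
        } catch (err) {
            finish('ERROR');
            throw err;
        }
        if (result && typeof result.then === 'function') {
            // Promise path: end the span only once the promise settles.
            return result.then(
                value => { finish('OK'); return value; },
                err => { finish('ERROR'); throw err; },
            );
        }
        finish('OK');
        return result;
    };
}

async function demo() {
    const okSpan = createFakeSpan();
    const value = await wrapHandler(async x => x + 1, okSpan)(1);
    console.log(value, okSpan.status, okSpan.endCount);

    const errSpan = createFakeSpan();
    await wrapHandler(async () => { throw new Error('NoSuchKey'); }, errSpan)()
        .catch(err => console.log(err.message, errSpan.status, errSpan.endCount));
}

demo();
```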

…ring

Behind ENABLE_OTEL=true:
- NodeSDK with OTLP/HTTP exporter and ParentBased + ratio sampler
  so we honor upstream NGINX/Beyla sampling decisions instead of
  re-sampling them away.
- HTTP instrumentation with two hooks:
  - ignoreIncomingRequestHook drops spans on probe / scrape paths
    (/live, /ready, /_/healthcheck, /_/healthcheck/deep, /metrics)
    and OPTIONS preflight.
  - requestHook strips traceparent/tracestate on outbound requests
    to hosts not in TRUSTED_HOSTS, so trace IDs do not leak to
    external destinations. The client span is preserved (we still
    observe the call) and tagged scality.trace.suppressed=true.
- buildTrustedHosts derives the allowlist from cloudserver Config
  (vaultd, dataClient, metadataClient, bucketd, KMIP, KMS, scuba,
  utapi, mongodb, hdclient/sproxyd connectors from locationConfig,
  PUSH/MANAGEMENT_ENDPOINT env vars, plus loopback). A unit test
  asserts every Config host shape is covered so the derivation
  stays honest as new backends land.
- shutdownOtel() helper for the server's cleanUp() to await the
  exporter flush before process.exit so in-flight traces are not
  lost on SIGTERM.
- mongodb auto-instrumentation tuned for low cardinality; ioredis
  enabled with requireParentSpan; fs / redis (v2/v3/v4) / aws-sdk
  disabled.

When ENABLE_OTEL is unset the SDK and @opentelemetry/* packages are
not loaded at all - zero overhead off the OTEL path.

Issue: CLDSRV-884
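
The two HTTP hooks described in this commit message can be sketched as plain functions. This is an illustration under assumptions: `shouldIgnoreIncoming` and `suppressIfUntrusted` are hypothetical names, the host parsing is simplified (no IPv6 literals), and fake request and span objects replace the real ones so the snippet runs without the SDK.

```javascript
// Probe and scrape paths listed in the commit message.
const IGNORED_PATHS = new Set([
    '/live', '/ready', '/_/healthcheck', '/_/healthcheck/deep', '/metrics',
]);

// Candidate body for an ignoreIncomingRequestHook: true means "no span".
function shouldIgnoreIncoming(request) {
    if (request.method === 'OPTIONS') {
        return true;
    }
    const path = (request.url || '').split('?')[0];
    return IGNORED_PATHS.has(path);
}

// Candidate body for a requestHook on outbound requests: strip the trace
// headers for untrusted hosts, keep the span and tag it as suppressed.
function suppressIfUntrusted(span, request, trustedHosts) {
    const hostHeader = String(request.getHeader('host') || '');
    const host = hostHeader.split(':')[0].toLowerCase(); // simplified parsing
    if (trustedHosts.has(host)) {
        return false;
    }
    request.removeHeader('traceparent');
    request.removeHeader('tracestate');
    span.setAttribute('scality.trace.suppressed', true);
    return true;
}

// Fakes standing in for http.ClientRequest and an OTEL span.
const headers = new Map([['host', 'example.com:443'], ['traceparent', '00-...']]);
const fakeRequest = {
    getHeader: name => headers.get(name),
    removeHeader: name => headers.delete(name),
};
const attributes = {};
const fakeSpan = { setAttribute: (key, value) => { attributes[key] = value; } };

console.log(shouldIgnoreIncoming({ method: 'GET', url: '/metrics' }));
console.log(suppressIfUntrusted(fakeSpan, fakeRequest, new Set(['localhost'])));
console.log(headers.has('traceparent'), attributes);
```
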
delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 2aa4eef to 6e0d08c on May 6, 2026 11:37
Comment thread package.json
"@opentelemetry/instrumentation-http": "~0.216.0",
"@opentelemetry/instrumentation-ioredis": "~0.64.0",
"@opentelemetry/instrumentation-mongodb": "~0.69.0",
"@opentelemetry/resources": "^2.7.0",

Arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a tag. Git-based deps must pin to a tag to prevent unpredictable breakage when the branch moves. This should be updated to a tagged release before merging.

— Claude Code

Comment thread lib/instrumentation/simple.js
Comment thread lib/server.js Outdated

claude Bot commented May 6, 2026

  • Arsenal is pinned to a branch (improvement/ARSN-572/trace-context), not a tag; git-based deps must pin to a tagged release before merging
    • Update to a tagged Arsenal release once the trace-context changes are released
  • instrumentApiMethod (the core callback/span wrapping logic) has no unit tests
    • Add tests for callback wrapping, async path, double-end guard, and OTEL-disabled passthrough
  • cleanUp() promise chain can hang if shutdownOtel() rejects unexpectedly: there is no .catch()/.finally()
    • Use .finally(() => process.exit(0)) to match the pattern already used in caughtExceptionShutdown

Review by Claude Code

delthas added 3 commits May 6, 2026 13:42
Wire shutdownOtel() into the server's cleanUp() chain between closing
HTTP servers and process.exit(0). Without this, async sdk.shutdown()
fired by signal handlers can race the exit and lose buffered spans for
whatever was still in flight at SIGTERM time.

Inbound traceparent extraction is intentionally NOT done here:
@opentelemetry/instrumentation-http already calls propagation.extract
on every incoming request, creates a server span as a child of the
remote parent, and sets that server span as the active context. A
manual extract on top of that would replace the active context with
the (non-recording) remote parent and demote downstream api spans
to siblings - rather than children - of the HTTP server span,
breaking the trace hierarchy in exactly the distributed-tracing
scenarios the manual block was meant to support.

Issue: CLDSRV-884

Add lib/instrumentation/simple.js exporting instrumentApiMethod, a
wrapper that surrounds an S3 handler invocation with an OTEL span
named api.<methodName>. The span owns the entire handler execution
(auth, body parsing, metadata I/O, data path, finalizers) and
becomes the parent for any auto-instrumentation spans (HTTP,
MongoDB, ioredis) that fire underneath.

Span name is the handler name verbatim - objectGetACL stays
distinct from objectGet, objectPutTagging stays distinct from
objectPut. ~70 handlers means ~70 distinct span names total, well
within trace backend limits, and operators can tell variants apart
without reading attributes.

api.js applies the wrapper to every function-valued key in the api
object except for the dispatcher (callApiMethod) and pure helpers
(checkAuthResults, handleAuthorizationResults). New handlers added
to the literal are automatically instrumented - no per-handler
boilerplate to remember.

The wrapper handles callback / promise / sync return paths, sets
SpanStatusCode.OK or ERROR + recordException as appropriate, and
sets cloudserver.error_code on the error path. When ENABLE_OTEL is
unset @opentelemetry/api is not loaded and the wrapper returns the
original function unchanged.

Issue: CLDSRV-884
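
The "wrap every function-valued key except the dispatcher and helpers" rule from this commit message could be sketched as follows. `instrumentHandlers` and `fakeInstrument` are hypothetical names; the stand-in only records the span name that a real wrapper would use.

```javascript
// Keys left unwrapped, per the commit message above.
const NOT_INSTRUMENTED = new Set([
    'callApiMethod', 'checkAuthResults', 'handleAuthorizationResults',
]);

// Hypothetical helper: wrap every function-valued key in place.
function instrumentHandlers(api, instrument) {
    for (const [name, value] of Object.entries(api)) {
        if (typeof value === 'function' && !NOT_INSTRUMENTED.has(name)) {
            api[name] = instrument(value, name);
        }
    }
    return api;
}

// Stand-in for instrumentApiMethod: records the span name, then delegates.
const spanNames = [];
const fakeInstrument = (handler, name) => function wrapped(...args) {
    spanNames.push(`api.${name}`);
    return handler.apply(this, args);
};

const api = instrumentHandlers({
    callApiMethod: () => 'dispatch',
    objectGet: () => 'get',
    objectGetACL: () => 'acl',
    maxKeys: 1000, // non-function values are skipped
}, fakeInstrument);

api.callApiMethod();
api.objectGet();
api.objectGetACL();
console.log(spanNames); // [ 'api.objectGet', 'api.objectGetACL' ]
```

Because the loop works on whatever keys the object has, a handler added later is picked up without extra wiring, which matches the "no per-handler boilerplate" point above.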

Temporarily point the arsenal dep at scality/Arsenal#improvement/ARSN-572/trace-context
so we can validate end-to-end trace context propagation from cloudserver
HTTP spans through to the MongoDB oplog on a test cluster. ARSN-572 adds
traceContext plumbing on metadata writes; cloudserver needs no code change
thanks to OTEL async context hooks.

Yarn resolves the branch ref and pins the resolved commit hash in
yarn.lock so installs are reproducible.

Revert to a clean #8.x release tag once ARSN-572 ships.

Issue: CLDSRV-884
delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 6e0d08c to 5b3590b on May 6, 2026 11:42
Comment thread package.json
  "@opentelemetry/sdk-trace-base": "^2.7.0",
  "@smithy/node-http-handler": "^3.0.0",
- "arsenal": "git+https://github.com/scality/Arsenal#8.4.1",
+ "arsenal": "git+https://github.com/scality/Arsenal#improvement/ARSN-572/trace-context",

Arsenal is pinned to a branch (improvement/ARSN-572/trace-context) instead of a tag. Per project convention, git-based deps must pin to a tag. This should be updated to a released arsenal tag that includes the trace-context changes before merging.

— Claude Code


claude Bot commented May 6, 2026

  • Arsenal dependency is pinned to a branch (improvement/ARSN-572/trace-context) instead of a tag, violating the project's dependency pinning convention for git-based deps. This must be updated to a released tag before merging.

The OTEL instrumentation itself is well-structured: gating behind ENABLE_OTEL=true ensures zero overhead when disabled, the trusted-host propagation filter prevents trace context leaking to external services, the instrumentApiMethod wrapper correctly handles callback/async/sync-throw paths with double-end guards, and shutdown flushing has a proper deadline to avoid blocking process exit. Tests cover the key edge cases thoroughly.

Review by Claude Code


3 participants