When tail-based sampling (TBS) is enabled, APM Server publishes finalized sampling decisions to the `traces-apm.sampled-*` data stream and subscribes to the same stream to consume decisions from peer APM Servers. To skip its own decisions during subscription, the query uses a `must_not` filter on `agent.ephemeral_id`, which is set to `samplerUUID`.
`samplerUUID` is a package-level `var` generated fresh on every process start:
apm-server/x-pack/apm-server/main.go, lines 45 to 47 at e899b88:

```go
// samplerUUID is a UUID used to identify sampled trace ID documents
// published by this process.
samplerUUID = uuid.Must(uuid.NewV4())
```
Because the UUID rotates on every restart, the self-filter for a restarted process no longer matches decisions the same server published in previous incarnations. Those decisions are then re-fetched from Elasticsearch even though they are already present in the local decision DB.
Nothing is functionally incorrect here (decisions are idempotent on the consumer side, keyed by trace ID). The cost is wasted bandwidth, CPU, and disk writes on every restart.
/cc @carsonip, who helped uncover this.
Behaviour
**Single-instance deployments.** Every document in the data stream is published by this instance, so every subscribe poll matches only self-docs and returns zero hits. `maxObservedSeqno` stays at -1 in `searchIndexTraceIDs`, the `if maxSeqno > observedSeqno` gate in `searchTraceIDs` never fires, and `subscriber_position.json` is never advanced past its initial state. On restart the resumed subscriber issues `_seq_no > -1 AND agent.ephemeral_id != newUUID`, which matches every document still retained in the data stream. The paginated loop drains it at 1000 docs per page.
With persistent storage, the decision DB and `subscriber_position.json` both survive. This causes no correctness issues, since the re-ingested decisions are idempotent overwrites; the visible effect is redundant network and disk activity.
The persistent-storage improvement from #4437 is effectively shadowed by the UUID rotation.
**Multi-instance deployments.** The position advances past peer-published decisions during normal operation, so the re-fetched window on restart is bounded to the tail of recently written self-docs (between the last peer-observed `_seq_no` and the current global checkpoint). Still redundant, but with smaller overall impact.
Impact
- The cost scales with throughput and ILM retention on `traces-apm.sampled-*` indices.
- On restart, it causes elevated CPU, disk, and network activity until the stream is drained.
- There is no impact with ephemeral storage, where the re-fetch is necessary anyway. The impact is most pronounced with persistent storage, where the re-fetch could be mostly avoided.
Open question
Was the per-process scoping of `samplerUUID` intentional? The comment at `main.go:45-47` suggests yes but does not clarify for what purpose. Before opening a PR we need to confirm whether there is a correctness argument behind per-process identity.