
TBS: persist the sampling-decision publisher UUID across restarts to avoid redundant resync #20945

@endorama

Description


When tail-based sampling (TBS) is enabled, APM Server publishes finalized sampling decisions to the traces-apm.sampled-* data stream and subscribes to the same stream to consume decisions from peer APM Servers. To skip its own decisions during subscription, the query uses a must_not filter on agent.ephemeral_id, which is set to samplerUUID.
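The self-filter can be sketched as follows. This is an illustrative reconstruction, not the server's actual query-building code: the bool/must_not/term shape and the function name are assumptions; only the must_not condition on agent.ephemeral_id comes from the source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildSubscribeQuery sketches the subscriber's self-filter: match all
// sampled-decision documents except those published by this process,
// identified by its ephemeral ID (samplerUUID).
// Hypothetical helper; the real query construction may differ.
func buildSubscribeQuery(samplerUUID string) map[string]any {
	return map[string]any{
		"query": map[string]any{
			"bool": map[string]any{
				"must_not": map[string]any{
					"term": map[string]any{
						"agent.ephemeral_id": samplerUUID,
					},
				},
			},
		},
	}
}

func main() {
	b, _ := json.Marshal(buildSubscribeQuery("00000000-0000-4000-8000-000000000000"))
	fmt.Println(string(b))
}
```

Because the filter is an exact term match on the current process's UUID, any document written under a previous incarnation's UUID passes the filter and is re-fetched.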

samplerUUID is a package-level var generated fresh on every process start:

```go
// samplerUUID is a UUID used to identify sampled trace ID documents
// published by this process.
samplerUUID = uuid.Must(uuid.NewV4())
```

Because the UUID rotates on every restart, the self-filter for a restarted process no longer matches decisions the same server published in previous incarnations. Those decisions are then re-fetched from Elasticsearch even though they are already present in the local decision DB.

Nothing is functionally incorrect here (decisions are idempotent on the consumer side, keyed by trace ID). The cost is wasted bandwidth, CPU, and disk writes on every restart.

/cc @carsonip who helped uncover this

Behaviour

Single-instance deployments. Every document in the data stream is published by this instance, so every subscribe poll matches only self-docs and returns zero hits. maxObservedSeqno stays at -1 in searchIndexTraceIDs, the if maxSeqno > observedSeqno gate in searchTraceIDs never fires, and subscriber_position.json is never advanced past its initial state. On restart the resumed subscriber issues _seq_no > -1 AND agent.ephemeral_id != newUUID, which matches every document still retained in the data stream. The paginated loop drains it at 1000 docs per page.

With persistent storage, both the decision DB and subscriber_position.json survive the restart. This does not cause correctness issues, as the re-ingested decisions are idempotent overwrites; the visible effect is redundant network and disk activity.
The persistent-storage improvement from #4437 is effectively negated by the UUID rotation.

Multi-instance deployments. The position advances past peer-published decisions during normal operation, so the re-fetched window on restart is bounded to the tail of recently-written self-docs (between the last peer-observed _seq_no and the current global checkpoint). Still redundant, but with smaller overall impact.

Impact

The cost scales with throughput and with ILM retention on the traces-apm.sampled-* indices.

On every restart, CPU, disk, and network activity are elevated until the stream is drained.

With ephemeral storage there is no impact, since the re-fetch is necessary anyway. The impact is most pronounced with persistent storage, where the re-fetch could be mostly avoided.

Open question

Was the per-process scoping of samplerUUID intentional? The comment at main.go:45-47 suggests it was, but does not say for what purpose. Before opening a PR we should confirm whether there is a correctness argument behind per-process identity.
