
[+] implement parallel source discovery#1378

Merged
pashagolub merged 4 commits into master from parallel-source-discovery
Apr 29, 2026

Conversation


@pashagolub pashagolub commented Apr 24, 2026

Improves dead-source handling with parallel resolution and instance_up=0 on discovery failure.

Sources.ResolveDatabases() previously resolved each source sequentially. A single slow or unresponsive source (e.g. a continuous-discovery endpoint behind a firewall) would block discovery of all subsequent sources for the full connection timeout duration.

Sources are now resolved concurrently using sync.WaitGroup.Go(). Results are collected into a pre-allocated indexed slice to preserve deterministic ordering. Per-source error logging with source name is included in the resolver itself.

When a SourcePostgresContinuous or SourcePatroni source fails to resolve any databases, LoadSources() now emits instance_up=0 to the configured sinks. This makes the failure visible in dashboards and alerting, consistent with how unreachable directly-monitored sources are handled.
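
A minimal sketch of that failure path, using stand-in types (the real code uses `metrics.MeasurementEnvelope` and the reaper's measurement channel; everything below is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for pgwatch's source kinds and measurement envelope.
type Kind string

const (
	SourcePostgresContinuous Kind = "postgres-continuous-discovery"
	SourcePatroni            Kind = "patroni"
)

type Source struct {
	Name string
	Kind Kind
}

type Envelope struct {
	DBName string
	Metric string
	Data   map[string]any
}

// writeInstanceDown mimics the new behaviour: emit instance_up = 0 for a
// discovery source that resolved zero databases.
func writeInstanceDown(ch chan<- Envelope, s Source) {
	ch <- Envelope{
		DBName: s.Name,
		Metric: "instance_up",
		Data: map[string]any{
			"epoch_ns":    time.Now().UnixNano(),
			"instance_up": 0,
		},
	}
}

func main() {
	ch := make(chan Envelope, 1)
	src := Source{Name: "prod-cluster", Kind: SourcePatroni}
	resolved := 0 // pretend discovery found no databases
	if (src.Kind == SourcePostgresContinuous || src.Kind == SourcePatroni) && resolved == 0 {
		writeInstanceDown(ch, src)
	}
	env := <-ch
	fmt.Println(env.DBName, env.Metric, env.Data["instance_up"])
}
```

The key point is that the envelope's DBName is the source name itself, since no real database name is known yet; the review thread below discusses a consequence of exactly this.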

@pashagolub pashagolub self-assigned this Apr 24, 2026
@pashagolub pashagolub requested a review from 0xgouda April 24, 2026 12:27
@pashagolub pashagolub added the `sources` (What sources and in what way to monitor) label Apr 24, 2026

coveralls commented Apr 24, 2026

Coverage Report for CI Build 25120163997

Coverage decreased (-0.3%) to 83.065%

Details

  • Coverage decreased (-0.3%) from the base build.
  • Patch coverage: 23 uncovered changes across 2 files (6 of 29 lines covered, 20.69%).
  • 6 coverage regressions across 1 file.

Uncovered Changes

File | Changed | Covered | %
internal/sources/resolver.go | 21 | 0 | 0.0%
internal/reaper/reaper.go | 6 | 4 | 66.67%

Coverage Regressions

6 previously-covered lines in 1 file lost coverage.

File | Lines Losing Coverage | Coverage
internal/reaper/reaper.go | 6 | 35.53%

Coverage Stats

Coverage Status
Relevant Lines: 5338
Covered Lines: 4434
Line Coverage: 83.06%
Coverage Strength: 0.95 hits per line

💛 - Coveralls

@pashagolub pashagolub force-pushed the parallel-source-discovery branch from a3814f5 to a715672 Compare April 24, 2026 18:41

0xgouda commented Apr 27, 2026

Is this related to #1377?

pashagolub (author) replied:

> Is this related to #1377?

In a way. I couldn't reproduce that issue, but I found that misconfigured discovery sources could cause huge delays.

@0xgouda 0xgouda force-pushed the parallel-source-discovery branch from a715672 to 75ea9bd Compare April 28, 2026 00:40
Comment on lines +48 to +50
```go
if onError != nil {
	onError(srcs[i].Name)
}
```

0xgouda commented Apr 28, 2026


I have a concern here that can be reproduced with the following steps:

  1. Define a target that happens to be unreachable and is of the kind postgres-continuous-discovery.
  2. pgwatch writes instance_up=0 for the target for a while, with dbname = sourceName.
  3. The target becomes alive.
  4. pgwatch runs for a while and now writes the updated instance_up=1, but with a new dbname = sourceName + "_" + realDbname.
  5. The target goes down again, and its instance_up=0 is written with dbname = sourceName + "_" + realDbname.

So the full instance uptime history becomes a bit disconnected, split across different dbnames.

But generally, I think that's the best we can do, just wanted to note this behaviour.

pashagolub (author) replied:
Good catch! We could use `source` instead of `dbname`. That way we would know for sure at which point that happened.

0xgouda replied:
Can you explain more?

pashagolub commented Apr 28, 2026

```go
// WriteInstanceDown writes an instance_up = 0 metric to sinks for the given source
func (r *Reaper) WriteInstanceDown(md *sources.SourceConn) {
	r.measurementCh <- metrics.MeasurementEnvelope{
		DBName:     md.Name,
		MetricName: specialMetricInstanceUp,
		Data: metrics.Measurements{metrics.Measurement{
			metrics.EpochColumnName: time.Now().UnixNano(),
			"kind":                  string(md.Kind), // <-- proposed addition
			specialMetricInstanceUp: 0},
		},
	}
}
```

pashagolub (author) added:
This way Grafana could distinguish regular databases from all the others.

0xgouda replied:
I don't think adding the kind to the measurements will help.

What if we have two or more postgres-continuous-discovery/patroni sources? Grouping by kind would then be useless.

0xgouda commented Apr 29, 2026

And again, I guess what we have already is the best we can do, since we won't be able to add any instance identifiers (i.e., sys_id) to the measurements before connecting at least once.

But then, in Grafana, one might have two panels:

  • instance uptime history grouped by sys_id
  • instance uptime history for instances with an unknown sys_id

and they should complement each other to give the full uptime-history view.

@pashagolub pashagolub force-pushed the parallel-source-discovery branch from 75ea9bd to a47b7fb Compare April 29, 2026 15:39
@pashagolub pashagolub merged commit 1677cdb into master Apr 29, 2026
9 checks passed
@pashagolub pashagolub deleted the parallel-source-discovery branch April 29, 2026 16:20

3 participants