Sync flow: use metadata for last offset fetch and cancel groupCtx #3206
base: main
Conversation
Pull Request Overview
This PR refactors how the sync flow retrieves its last offset (using a metadata-only client rather than opening full DB connectors) and ensures the errgroup’s context is cancelled on sync errors so retries can proceed.
- Switch offset fetch to use the `external_metadata` client instead of opening source/destination connectors
- Wrap the errgroup in a cancellable context and call `cancelFunc()` when sync errors, to trigger retries
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| flow/activities/flowable_core.go | Replaced connector-based `GetLastOffset` calls with the metadata client |
| flow/activities/flowable.go | Introduced `context.WithCancel` for the errgroup and invoked cancellation on sync failure |
Comments suppressed due to low confidence (2)
flow/activities/flowable.go:387
- The `maintainReplConn` goroutine may not return on context cancellation, causing `errgroup.Wait()` to hang and preventing retries. Ensure `maintainReplConn` watches `groupCtx.Done()` and returns promptly when the context is cancelled.

  Referenced line: `cancelFunc()`
flow/activities/flowable_core.go:143
- [nitpick] Confirm whether the metadata client returned by `NewPostgresMetadataFromCatalog` holds resources that need explicit closing (e.g., a DB connection). If so, add a `defer pgMetadata.Close()` or equivalent to avoid resource leaks.

  Referenced line: `return pgMetadata.GetLastOffset(ctx, flowName)`
```go
	return dstConn.GetLastOffset(ctx, config.FlowJobName)
}
pgMetadata := connmetadata.NewPostgresMetadataFromCatalog(logger, a.CatalogPool)
```
This doesn't work correctly for pg-to-pg replication, where we store metadata on the destination peer instead of the catalog. I've brought up wanting to change this, but for now you need to check whether source & destination are both pg.

In most cases you can return `GetLastOffset` from `srcConn`, which'll usually do what you're doing here. The problem is that originally metadata was managed by the destination connector.
seems like we should be running the sync loop as another
…h unless Postgres to Postgres (#3367)

Decoupling from #3206 and addressing the comment there. This PR makes it so that we do not connect to the destination peer to get the last offset for sync flow.

Co-authored-by: Philip Dubé <philip.dube@clickhouse.com>
Consider the following scenario where you have a mirror to ClickHouse with 2 replicas:
In this case, the sync flow never retries: normalizeLoop is stuck normalizing batches, so it never gets the chance to see that the syncDone channel is closed (it is closed when the sync flow errors out, in this case because fetching the last offset errored).
And so we get stuck in the errgroup waiting on normalize to finish, which means we never retry the sync flow.
This PR cancels groupCtx when syncErr != nil, which unblocks:

i) maintainReplConn, which does not error out if groupCtx is cancelled
ii) normalizeLoop, which will exit anyway because of syncDone

The whole purpose of cancelling groupCtx when syncErr != nil is so that we hit the branch which leads to the syncflow activity retrying: