Ldz/add log0721 #1570

lidezhu · 2025-07-24T10:27:46Z

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

ti-chi-bot · 2025-07-24T10:27:49Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

_{📖 For more info, you can check the "Contribute Code" section in the development guide.}

ti-chi-bot · 2025-07-24T10:27:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Summary of Changes

Hello @lidezhu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability and monitoring capabilities within the log service. It achieves this by promoting key debug logs to info level, introducing new internal state tracking for subscriptions, adding specific warning logs for resolved timestamp (TS) lag, and implementing a new Prometheus metric to measure dispatcher notification duration. The changes aim to provide better insights into the health and performance of data synchronization, helping to quickly identify and diagnose potential bottlenecks.

Highlights

Enhanced Observability Logging: Many existing log.Debug statements across logservice/logpuller and logservice/eventstore components have been upgraded to log.Info level. This change makes critical events, such as region errors, region initialization, range lock operations, and subscription client activities, more visible in standard log outputs, improving overall system observability.
Improved Resolved Timestamp (TS) Monitoring: New fields (initialized, lastAdvanceTime, lastReceiveDMLTime) have been added to subscriptionStat in event_store.go to track the state and activity of subscriptions. Warning logs are now emitted when an initialized subscription's resolved TS lag exceeds 10 seconds, or when a large resolved TS advance step (>10s) is detected, providing proactive alerts for potential data synchronization delays.
New Performance Metric for Dispatchers: A new Prometheus histogram, EventStoreNotifyDispatcherDurationHist, has been introduced in pkg/metrics/event_store.go and integrated into event_store.go. This metric measures the duration it takes to notify dispatchers with resolved timestamps, offering valuable insights into the performance of the event delivery pipeline.
Extended RangeLock Information: The ResolvedTs() method in logservice/logpuller/regionlock/region_range_lock.go has been modified to return not only the minimum resolved timestamp but also the regionID associated with that timestamp. This provides more context for identifying specific regions contributing to resolved TS lag.
Dedicated Slow Span Logging: A new goroutine, logSlowSpan, has been added to logservice/logpuller/subscription_client.go. This routine periodically checks and logs warnings for any initialized spans where the resolved TS lag exceeds 10 seconds, complementing the existing slow region logging.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces additional logging and metrics to improve observability of the log service, particularly around resolved timestamp lag and subscription states. The changes are spread across eventstore, logpuller, and metrics packages.

While the new logging is valuable, several log messages have been elevated from Debug to Info in high-frequency code paths, which could lead to log flooding and performance issues. Additionally, there are several instances of magic numbers for thresholds and intervals that should be defined as constants. I've also identified a metric that is being calculated incorrectly.

My feedback focuses on addressing these maintainability and correctness issues to ensure the new additions are robust and don't negatively impact the system.

gemini-code-assist · 2025-07-24T10:30:00Z

logservice/eventstore/event_store.go

@@ -502,6 +514,7 @@ func (e *eventStore) RegisterDispatcher(
 				notifier(ts, subStat.maxEventCommitTs.Load())
 			}
 			CounterResolved.Inc()
+			metrics.EventStoreNotifyDispatcherDurationHist.Observe(float64(time.Since(start).Milliseconds()) / 1000)


The metric EventStoreNotifyDispatcherDurationHist is being updated with time.Since(start). The start variable is captured from the RegisterDispatcher function's scope, which means this will measure the time since the dispatcher was registered, not the duration of this specific notification. This will result in misleadingly large values for the metric.

The duration should be measured within this if block to accurately reflect the notification time. For example, by recording time.Now() at the beginning of the if block and using it to calculate the duration.

gemini-code-assist · 2025-07-24T10:30:01Z

logservice/eventstore/event_store.go

@@ -805,6 +818,22 @@ func (e *eventStore) updateMetricsOnce() {
 			resolvedTs := subStat.resolvedTs.Load()
 			resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
 			resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
+			if subStat.initialized.Load() && resolvedLag >= 10 {


The magic number 10 is used as a threshold for the resolved timestamp lag in seconds. To improve readability and maintainability, it's better to define this as a named constant at the package level. This value is used in multiple places in this pull request.

const largeResolvedTsLagInSecs = 10 if subStat.initialized.Load() && resolvedLag >= largeResolvedTsLagInSecs {

gemini-code-assist · 2025-07-24T10:30:01Z

logservice/logpuller/region_event_handler.go

 		// Generally, we don't want to send duplicate resolved ts,
 		// so we check whether `ts` is larger than `lastResolvedTs` before send it.
 		// but when `ts` == `lastResolvedTs` == `span.startTs`,
 		// the span may just be initialized and have not receive any resolved ts before,
 		// so we also send ts in this case for quick notification to downstream.
 		if ts > lastResolvedTs || (ts == lastResolvedTs && lastResolvedTs == span.startTs) {
+			resolvedPhyTs := oracle.ExtractPhysical(lastResolvedTs)
+			decreaseLag := float64(nextResolvedPhyTs-resolvedPhyTs) / 1e3
+			if decreaseLag > 10 {


The magic number 10 is used as a threshold for a large resolved timestamp advance step. This should be defined as a named constant to improve code clarity and maintainability, especially since this value is used elsewhere.

const largeResolvedTsAdvanceStepInSecs = 10 if decreaseLag > largeResolvedTsAdvanceStepInSecs {

gemini-code-assist · 2025-07-24T10:30:01Z

logservice/logpuller/subscription_client.go

+		log.Info("subscription client is going to load regions",
 			zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
 			zap.Any("span", nextSpan))


This log message "subscription client is going to load regions" is inside a loop. While it's before a network call, logging it at Info level might still be too verbose if many small ranges are processed. Debug level would be more suitable for this kind of "about to do X" message.

Suggested change

log.Info("subscription client is going to load regions",

zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),

zap.Any("span", nextSpan))

log.Debug("subscription client is going to load regions",

zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),

zap.Any("span", nextSpan))

gemini-code-assist · 2025-07-24T10:30:01Z

logservice/logpuller/subscription_client.go

+	ticker := time.NewTicker(10 * time.Second)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case <-ticker.C:
+		}
+
+		pdTime := s.pdClock.CurrentTime()
+		pdPhyTs := oracle.GetPhysical(pdTime)
+		s.totalSpans.RLock()
+		for subID, rt := range s.totalSpans.spanMap {
+			if !rt.initialized.Load() {
+				continue
+			}
+			resolvedTs := rt.rangeLock.GetHeapMinTs()
+			resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
+			resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
+			if resolvedLag > 10 {


This new function logSlowSpan uses a couple of magic numbers:

10 * time.Second for the ticker interval (line 959)

10 for the lag threshold in seconds (line 978)

These should be defined as named constants at the package level to improve readability and maintainability.

Suggested change

ticker := time.NewTicker(10 * time.Second)

defer ticker.Stop()

for {

select {

case <-ctx.Done():

return ctx.Err()

case <-ticker.C:

}

pdTime := s.pdClock.CurrentTime()

pdPhyTs := oracle.GetPhysical(pdTime)

s.totalSpans.RLock()

for subID, rt := range s.totalSpans.spanMap {

if !rt.initialized.Load() {

continue

}

resolvedTs := rt.rangeLock.GetHeapMinTs()

resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)

resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3

if resolvedLag > 10 {

const logSlowSpanTickerInterval = 10 * time.Second

const resolvedLagThresholdInSecs = 10

ticker := time.NewTicker(logSlowSpanTickerInterval)

if resolvedLag > resolvedLagThresholdInSecs {

gemini-code-assist · 2025-07-24T10:30:01Z

pkg/metrics/event_store.go

 	EventStoreWriteDurationHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
 		Namespace: "ticdc",
 		Subsystem: "event_store",
 		Name:      "write_duration",
 		Help:      "Bucketed histogram of event store write duration",
-		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10),
+		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,


The comment // 0.1ms ~ 52s, seems incorrect for the buckets defined by prometheus.ExponentialBuckets(0.00004, 2.0, 10). The buckets for this metric range from 40µs to approximately 20.48ms. The comment was likely misplaced from another metric definition and should be corrected or removed to avoid confusion.

Suggested change

Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,

Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10),

ti-chi-bot · 2025-07-24T11:00:43Z

@lidezhu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test	`f70ac1a`	link	true	`/test pull-unit-test`
pull-cdc-mysql-integration-light	`f70ac1a`	link	true	`/test pull-cdc-mysql-integration-light`
pull-cdc-mysql-integration-heavy	`f70ac1a`	link	true	`/test pull-cdc-mysql-integration-heavy`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

lidezhu added 12 commits July 21, 2025 17:05

add some log

577f342

add more log

4420252

more log

2e654b6

remove log

b45be21

add more log

6e8e884

add more log

ba33880

add more log

4723eb2

check

5edb527

fix build

ff850a6

add more log

2985c44

add more log

e78d731

add more log

f70ac1a

ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 24, 2025

ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 24, 2025

gemini-code-assist bot reviewed Jul 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Ldz/add log0721 #1570

Ldz/add log0721 #1570

lidezhu commented Jul 24, 2025

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

gemini-code-assist bot Jul 24, 2025

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

Uh oh!

	Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,
	Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10),

Ldz/add log0721 #1570

Are you sure you want to change the base?

Ldz/add log0721 #1570

Conversation

lidezhu commented Jul 24, 2025

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jul 24, 2025

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot bot commented Jul 24, 2025

Uh oh!

Uh oh!