Skip to content

Ldz/add log0721 #1570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Ldz/add log0721 #1570

wants to merge 12 commits into from

Conversation

lidezhu
Copy link
Collaborator

@lidezhu lidezhu commented Jul 24, 2025

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 24, 2025
Copy link

ti-chi-bot bot commented Jul 24, 2025

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

Copy link

ti-chi-bot bot commented Jul 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 24, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @lidezhu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability and monitoring capabilities within the log service. It achieves this by promoting key debug logs to info level, introducing new internal state tracking for subscriptions, adding specific warning logs for resolved timestamp (TS) lag, and implementing a new Prometheus metric to measure dispatcher notification duration. The changes aim to provide better insights into the health and performance of data synchronization, helping to quickly identify and diagnose potential bottlenecks.

Highlights

  • Enhanced Observability Logging: Many existing log.Debug statements across logservice/logpuller and logservice/eventstore components have been upgraded to log.Info level. This change makes critical events, such as region errors, region initialization, range lock operations, and subscription client activities, more visible in standard log outputs, improving overall system observability.
  • Improved Resolved Timestamp (TS) Monitoring: New fields (initialized, lastAdvanceTime, lastReceiveDMLTime) have been added to subscriptionStat in event_store.go to track the state and activity of subscriptions. Warning logs are now emitted when an initialized subscription's resolved TS lag exceeds 10 seconds, or when a large resolved TS advance step (>10s) is detected, providing proactive alerts for potential data synchronization delays.
  • New Performance Metric for Dispatchers: A new Prometheus histogram, EventStoreNotifyDispatcherDurationHist, has been introduced in pkg/metrics/event_store.go and integrated into event_store.go. This metric measures the duration it takes to notify dispatchers with resolved timestamps, offering valuable insights into the performance of the event delivery pipeline.
  • Extended RangeLock Information: The ResolvedTs() method in logservice/logpuller/regionlock/region_range_lock.go has been modified to return not only the minimum resolved timestamp but also the regionID associated with that timestamp. This provides more context for identifying specific regions contributing to resolved TS lag.
  • Dedicated Slow Span Logging: A new goroutine, logSlowSpan, has been added to logservice/logpuller/subscription_client.go. This routine periodically checks and logs warnings for any initialized spans where the resolved TS lag exceeds 10 seconds, complementing the existing slow region logging.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces additional logging and metrics to improve observability of the log service, particularly around resolved timestamp lag and subscription states. The changes are spread across eventstore, logpuller, and metrics packages.

While the new logging is valuable, several log messages have been elevated from Debug to Info in high-frequency code paths, which could lead to log flooding and performance issues. Additionally, there are several instances of magic numbers for thresholds and intervals that should be defined as constants. I've also identified a metric that is being calculated incorrectly.

My feedback focuses on addressing these maintainability and correctness issues to ensure the new additions are robust and don't negatively impact the system.

@@ -502,6 +514,7 @@ func (e *eventStore) RegisterDispatcher(
notifier(ts, subStat.maxEventCommitTs.Load())
}
CounterResolved.Inc()
metrics.EventStoreNotifyDispatcherDurationHist.Observe(float64(time.Since(start).Milliseconds()) / 1000)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The metric EventStoreNotifyDispatcherDurationHist is being updated with time.Since(start). The start variable is captured from the RegisterDispatcher function's scope, which means this will measure the time since the dispatcher was registered, not the duration of this specific notification. This will result in misleadingly large values for the metric.

The duration should be measured within this if block to accurately reflect the notification time. For example, by recording time.Now() at the beginning of the if block and using it to calculate the duration.

@@ -805,6 +818,22 @@ func (e *eventStore) updateMetricsOnce() {
resolvedTs := subStat.resolvedTs.Load()
resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
if subStat.initialized.Load() && resolvedLag >= 10 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The magic number 10 is used as a threshold for the resolved timestamp lag in seconds. To improve readability and maintainability, it's better to define this as a named constant at the package level. This value is used in multiple places in this pull request.

const largeResolvedTsLagInSecs = 10

            if subStat.initialized.Load() && resolvedLag >= largeResolvedTsLagInSecs {

// Generally, we don't want to send duplicate resolved ts,
// so we check whether `ts` is larger than `lastResolvedTs` before send it.
// but when `ts` == `lastResolvedTs` == `span.startTs`,
// the span may just be initialized and have not receive any resolved ts before,
// so we also send ts in this case for quick notification to downstream.
if ts > lastResolvedTs || (ts == lastResolvedTs && lastResolvedTs == span.startTs) {
resolvedPhyTs := oracle.ExtractPhysical(lastResolvedTs)
decreaseLag := float64(nextResolvedPhyTs-resolvedPhyTs) / 1e3
if decreaseLag > 10 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The magic number 10 is used as a threshold for a large resolved timestamp advance step. This should be defined as a named constant to improve code clarity and maintainability, especially since this value is used elsewhere.

const largeResolvedTsAdvanceStepInSecs = 10

            if decreaseLag > largeResolvedTsAdvanceStepInSecs {

Comment on lines +619 to 621
log.Info("subscription client is going to load regions",
zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
zap.Any("span", nextSpan))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This log message "subscription client is going to load regions" is inside a loop. While it's before a network call, logging it at Info level might still be too verbose if many small ranges are processed. Debug level would be more suitable for this kind of "about to do X" message.

Suggested change
log.Info("subscription client is going to load regions",
zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
zap.Any("span", nextSpan))
log.Debug("subscription client is going to load regions",
zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
zap.Any("span", nextSpan))

Comment on lines +959 to +978
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
}

pdTime := s.pdClock.CurrentTime()
pdPhyTs := oracle.GetPhysical(pdTime)
s.totalSpans.RLock()
for subID, rt := range s.totalSpans.spanMap {
if !rt.initialized.Load() {
continue
}
resolvedTs := rt.rangeLock.GetHeapMinTs()
resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
if resolvedLag > 10 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This new function logSlowSpan uses a couple of magic numbers:

  • 10 * time.Second for the ticker interval (line 959)
  • 10 for the lag threshold in seconds (line 978)

These should be defined as named constants at the package level to improve readability and maintainability.

Suggested change
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
}
pdTime := s.pdClock.CurrentTime()
pdPhyTs := oracle.GetPhysical(pdTime)
s.totalSpans.RLock()
for subID, rt := range s.totalSpans.spanMap {
if !rt.initialized.Load() {
continue
}
resolvedTs := rt.rangeLock.GetHeapMinTs()
resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
if resolvedLag > 10 {
const logSlowSpanTickerInterval = 10 * time.Second
const resolvedLagThresholdInSecs = 10
ticker := time.NewTicker(logSlowSpanTickerInterval)
if resolvedLag > resolvedLagThresholdInSecs {

EventStoreWriteDurationHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
Namespace: "ticdc",
Subsystem: "event_store",
Name: "write_duration",
Help: "Bucketed histogram of event store write duration",
Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10),
Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment // 0.1ms ~ 52s, seems incorrect for the buckets defined by prometheus.ExponentialBuckets(0.00004, 2.0, 10). The buckets for this metric range from 40µs to approximately 20.48ms. The comment was likely misplaced from another metric definition and should be corrected or removed to avoid confusion.

Suggested change
Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,
Buckets: prometheus.ExponentialBuckets(0.00004, 2.0, 10),

Copy link

ti-chi-bot bot commented Jul 24, 2025

@lidezhu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test f70ac1a link true /test pull-unit-test
pull-cdc-mysql-integration-light f70ac1a link true /test pull-cdc-mysql-integration-light
pull-cdc-mysql-integration-heavy f70ac1a link true /test pull-cdc-mysql-integration-heavy

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant