Ldz/add log0721 #1570
base: master
Conversation
[FORMAT CHECKER NOTIFICATION] Notice: … 📖 For more info, you can check the "Contribute Code" section in the development guide.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: … The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Summary of Changes
Hello @lidezhu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the observability and monitoring capabilities within the log service. It achieves this by promoting key debug logs to info level, introducing new internal state tracking for subscriptions, adding specific warning logs for resolved timestamp (TS) lag, and implementing a new Prometheus metric to measure dispatcher notification duration. The changes aim to provide better insights into the health and performance of data synchronization, helping to quickly identify and diagnose potential bottlenecks.
Highlights
- Enhanced Observability Logging: Many existing `log.Debug` statements across the `logservice/logpuller` and `logservice/eventstore` components have been upgraded to `log.Info` level. This change makes critical events, such as region errors, region initialization, range lock operations, and subscription client activities, more visible in standard log outputs, improving overall system observability.
- Improved Resolved Timestamp (TS) Monitoring: New fields (`initialized`, `lastAdvanceTime`, `lastReceiveDMLTime`) have been added to `subscriptionStat` in `event_store.go` to track the state and activity of subscriptions. Warning logs are now emitted when an initialized subscription's resolved TS lag exceeds 10 seconds, or when a large resolved TS advance step (>10s) is detected, providing proactive alerts for potential data synchronization delays (see the sketch after this list).
- New Performance Metric for Dispatchers: A new Prometheus histogram, `EventStoreNotifyDispatcherDurationHist`, has been introduced in `pkg/metrics/event_store.go` and integrated into `event_store.go`. This metric measures the duration it takes to notify dispatchers with resolved timestamps, offering valuable insights into the performance of the event delivery pipeline.
- Extended RangeLock Information: The `ResolvedTs()` method in `logservice/logpuller/regionlock/region_range_lock.go` has been modified to return not only the minimum resolved timestamp but also the `regionID` associated with that timestamp. This provides more context for identifying specific regions contributing to resolved TS lag.
- Dedicated Slow Span Logging: A new goroutine, `logSlowSpan`, has been added to `logservice/logpuller/subscription_client.go`. This routine periodically checks and logs warnings for any initialized spans where the resolved TS lag exceeds 10 seconds, complementing the existing slow region logging.
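To make the resolved-TS monitoring highlight concrete, here is a minimal sketch of how the new `subscriptionStat` fields and the lag warning could fit together. Only the field names, the file (`event_store.go`), and the 10-second threshold come from this PR; the types, imports, and the helper function below are assumptions for illustration.

```go
package eventstore

import (
	"sync/atomic"

	"github.com/pingcap/log"
	"github.com/tikv/client-go/v2/oracle"
	"go.uber.org/zap"
)

// subscriptionStat field names follow the PR summary; the types and layout
// here are assumptions, not the actual struct definition.
type subscriptionStat struct {
	resolvedTs         atomic.Uint64
	initialized        atomic.Bool  // set once the subscription finishes initialization
	lastAdvanceTime    atomic.Int64 // unix ms of the last resolved-ts advance
	lastReceiveDMLTime atomic.Int64 // unix ms of the last received DML event
}

// The 10-second threshold comes from the PR; the constant name matches the
// reviewer's later suggestion.
const largeResolvedTsLagInSecs = 10

// maybeWarnResolvedTsLag is a hypothetical helper: warn when an initialized
// subscription's resolved TS lags PD's physical time by at least 10 seconds.
func maybeWarnResolvedTsLag(stat *subscriptionStat, pdPhyTs int64) {
	resolvedPhyTs := oracle.ExtractPhysical(stat.resolvedTs.Load())
	resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
	if stat.initialized.Load() && resolvedLag >= largeResolvedTsLagInSecs {
		log.Warn("resolved ts lag is too large",
			zap.Float64("resolvedLagInSecs", resolvedLag))
	}
}
```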
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces additional logging and metrics to improve observability of the log service, particularly around resolved timestamp lag and subscription states. The changes are spread across the `eventstore`, `logpuller`, and `metrics` packages.

While the new logging is valuable, several log messages have been elevated from `Debug` to `Info` in high-frequency code paths, which could lead to log flooding and performance issues. Additionally, there are several instances of magic numbers for thresholds and intervals that should be defined as constants. I've also identified a metric that is being calculated incorrectly.

My feedback focuses on addressing these maintainability and correctness issues to ensure the new additions are robust and don't negatively impact the system.
```diff
@@ -502,6 +514,7 @@ func (e *eventStore) RegisterDispatcher(
 				notifier(ts, subStat.maxEventCommitTs.Load())
 			}
 			CounterResolved.Inc()
+			metrics.EventStoreNotifyDispatcherDurationHist.Observe(float64(time.Since(start).Milliseconds()) / 1000)
```
The metric `EventStoreNotifyDispatcherDurationHist` is being updated with `time.Since(start)`. The `start` variable is captured from the `RegisterDispatcher` function's scope, which means this will measure the time since the dispatcher was registered, not the duration of this specific notification. This will result in misleadingly large values for the metric.

The duration should be measured within this `if` block to accurately reflect the notification time, for example by recording `time.Now()` at the beginning of the `if` block and using it to calculate the duration.
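As a standalone illustration of that fix (the metric options, function shape, and names below are assumptions for this sketch, not the PR's actual code), the start time can be captured right around the notification call:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// A stand-in for EventStoreNotifyDispatcherDurationHist; the opts and buckets
// here are placeholders, not the PR's actual definition.
var notifyDispatcherDurationHist = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "ticdc",
	Subsystem: "event_store",
	Name:      "notify_dispatcher_duration",
	Help:      "Bucketed histogram of dispatcher notification duration",
	Buckets:   prometheus.ExponentialBuckets(0.0001, 2.0, 16),
})

func notifyDispatcher(notifier func(resolvedTs uint64), ts uint64) {
	// Start the clock immediately before the notification (not when the
	// dispatcher was registered) so the histogram reflects only this call.
	notifyStart := time.Now()
	notifier(ts)
	notifyDispatcherDurationHist.Observe(time.Since(notifyStart).Seconds())
}

func main() {
	notifyDispatcher(func(ts uint64) { time.Sleep(5 * time.Millisecond) }, 42)
	fmt.Println("notified once")
}
```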
```diff
@@ -805,6 +818,22 @@ func (e *eventStore) updateMetricsOnce() {
 		resolvedTs := subStat.resolvedTs.Load()
 		resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
 		resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
+		if subStat.initialized.Load() && resolvedLag >= 10 {
```
The magic number `10` is used as a threshold for the resolved timestamp lag in seconds. To improve readability and maintainability, it's better to define this as a named constant at the package level. This value is used in multiple places in this pull request.

```go
const largeResolvedTsLagInSecs = 10

if subStat.initialized.Load() && resolvedLag >= largeResolvedTsLagInSecs {
```
```go
// Generally, we don't want to send duplicate resolved ts,
// so we check whether `ts` is larger than `lastResolvedTs` before sending it,
// but when `ts` == `lastResolvedTs` == `span.startTs`,
// the span may have just been initialized and not received any resolved ts before,
// so we also send ts in this case for quick notification to downstream.
if ts > lastResolvedTs || (ts == lastResolvedTs && lastResolvedTs == span.startTs) {
	resolvedPhyTs := oracle.ExtractPhysical(lastResolvedTs)
	decreaseLag := float64(nextResolvedPhyTs-resolvedPhyTs) / 1e3
	if decreaseLag > 10 {
```
The magic number `10` is used as a threshold for a large resolved timestamp advance step. This should be defined as a named constant to improve code clarity and maintainability, especially since this value is used elsewhere.

```go
const largeResolvedTsAdvanceStepInSecs = 10

if decreaseLag > largeResolvedTsAdvanceStepInSecs {
```
log.Info("subscription client is going to load regions", | ||
zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)), | ||
zap.Any("span", nextSpan)) |
This log message "subscription client is going to load regions" is inside a loop. While it's emitted before a network call, logging it at `Info` level might still be too verbose if many small ranges are processed. `Debug` level would be more suitable for this kind of "about to do X" message.

```diff
-log.Info("subscription client is going to load regions",
-	zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
-	zap.Any("span", nextSpan))
+log.Debug("subscription client is going to load regions",
+	zap.Uint64("subscriptionID", uint64(subscribedSpan.subID)),
+	zap.Any("span", nextSpan))
```
```go
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()
for {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ticker.C:
	}

	pdTime := s.pdClock.CurrentTime()
	pdPhyTs := oracle.GetPhysical(pdTime)
	s.totalSpans.RLock()
	for subID, rt := range s.totalSpans.spanMap {
		if !rt.initialized.Load() {
			continue
		}
		resolvedTs := rt.rangeLock.GetHeapMinTs()
		resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
		resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
		if resolvedLag > 10 {
```
This new function `logSlowSpan` uses a couple of magic numbers:

- `10 * time.Second` for the ticker interval (line 959)
- `10` for the lag threshold in seconds (line 978)

These should be defined as named constants at the package level to improve readability and maintainability.
```diff
-	ticker := time.NewTicker(10 * time.Second)
+	const logSlowSpanTickerInterval = 10 * time.Second
+	const resolvedLagThresholdInSecs = 10
+
+	ticker := time.NewTicker(logSlowSpanTickerInterval)
 	defer ticker.Stop()
 	for {
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
 		case <-ticker.C:
 		}
 		pdTime := s.pdClock.CurrentTime()
 		pdPhyTs := oracle.GetPhysical(pdTime)
 		s.totalSpans.RLock()
 		for subID, rt := range s.totalSpans.spanMap {
 			if !rt.initialized.Load() {
 				continue
 			}
 			resolvedTs := rt.rangeLock.GetHeapMinTs()
 			resolvedPhyTs := oracle.ExtractPhysical(resolvedTs)
 			resolvedLag := float64(pdPhyTs-resolvedPhyTs) / 1e3
-			if resolvedLag > 10 {
+			if resolvedLag > resolvedLagThresholdInSecs {
```
```diff
 	EventStoreWriteDurationHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
 		Namespace: "ticdc",
 		Subsystem: "event_store",
 		Name:      "write_duration",
 		Help:      "Bucketed histogram of event store write duration",
-		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10),
+		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,
```
The comment `// 0.1ms ~ 52s,` seems incorrect for the buckets defined by `prometheus.ExponentialBuckets(0.00004, 2.0, 10)`. The buckets for this metric range from 40µs to approximately 20.48ms. The comment was likely misplaced from another metric definition and should be corrected or removed to avoid confusion.

```diff
-		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10), // 0.1ms ~ 52s,
+		Buckets:   prometheus.ExponentialBuckets(0.00004, 2.0, 10),
```
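For reference, a quick standalone check (not part of the PR) of what those buckets actually cover:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// ExponentialBuckets(0.00004, 2.0, 10) yields 10 buckets starting at 40µs
	// and doubling each step, so the largest boundary is 0.00004*2^9 ≈ 20.48ms.
	buckets := prometheus.ExponentialBuckets(0.00004, 2.0, 10)
	fmt.Println(buckets)
	// Prints roughly: [4e-05 8e-05 0.00016 ... 0.01024 0.02048]
}
```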
@lidezhu: The following tests failed; say `/retest` to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note