-
Notifications
You must be signed in to change notification settings - Fork 9
chore: migrate metrics to OpenMetrics standard #281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #281 +/- ##
==========================================
- Coverage 71.06% 70.94% -0.13%
==========================================
Files 201 201
Lines 14773 14773
==========================================
- Hits 10498 10480 -18
- Misses 3527 3541 +14
- Partials 748 752 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
dkharms
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ![]()
📝 WalkthroughWalkthroughThis PR standardizes Prometheus metrics across 13 files to align with OpenMetrics naming conventions. Changes include renaming metric names to use standardized suffixes (e.g., Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Areas requiring attention:
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (3)
proxy/bulk/metrics.go (1)
22-28: Histogram Help is OK, but consider clarifying “per bulk” if that’s the intent.
Right now “Number of successfully stored docs” could be read as a global total (counter), while a histogram is usually “docs written per bulk/request”.metric/metric.go (1)
25-30:counters_totalHelp is very generic.
If feasible, mention whatnamelabel values represent (event types) to make the metric self-documenting.asyncsearcher/async_searcher.go (1)
46-70: Metrics renames/Help look consistent; tiny wording nit in read_only Help (“can not” → “cannot”).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
asyncsearcher/async_searcher.go(1 hunks)bytespool/bytespool.go(1 hunks)frac/active.go(1 hunks)frac/fraction.go(2 hunks)frac/processor/search_stats.go(1 hunks)fracmanager/metrics.go(4 hunks)indexer/metrics.go(2 hunks)metric/ingestor.go(5 hunks)metric/metric.go(1 hunks)metric/store.go(1 hunks)proxy/bulk/metrics.go(2 hunks)proxyapi/http_bulk.go(1 hunks)storeapi/grpc_bulk.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
proxyapi/http_bulk.go (1)
metric/metric.go (1)
SecondsBuckets(33-33)
frac/fraction.go (1)
metric/metric.go (1)
SecondsBuckets(33-33)
metric/store.go (1)
metric/metric.go (2)
SecondsBuckets(33-33)SecondsRanges(37-37)
metric/ingestor.go (1)
metric/metric.go (1)
SecondsBuckets(33-33)
🔇 Additional comments (9)
frac/active.go (1)
202-208: Help text addition forseq_db_store_bulk_stages_secondslooks good.indexer/metrics.go (1)
11-17: Metric rename + Help clarifications look consistent with OpenMetrics conventions.Also applies to: 26-31, 33-39
frac/processor/search_stats.go (1)
17-51: Histogram renames + Help text additions are clear and appropriately scoped.frac/fraction.go (1)
29-57: Seconds-suffix renames + Help additions look good; metric selection updated cleanly.Also applies to: 60-70
proxyapi/http_bulk.go (1)
29-55: Bulk metrics Help additions +_totalcounter rename look aligned with OpenMetrics guidance.bytespool/bytespool.go (1)
34-59: Bytes pool metric Help texts are clear and accurately describe semantics.metric/ingestor.go (2)
9-45: Search metrics: naming/Help looks aligned (seq_db_ingestor_search_total, etc.).
188-194: Verifyincoming_text_len_bytesis actually recording bytes (not runes/characters).
Since you renamed to_bytes, ensure the observed value useslen([]byte(s))semantics consistently.metric/metric.go (1)
11-17: Changeversion_info_totalto a Gauge namedversion_info.Prometheus best practice for version/build info is a Gauge with value=1 and labels for version/build fields (e.g.,
seq_db_version_info{version="..."}= 1). A Counter namedversion_info_totalmixes semantics: "info" suggests static metadata, while "_total" and Counter type suggest a cumulative count. This breaks standard dashboard queries and recording rules that expect the*_infogauge pattern.Refactor to use
prometheus.NewGaugeVecwith metric nameversion_infoand set value to 1 ininit()instead of incrementing.
| cacheMissTotal = promauto.NewCounterVec(prometheus.CounterOpts{ | ||
| Namespace: "seq_db_store", | ||
| Subsystem: "cache", | ||
| Name: "miss_total", | ||
| Help: "Number of cache misses by layer", | ||
| }, []string{"layer"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent counter name: miss_total should likely be misses_total.
This will be a breaking rename anyway; worth aligning with hits_total.
🤖 Prompt for AI Agents
In fracmanager/metrics.go around lines 54 to 59, the Prometheus counter is named
"miss_total" which is inconsistent with "hits_total" and Prometheus conventions;
change the metric Name to "misses_total" (and optionally update the Help text to
"Number of cache misses by layer") and then update any code, tests, dashboards,
alerts and documentation that reference the old "miss_total" metric to the new
"misses_total" name to avoid breakage.
| cacheSizeOccupied = promauto.NewCounterVec(prometheus.CounterOpts{ | ||
| Namespace: "seq_db_store", | ||
| Subsystem: "cache", | ||
| Name: "miss_size_total", | ||
| Name: "miss_size_bytes_total", | ||
| Help: "Total size occupied in cache by layer in bytes", | ||
| }, []string{"layer"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type/semantics mismatch: miss_size_bytes_total (Counter) vs Help implying current occupancy.
If this is “current bytes occupied”, it should be a Gauge (e.g., size_bytes / occupied_bytes). If it’s “bytes added due to misses over time”, keep Counter but update Help to “Total bytes …”.
🤖 Prompt for AI Agents
In fracmanager/metrics.go around lines 90 to 95, the metric is declared as a
Counter named miss_size_bytes_total but the Help text describes current cache
occupancy; decide the intended semantics and fix: if this should represent
current bytes occupied per layer, change the metric to a Gauge (rename to
something like size_bytes or occupied_bytes) and update Name/Help accordingly;
if it should represent cumulative bytes added due to misses, keep a Counter but
update the Help string to clearly state it is the total bytes added due to cache
misses (e.g., "Total bytes added to cache due to misses, by layer").
| BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{ | ||
| Namespace: "seq_db_store", | ||
| Subsystem: "bulk", | ||
| Name: "meta_bytes_total", | ||
| Help: "", | ||
| Name: "meta_bytes", | ||
| Help: "Byte size of metadata in bulk requests in bytes", | ||
| Buckets: prometheus.ExponentialBuckets(1, 4, 16), | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tighten a couple Help strings (duplication / phrasing).
BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{
Namespace: "seq_db_store",
Subsystem: "bulk",
Name: "meta_bytes",
- Help: "Byte size of metadata in bulk requests in bytes",
+ Help: "Byte size of metadata in bulk requests",
Buckets: prometheus.ExponentialBuckets(1, 4, 16),
})
SearchDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
Namespace: "seq_db_store",
Subsystem: "search",
Name: "duration_seconds",
- Help: "Search request duration time (only successful searches)",
+ Help: "Search request duration (only successful searches)",
Buckets: SecondsBuckets,
})Also applies to: 56-62
🤖 Prompt for AI Agents
In metric/store.go around lines 37-43 (and similarly lines 56-62), the Help
strings are wordy and duplicate units; tighten them to concise, consistent
phrases: replace "Byte size of metadata in bulk requests in bytes" with
something like "Size of metadata per bulk request (bytes)" and likewise change
the other Help value to "Size of payload per bulk request (bytes)"; ensure both
use consistent capitalization and unit notation.
| inflightBulks = promauto.NewGauge(prometheus.GaugeOpts{ | ||
| Namespace: "seq_db_ingestor", | ||
| Subsystem: "bulk", | ||
| Name: "in_flight_queries_total", | ||
| Help: "", | ||
| Name: "in_flight_queries", | ||
| Help: "Number of bulks that are being processed by ingestor currently", | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metric name/Help mismatch: in_flight_queries tracks bulks, not queries.
Either rename to something like in_flight_bulks / in_flight_requests, or adjust Help to match “queries” if that’s truly what’s being counted.
🤖 Prompt for AI Agents
In proxy/bulk/metrics.go around lines 9 to 14 the metric Name
"in_flight_queries" conflicts with the Help which says it counts bulks; change
the metric to consistently represent bulks: rename Name to "in_flight_bulks" (or
"in_flight_requests" if preferred), update Help to "Number of bulks that are
being processed by ingestor currently", and update any code, tests, dashboards,
or alerting rules that reference the old metric name to use the new name.
| var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{ | ||
| Namespace: "seq_db_store", | ||
| Subsystem: "bulk", | ||
| Name: "in_flight_queries_total", | ||
| Name: "in_flight_queries", | ||
| Help: "Number of bulk request being currently processed by store", | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix Help grammar (and optionally align wording across components).
Current: “Number of bulk request being currently processed by store” → suggest:
var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{
Namespace: "seq_db_store",
Subsystem: "bulk",
Name: "in_flight_queries",
- Help: "Number of bulk request being currently processed by store",
+ Help: "Number of bulk requests currently being processed by the store",
})🤖 Prompt for AI Agents
In storeapi/grpc_bulk.go around lines 20 to 25, the metric Help string grammar
is incorrect; update the Help value to a grammatically correct and consistent
phrasing such as "Number of bulk requests currently being processed by the
store" (or another aligned wording used across components), ensuring
pluralization, article usage, and clarity match other metrics.
|
nit:
+1 Is .github/CONTRIBUTING.md the right place for this? |
|
Added a new section "General Coding Guidelines" to |
frac/processor/search_stats.go
Outdated
| Namespace: metricsNamespace, | ||
| Subsystem: metricsSubsystem, | ||
| Name: "sources_total", | ||
| Name: "sources", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're renaming it anyway, let's pick a more descriptive name connected with lids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from now on it's lids_loaded
frac/fraction.go
Outdated
| Namespace: "seq_db_store", | ||
| Subsystem: "search", | ||
| Name: "tracer_fraction_agg_search_sec", | ||
| Name: "tracer_fraction_agg_search_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets remove "tracer_" prefix? Here and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
metric/metric.go
Outdated
| Namespace: "seq_db", | ||
| Name: "version", | ||
| Help: "", | ||
| Name: "version_info_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might be using a counter here unnecessarily? Maybe we should have used a gauge instead?
Or rename the metric to version_starts_total.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thing is, I'd avoid changing type since we need to maintain both old and new in a single dashboard. I'm not sure if we can do that. Maybe in a different PR
metric/metric.go
Outdated
| Name: "version", | ||
| Help: "", | ||
| Name: "version_info_total", | ||
| Help: "Number of seq-db instances with a particular version", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And accordingly, the description is something like: always 1 for the current version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated a desc
metric/store.go
Outdated
| Name: "meta_bytes_total", | ||
| Help: "", | ||
| Name: "meta_bytes", | ||
| Help: "Byte size of metadata in bulk requests in bytes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proxy/bulk/metrics.go
Outdated
| Subsystem: "bulk", | ||
| Name: "in_flight_queries_total", | ||
| Help: "", | ||
| Name: "in_flight_queries", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with rabbit here as well. And probably we can call it just in_flight.
storeapi/grpc_bulk.go
Outdated
| Namespace: "seq_db_store", | ||
| Subsystem: "bulk", | ||
| Name: "in_flight_queries_total", | ||
| Name: "in_flight_queries", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can just call it in_flight (since we're changing it anyway)
and code rabbit also suggested a fix in Help

Description
Migrated metrics to OpenMetrics standard. I will list the primary rules of the standard and what has been changed.
_totalsuffix in their name. I added_totalsuffix to all counters where it's necessary._total,_count,_created,_sum,_bucket. I removed_totalsuffix from all gauge/hist metrics. There was a single metric with_countsuffix, which was also renamed.timestamp_secondsare units which are recommended to use. There is prometheus naming best practices: https://prometheus.io/docs/practices/naming/secinstead ofseconds. I changed_secto_secondswhenever I could.tracer_fraction_reg_search_sec->tracer_fraction_regular_search_secondsin the metrics names, but cases likeagg,docsandmetaare used commonly and could be understood easily. I left them as it is. I also unifieddocumentsanddocsin metric names - nowdocsterm is used everywhere.histname in them, but it's acceptable since they do not describe metric type but rather an operation type (search with hist param).leиquantileare forbidden names for labels. We didn't have any of them.I also found the following cases:
TokenizerTokensPerMessage,TokenizerParseDurationSecondsare not used currentlyskippedIndexes,skippedIndexesBytes(used for tokenization) are currently located in metric/store.go, i.e. place where store metrics. perhaps we could move them toingestorortokenizermetrics.go fileskip_shard_total- seems like an inconsistent name since we choose a shard for storing but loop through replicas and skip a replica if it was already chosen for storing the current bulk.Suggestion: we should have a development guidlines file where we could store rules like "counters must end with _total" and "metrics should never end with _total"
Fixes #280
If you have used LLM/AI assistance please provide model name and full prompt:
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.