chore: migrate metrics to OpenMetrics standard #281

cheb0 · 2025-12-01T08:32:48Z

Description

Migrated metrics to OpenMetrics standard. I will list the primary rules of the standard and what has been changed.

Counter metrics must have _total suffix in their name. I added _total suffix to all counters where it's necessary.
All other metrics (except for counters) must not have any of the following suffixes: _total, _count, _created, _sum, _bucket. I removed _total suffix from all gauge/hist metrics. There was a single metric with _count suffix, which was also renamed.
It's recommended to have unit types at the end of the metric if appropriate: seconds, bytes, fractions, etc. I added _seconds or _bytes where appropriate. For oldest entry timestamp_seconds are units which are recommended to use. There is prometheus naming best practices: https://prometheus.io/docs/practices/naming/
It's recomended to avoid short names and abbreviations for unit types like sec instead of seconds. I changed _sec to _seconds whenever I could.
It's recommended to avoid short terms if they are not commonly used. I replaced a few places like tracer_fraction_reg_search_sec -> tracer_fraction_regular_search_seconds in the metrics names, but cases like agg, docs and meta are used commonly and could be understood easily. I left them as it is. I also unified documents and docs in metric names - now docs term is used everywhere.
It's not recommended to have metric type in the name. We do not have such metrics. Some metrics have hist name in them, but it's acceptable since they do not describe metric type but rather an operation type (search with hist param).
It's recommended to have help for each metric. I added help for all metrics, let's review if help makes sense.
snake_case is recommended for metric names and labels. Already done.
le и quantile are forbidden names for labels. We didn't have any of them.

I also found the following cases:

metrics TokenizerTokensPerMessage, TokenizerParseDurationSeconds are not used currently
skippedIndexes, skippedIndexesBytes (used for tokenization) are currently located in metric/store.go, i.e. place where store metrics. perhaps we could move them to ingestor or tokenizer metrics.go file
skip_shard_total - seems like an inconsistent name since we choose a shard for storing but loop through replicas and skip a replica if it was already chosen for storing the current bulk.

Suggestion: we should have a development guidlines file where we could store rules like "counters must end with _total" and "metrics should never end with _total"

Fixes #280

I have read and followed all requirements in CONTRIBUTING.md;
I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

Summary by CodeRabbit

Documentation
- Enhanced Prometheus metrics documentation with clearer help text across multiple modules, improving observability and monitoring clarity.
- Standardized metric naming conventions with more explicit suffixes (_total, _bytes, _seconds) for better metric identification and consistency in monitoring systems.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

codecov-commenter · 2025-12-01T08:36:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.94%. Comparing base (4c86707) to head (5d45557).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
- Coverage   71.06%   70.94%   -0.13%     
==========================================
  Files         201      201              
  Lines       14773    14773              
==========================================
- Hits        10498    10480      -18     
- Misses       3527     3541      +14     
- Partials      748      752       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

dkharms · 2025-12-12T08:00:13Z

@coderabbitai review

coderabbitai · 2025-12-12T08:00:20Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

dkharms

LGTM

coderabbitai · 2025-12-12T08:03:21Z

📝 Walkthrough

Walkthrough

This PR standardizes Prometheus metrics across 13 files to align with OpenMetrics naming conventions. Changes include renaming metric names to use standardized suffixes (e.g., _seconds, _bytes_total, _total) and adding descriptive Help text to previously undocumented metrics. No functional logic or control flow is modified; only metric metadata and identifiers are updated.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Areas requiring attention:

Verify metric name changes follow consistent OpenMetrics conventions across files (e.g., _total vs _bytes_total vs _seconds suffixes applied correctly)
Spot-check Help text accuracy and consistency in tone/clarity, particularly in files with many renamed metrics (metric/ingestor.go, fracmanager/metrics.go)
Confirm that metric name changes don't break external dependencies or dashboards that reference the old names
Verify renamed metrics like in_flight_queries_total → in_flight_queries are intentional and consistent with standard conventions

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main change—migrating metrics to OpenMetrics standard—matching the objectives and scope of changes.
Linked Issues check	✅ Passed	Code changes comprehensively implement OpenMetrics migration: counter suffixes added, help text added, unit suffixes applied, and naming conventions standardized across all modified files.
Out of Scope Changes check	✅ Passed	All changes are scoped to metric naming, help text, and unit suffix alignment per OpenMetrics standard; no unrelated code modifications or functional changes detected.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

proxy/bulk/metrics.go (1)

22-28: Histogram Help is OK, but consider clarifying “per bulk” if that’s the intent.
Right now “Number of successfully stored docs” could be read as a global total (counter), while a histogram is usually “docs written per bulk/request”.

metric/metric.go (1)

25-30: counters_total Help is very generic.
If feasible, mention what name label values represent (event types) to make the metric self-documenting.

asyncsearcher/async_searcher.go (1)

46-70: Metrics renames/Help look consistent; tiny wording nit in read_only Help (“can not” → “cannot”).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfd50b6 and 4cf6a04.

📒 Files selected for processing (13)

asyncsearcher/async_searcher.go (1 hunks)
bytespool/bytespool.go (1 hunks)
frac/active.go (1 hunks)
frac/fraction.go (2 hunks)
frac/processor/search_stats.go (1 hunks)
fracmanager/metrics.go (4 hunks)
indexer/metrics.go (2 hunks)
metric/ingestor.go (5 hunks)
metric/metric.go (1 hunks)
metric/store.go (1 hunks)
proxy/bulk/metrics.go (2 hunks)
proxyapi/http_bulk.go (1 hunks)
storeapi/grpc_bulk.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (4)

proxyapi/http_bulk.go (1)

metric/metric.go (1)

SecondsBuckets (33-33)

frac/fraction.go (1)

metric/metric.go (1)

SecondsBuckets (33-33)

metric/store.go (1)

metric/metric.go (2)

SecondsBuckets (33-33)

SecondsRanges (37-37)

metric/ingestor.go (1)

metric/metric.go (1)

SecondsBuckets (33-33)

🔇 Additional comments (9)

frac/active.go (1)

202-208: Help text addition for seq_db_store_bulk_stages_seconds looks good.

indexer/metrics.go (1)

11-17: Metric rename + Help clarifications look consistent with OpenMetrics conventions.

Also applies to: 26-31, 33-39

frac/processor/search_stats.go (1)

17-51: Histogram renames + Help text additions are clear and appropriately scoped.

frac/fraction.go (1)

29-57: Seconds-suffix renames + Help additions look good; metric selection updated cleanly.

Also applies to: 60-70

proxyapi/http_bulk.go (1)

29-55: Bulk metrics Help additions + _total counter rename look aligned with OpenMetrics guidance.

bytespool/bytespool.go (1)

34-59: Bytes pool metric Help texts are clear and accurately describe semantics.

metric/ingestor.go (2)

9-45: Search metrics: naming/Help looks aligned (seq_db_ingestor_search_total, etc.).

188-194: Verify incoming_text_len_bytes is actually recording bytes (not runes/characters).
Since you renamed to _bytes, ensure the observed value uses len([]byte(s)) semantics consistently.

metric/metric.go (1)

11-17: Change version_info_total to a Gauge named version_info.

Prometheus best practice for version/build info is a Gauge with value=1 and labels for version/build fields (e.g., seq_db_version_info{version="..."} = 1). A Counter named version_info_total mixes semantics: "info" suggests static metadata, while "_total" and Counter type suggest a cumulative count. This breaks standard dashboard queries and recording rules that expect the *_info gauge pattern.

Refactor to use prometheus.NewGaugeVec with metric name version_info and set value to 1 in init() instead of incrementing.

coderabbitai · 2025-12-12T08:03:24Z

fracmanager/metrics.go

 	cacheMissTotal = promauto.NewCounterVec(prometheus.CounterOpts{
 		Namespace: "seq_db_store",
 		Subsystem: "cache",
 		Name:      "miss_total",
+		Help:      "Number of cache misses by layer",
 	}, []string{"layer"})


⚠️ Potential issue | 🟡 Minor

Inconsistent counter name: miss_total should likely be misses_total.
This will be a breaking rename anyway; worth aligning with hits_total.

🤖 Prompt for AI Agents

In fracmanager/metrics.go around lines 54 to 59, the Prometheus counter is named "miss_total" which is inconsistent with "hits_total" and Prometheus conventions; change the metric Name to "misses_total" (and optionally update the Help text to "Number of cache misses by layer") and then update any code, tests, dashboards, alerts and documentation that reference the old "miss_total" metric to the new "misses_total" name to avoid breakage.

coderabbitai · 2025-12-12T08:03:24Z

fracmanager/metrics.go

 	cacheSizeOccupied = promauto.NewCounterVec(prometheus.CounterOpts{
 		Namespace: "seq_db_store",
 		Subsystem: "cache",
-		Name:      "miss_size_total",
+		Name:      "miss_size_bytes_total",
+		Help:      "Total size occupied in cache by layer in bytes",
 	}, []string{"layer"})


⚠️ Potential issue | 🟠 Major

Type/semantics mismatch: miss_size_bytes_total (Counter) vs Help implying current occupancy.
If this is “current bytes occupied”, it should be a Gauge (e.g., size_bytes / occupied_bytes). If it’s “bytes added due to misses over time”, keep Counter but update Help to “Total bytes …”.

🤖 Prompt for AI Agents

In fracmanager/metrics.go around lines 90 to 95, the metric is declared as a Counter named miss_size_bytes_total but the Help text describes current cache occupancy; decide the intended semantics and fix: if this should represent current bytes occupied per layer, change the metric to a Gauge (rename to something like size_bytes or occupied_bytes) and update Name/Help accordingly; if it should represent cumulative bytes added due to misses, keep a Counter but update the Help string to clearly state it is the total bytes added due to cache misses (e.g., "Total bytes added to cache due to misses, by layer").

coderabbitai · 2025-12-12T08:03:24Z

metric/store.go

 	BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{
 		Namespace: "seq_db_store",
 		Subsystem: "bulk",
-		Name:      "meta_bytes_total",
-		Help:      "",
+		Name:      "meta_bytes",
+		Help:      "Byte size of metadata in bulk requests in bytes",
 		Buckets:   prometheus.ExponentialBuckets(1, 4, 16),
 	})


⚠️ Potential issue | 🟡 Minor

Tighten a couple Help strings (duplication / phrasing).

BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{ Namespace: "seq_db_store", Subsystem: "bulk", Name: "meta_bytes", - Help: "Byte size of metadata in bulk requests in bytes", + Help: "Byte size of metadata in bulk requests", Buckets: prometheus.ExponentialBuckets(1, 4, 16), }) SearchDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{ Namespace: "seq_db_store", Subsystem: "search", Name: "duration_seconds", - Help: "Search request duration time (only successful searches)", + Help: "Search request duration (only successful searches)", Buckets: SecondsBuckets, })

Also applies to: 56-62

🤖 Prompt for AI Agents

In metric/store.go around lines 37-43 (and similarly lines 56-62), the Help strings are wordy and duplicate units; tighten them to concise, consistent phrases: replace "Byte size of metadata in bulk requests in bytes" with something like "Size of metadata per bulk request (bytes)" and likewise change the other Help value to "Size of payload per bulk request (bytes)"; ensure both use consistent capitalization and unit notation.

coderabbitai · 2025-12-12T08:03:24Z

proxy/bulk/metrics.go

 	inflightBulks = promauto.NewGauge(prometheus.GaugeOpts{
 		Namespace: "seq_db_ingestor",
 		Subsystem: "bulk",
-		Name:      "in_flight_queries_total",
-		Help:      "",
+		Name:      "in_flight_queries",
+		Help:      "Number of bulks that are being processed by ingestor currently",
 	})


⚠️ Potential issue | 🟡 Minor

Metric name/Help mismatch: in_flight_queries tracks bulks, not queries.
Either rename to something like in_flight_bulks / in_flight_requests, or adjust Help to match “queries” if that’s truly what’s being counted.

🤖 Prompt for AI Agents

In proxy/bulk/metrics.go around lines 9 to 14 the metric Name "in_flight_queries" conflicts with the Help which says it counts bulks; change the metric to consistently represent bulks: rename Name to "in_flight_bulks" (or "in_flight_requests" if preferred), update Help to "Number of bulks that are being processed by ingestor currently", and update any code, tests, dashboards, or alerting rules that reference the old metric name to use the new name.

coderabbitai · 2025-12-12T08:03:24Z

storeapi/grpc_bulk.go

 var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{
 	Namespace: "seq_db_store",
 	Subsystem: "bulk",
-	Name:      "in_flight_queries_total",
+	Name:      "in_flight_queries",
+	Help:      "Number of bulk request being currently processed by store",
 })


⚠️ Potential issue | 🟡 Minor

Fix Help grammar (and optionally align wording across components).

Current: “Number of bulk request being currently processed by store” → suggest:

var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{ Namespace: "seq_db_store", Subsystem: "bulk", Name: "in_flight_queries", - Help: "Number of bulk request being currently processed by store", + Help: "Number of bulk requests currently being processed by the store", })

🤖 Prompt for AI Agents

In storeapi/grpc_bulk.go around lines 20 to 25, the metric Help string grammar is incorrect; update the Help value to a grammatically correct and consistent phrasing such as "Number of bulk requests currently being processed by the store" (or another aligned wording used across components), ensuring pluralization, article usage, and clarity match other metrics.

eguguchkin · 2025-12-23T09:32:02Z

nit:

Suggestion: we should have a development guidlines file where we could store rules like "counters must end with _total" and "metrics should never end with _total"

+1

Is .github/CONTRIBUTING.md the right place for this?

cheb0 · 2025-12-23T10:55:19Z

Added a new section "General Coding Guidelines" to CONTRIBUTING.md with a point that metrics must follow the convention.

eguguchkin · 2025-12-23T09:46:52Z

frac/processor/search_stats.go

 		Namespace: metricsNamespace,
 		Subsystem: metricsSubsystem,
-		Name:      "sources_total",
+		Name:      "sources",


Since we're renaming it anyway, let's pick a more descriptive name connected with lids.

from now on it's lids_loaded

eguguchkin · 2025-12-23T09:54:07Z

frac/fraction.go

 		Namespace: "seq_db_store",
 		Subsystem: "search",
-		Name:      "tracer_fraction_agg_search_sec",
+		Name:      "tracer_fraction_agg_search_seconds",


Lets remove "tracer_" prefix? Here and below

eguguchkin · 2025-12-23T15:18:51Z

metric/metric.go

 		Namespace: "seq_db",
-		Name:      "version",
-		Help:      "",
+		Name:      "version_info_total",


I think we might be using a counter here unnecessarily? Maybe we should have used a gauge instead?

Or rename the metric to version_starts_total.

thing is, I'd avoid changing type since we need to maintain both old and new in a single dashboard. I'm not sure if we can do that. Maybe in a different PR

eguguchkin · 2025-12-23T15:19:19Z

metric/metric.go

-		Name:      "version",
-		Help:      "",
+		Name:      "version_info_total",
+		Help:      "Number of seq-db instances with a particular version",


And accordingly, the description is something like: always 1 for the current version.

updated a desc

eguguchkin · 2025-12-23T15:21:28Z

metric/store.go

-		Name:      "meta_bytes_total",
-		Help:      "",
+		Name:      "meta_bytes",
+		Help:      "Byte size of metadata in bulk requests in bytes",


Rabbit is right about this.

eguguchkin · 2025-12-23T15:26:21Z

proxy/bulk/metrics.go

 		Subsystem: "bulk",
-		Name:      "in_flight_queries_total",
-		Help:      "",
+		Name:      "in_flight_queries",


I agree with rabbit here as well. And probably we can call it just in_flight.

eguguchkin · 2025-12-23T15:38:44Z

storeapi/grpc_bulk.go

 	Namespace: "seq_db_store",
 	Subsystem: "bulk",
-	Name:      "in_flight_queries_total",
+	Name:      "in_flight_queries",


we can just call it in_flight (since we're changing it anyway)

and code rabbit also suggested a fix in Help

cheb0 added 2 commits November 27, 2025 16:45

migrate to OpenMetrics standard

c5c3411

add help for all metrics

4cf6a04

eguguchkin requested review from dkharms and eguguchkin December 5, 2025 13:57

dkharms approved these changes Dec 12, 2025

View reviewed changes

coderabbitai bot reviewed Dec 12, 2025

View reviewed changes

add coding guidlines that metrics must follow OpenMetrics convention

827a666

eguguchkin approved these changes Dec 23, 2025

View reviewed changes

cheb0 added 2 commits December 24, 2025 11:59

Rename some metrics (PR review)

4e7347c

Merge branch 'main' into 280-open-metrics-migration

5d45557

chore: migrate metrics to OpenMetrics standard #281

Are you sure you want to change the base?

chore: migrate metrics to OpenMetrics standard #281

Uh oh!

Conversation

cheb0 commented Dec 1, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Uh oh!

codecov-commenter commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dkharms commented Dec 12, 2025

Uh oh!

coderabbitai bot commented Dec 12, 2025

Uh oh!

dkharms left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Dec 12, 2025

Walkthrough

Estimated code review effort

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

eguguchkin commented Dec 23, 2025

Uh oh!

cheb0 commented Dec 23, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

cheb0 commented Dec 1, 2025 •

edited by coderabbitai bot

Loading

codecov-commenter commented Dec 1, 2025 •

edited

Loading