Skip to content

Conversation

@cheb0
Copy link
Member

@cheb0 cheb0 commented Dec 1, 2025

Description

Migrated metrics to OpenMetrics standard. I will list the primary rules of the standard and what has been changed.

  • Counter metrics must have _total suffix in their name. I added _total suffix to all counters where it's necessary.
  • All other metrics (except for counters) must not have any of the following suffixes: _total, _count, _created, _sum, _bucket. I removed _total suffix from all gauge/hist metrics. There was a single metric with _count suffix, which was also renamed.
  • It's recommended to have unit types at the end of the metric if appropriate: seconds, bytes, fractions, etc. I added _seconds or _bytes where appropriate. For oldest entry timestamp_seconds are units which are recommended to use. There is prometheus naming best practices: https://prometheus.io/docs/practices/naming/
  • It's recomended to avoid short names and abbreviations for unit types like sec instead of seconds. I changed _sec to _seconds whenever I could.
  • It's recommended to avoid short terms if they are not commonly used. I replaced a few places like tracer_fraction_reg_search_sec -> tracer_fraction_regular_search_seconds in the metrics names, but cases like agg, docs and meta are used commonly and could be understood easily. I left them as it is. I also unified documents and docs in metric names - now docs term is used everywhere.
  • It's not recommended to have metric type in the name. We do not have such metrics. Some metrics have hist name in them, but it's acceptable since they do not describe metric type but rather an operation type (search with hist param).
  • It's recommended to have help for each metric. I added help for all metrics, let's review if help makes sense.
  • snake_case is recommended for metric names and labels. Already done.
  • le и quantile are forbidden names for labels. We didn't have any of them.

I also found the following cases:

  • metrics TokenizerTokensPerMessage, TokenizerParseDurationSeconds are not used currently
  • skippedIndexes, skippedIndexesBytes (used for tokenization) are currently located in metric/store.go, i.e. place where store metrics. perhaps we could move them to ingestor or tokenizer metrics.go file
  • skip_shard_total - seems like an inconsistent name since we choose a shard for storing but loop through replicas and skip a replica if it was already chosen for storing the current bulk.

Suggestion: we should have a development guidlines file where we could store rules like "counters must end with _total" and "metrics should never end with _total"

Fixes #280


  • I have read and followed all requirements in CONTRIBUTING.md;
  • I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

Summary by CodeRabbit

  • Documentation
    • Enhanced Prometheus metrics documentation with clearer help text across multiple modules, improving observability and monitoring clarity.
    • Standardized metric naming conventions with more explicit suffixes (_total, _bytes, _seconds) for better metric identification and consistency in monitoring systems.

✏️ Tip: You can customize this high-level summary in your review settings.

@codecov-commenter
Copy link

codecov-commenter commented Dec 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.94%. Comparing base (4c86707) to head (5d45557).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #281      +/-   ##
==========================================
- Coverage   71.06%   70.94%   -0.13%     
==========================================
  Files         201      201              
  Lines       14773    14773              
==========================================
- Hits        10498    10480      -18     
- Misses       3527     3541      +14     
- Partials      748      752       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dkharms
Copy link
Member

dkharms commented Dec 12, 2025

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 12, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Member

@dkharms dkharms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

@coderabbitai
Copy link

coderabbitai bot commented Dec 12, 2025

📝 Walkthrough

Walkthrough

This PR standardizes Prometheus metrics across 13 files to align with OpenMetrics naming conventions. Changes include renaming metric names to use standardized suffixes (e.g., _seconds, _bytes_total, _total) and adding descriptive Help text to previously undocumented metrics. No functional logic or control flow is modified; only metric metadata and identifiers are updated.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Areas requiring attention:

  • Verify metric name changes follow consistent OpenMetrics conventions across files (e.g., _total vs _bytes_total vs _seconds suffixes applied correctly)
  • Spot-check Help text accuracy and consistency in tone/clarity, particularly in files with many renamed metrics (metric/ingestor.go, fracmanager/metrics.go)
  • Confirm that metric name changes don't break external dependencies or dashboards that reference the old names
  • Verify renamed metrics like in_flight_queries_totalin_flight_queries are intentional and consistent with standard conventions

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change—migrating metrics to OpenMetrics standard—matching the objectives and scope of changes.
Linked Issues check ✅ Passed Code changes comprehensively implement OpenMetrics migration: counter suffixes added, help text added, unit suffixes applied, and naming conventions standardized across all modified files.
Out of Scope Changes check ✅ Passed All changes are scoped to metric naming, help text, and unit suffix alignment per OpenMetrics standard; no unrelated code modifications or functional changes detected.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
proxy/bulk/metrics.go (1)

22-28: Histogram Help is OK, but consider clarifying “per bulk” if that’s the intent.
Right now “Number of successfully stored docs” could be read as a global total (counter), while a histogram is usually “docs written per bulk/request”.

metric/metric.go (1)

25-30: counters_total Help is very generic.
If feasible, mention what name label values represent (event types) to make the metric self-documenting.

asyncsearcher/async_searcher.go (1)

46-70: Metrics renames/Help look consistent; tiny wording nit in read_only Help (“can not” → “cannot”).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfd50b6 and 4cf6a04.

📒 Files selected for processing (13)
  • asyncsearcher/async_searcher.go (1 hunks)
  • bytespool/bytespool.go (1 hunks)
  • frac/active.go (1 hunks)
  • frac/fraction.go (2 hunks)
  • frac/processor/search_stats.go (1 hunks)
  • fracmanager/metrics.go (4 hunks)
  • indexer/metrics.go (2 hunks)
  • metric/ingestor.go (5 hunks)
  • metric/metric.go (1 hunks)
  • metric/store.go (1 hunks)
  • proxy/bulk/metrics.go (2 hunks)
  • proxyapi/http_bulk.go (1 hunks)
  • storeapi/grpc_bulk.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
proxyapi/http_bulk.go (1)
metric/metric.go (1)
  • SecondsBuckets (33-33)
frac/fraction.go (1)
metric/metric.go (1)
  • SecondsBuckets (33-33)
metric/store.go (1)
metric/metric.go (2)
  • SecondsBuckets (33-33)
  • SecondsRanges (37-37)
metric/ingestor.go (1)
metric/metric.go (1)
  • SecondsBuckets (33-33)
🔇 Additional comments (9)
frac/active.go (1)

202-208: Help text addition for seq_db_store_bulk_stages_seconds looks good.

indexer/metrics.go (1)

11-17: Metric rename + Help clarifications look consistent with OpenMetrics conventions.

Also applies to: 26-31, 33-39

frac/processor/search_stats.go (1)

17-51: Histogram renames + Help text additions are clear and appropriately scoped.

frac/fraction.go (1)

29-57: Seconds-suffix renames + Help additions look good; metric selection updated cleanly.

Also applies to: 60-70

proxyapi/http_bulk.go (1)

29-55: Bulk metrics Help additions + _total counter rename look aligned with OpenMetrics guidance.

bytespool/bytespool.go (1)

34-59: Bytes pool metric Help texts are clear and accurately describe semantics.

metric/ingestor.go (2)

9-45: Search metrics: naming/Help looks aligned (seq_db_ingestor_search_total, etc.).


188-194: Verify incoming_text_len_bytes is actually recording bytes (not runes/characters).
Since you renamed to _bytes, ensure the observed value uses len([]byte(s)) semantics consistently.

metric/metric.go (1)

11-17: Change version_info_total to a Gauge named version_info.

Prometheus best practice for version/build info is a Gauge with value=1 and labels for version/build fields (e.g., seq_db_version_info{version="..."} = 1). A Counter named version_info_total mixes semantics: "info" suggests static metadata, while "_total" and Counter type suggest a cumulative count. This breaks standard dashboard queries and recording rules that expect the *_info gauge pattern.

Refactor to use prometheus.NewGaugeVec with metric name version_info and set value to 1 in init() instead of incrementing.

Comment on lines 54 to 59
cacheMissTotal = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "seq_db_store",
Subsystem: "cache",
Name: "miss_total",
Help: "Number of cache misses by layer",
}, []string{"layer"})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Inconsistent counter name: miss_total should likely be misses_total.
This will be a breaking rename anyway; worth aligning with hits_total.

🤖 Prompt for AI Agents
In fracmanager/metrics.go around lines 54 to 59, the Prometheus counter is named
"miss_total" which is inconsistent with "hits_total" and Prometheus conventions;
change the metric Name to "misses_total" (and optionally update the Help text to
"Number of cache misses by layer") and then update any code, tests, dashboards,
alerts and documentation that reference the old "miss_total" metric to the new
"misses_total" name to avoid breakage.

Comment on lines 90 to 95
cacheSizeOccupied = promauto.NewCounterVec(prometheus.CounterOpts{
Namespace: "seq_db_store",
Subsystem: "cache",
Name: "miss_size_total",
Name: "miss_size_bytes_total",
Help: "Total size occupied in cache by layer in bytes",
}, []string{"layer"})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Type/semantics mismatch: miss_size_bytes_total (Counter) vs Help implying current occupancy.
If this is “current bytes occupied”, it should be a Gauge (e.g., size_bytes / occupied_bytes). If it’s “bytes added due to misses over time”, keep Counter but update Help to “Total bytes …”.

🤖 Prompt for AI Agents
In fracmanager/metrics.go around lines 90 to 95, the metric is declared as a
Counter named miss_size_bytes_total but the Help text describes current cache
occupancy; decide the intended semantics and fix: if this should represent
current bytes occupied per layer, change the metric to a Gauge (rename to
something like size_bytes or occupied_bytes) and update Name/Help accordingly;
if it should represent cumulative bytes added due to misses, keep a Counter but
update the Help string to clearly state it is the total bytes added due to cache
misses (e.g., "Total bytes added to cache due to misses, by layer").

Comment on lines 37 to 43
BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{
Namespace: "seq_db_store",
Subsystem: "bulk",
Name: "meta_bytes_total",
Help: "",
Name: "meta_bytes",
Help: "Byte size of metadata in bulk requests in bytes",
Buckets: prometheus.ExponentialBuckets(1, 4, 16),
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Tighten a couple Help strings (duplication / phrasing).

 BulkMetaBytesTotal = promauto.NewHistogram(prometheus.HistogramOpts{
   Namespace: "seq_db_store",
   Subsystem: "bulk",
   Name:      "meta_bytes",
-  Help:      "Byte size of metadata in bulk requests in bytes",
+  Help:      "Byte size of metadata in bulk requests",
   Buckets:   prometheus.ExponentialBuckets(1, 4, 16),
 })

 SearchDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
   Namespace: "seq_db_store",
   Subsystem: "search",
   Name:      "duration_seconds",
-  Help:      "Search request duration time (only successful searches)",
+  Help:      "Search request duration (only successful searches)",
   Buckets:   SecondsBuckets,
 })

Also applies to: 56-62

🤖 Prompt for AI Agents
In metric/store.go around lines 37-43 (and similarly lines 56-62), the Help
strings are wordy and duplicate units; tighten them to concise, consistent
phrases: replace "Byte size of metadata in bulk requests in bytes" with
something like "Size of metadata per bulk request (bytes)" and likewise change
the other Help value to "Size of payload per bulk request (bytes)"; ensure both
use consistent capitalization and unit notation.

Comment on lines 9 to 14
inflightBulks = promauto.NewGauge(prometheus.GaugeOpts{
Namespace: "seq_db_ingestor",
Subsystem: "bulk",
Name: "in_flight_queries_total",
Help: "",
Name: "in_flight_queries",
Help: "Number of bulks that are being processed by ingestor currently",
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Metric name/Help mismatch: in_flight_queries tracks bulks, not queries.
Either rename to something like in_flight_bulks / in_flight_requests, or adjust Help to match “queries” if that’s truly what’s being counted.

🤖 Prompt for AI Agents
In proxy/bulk/metrics.go around lines 9 to 14 the metric Name
"in_flight_queries" conflicts with the Help which says it counts bulks; change
the metric to consistently represent bulks: rename Name to "in_flight_bulks" (or
"in_flight_requests" if preferred), update Help to "Number of bulks that are
being processed by ingestor currently", and update any code, tests, dashboards,
or alerting rules that reference the old metric name to use the new name.

Comment on lines 20 to 25
var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{
Namespace: "seq_db_store",
Subsystem: "bulk",
Name: "in_flight_queries_total",
Name: "in_flight_queries",
Help: "Number of bulk request being currently processed by store",
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix Help grammar (and optionally align wording across components).

Current: “Number of bulk request being currently processed by store” → suggest:

 var inflightBulksTotal = promauto.NewGauge(prometheus.GaugeOpts{
   Namespace: "seq_db_store",
   Subsystem: "bulk",
   Name:      "in_flight_queries",
-  Help:      "Number of bulk request being currently processed by store",
+  Help:      "Number of bulk requests currently being processed by the store",
 })
🤖 Prompt for AI Agents
In storeapi/grpc_bulk.go around lines 20 to 25, the metric Help string grammar
is incorrect; update the Help value to a grammatically correct and consistent
phrasing such as "Number of bulk requests currently being processed by the
store" (or another aligned wording used across components), ensuring
pluralization, article usage, and clarity match other metrics.

@eguguchkin
Copy link
Contributor

nit:

Suggestion: we should have a development guidlines file where we could store rules like "counters must end with _total" and "metrics should never end with _total"

+1

Is .github/CONTRIBUTING.md the right place for this?

@cheb0
Copy link
Member Author

cheb0 commented Dec 23, 2025

Added a new section "General Coding Guidelines" to CONTRIBUTING.md with a point that metrics must follow the convention.

Namespace: metricsNamespace,
Subsystem: metricsSubsystem,
Name: "sources_total",
Name: "sources",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're renaming it anyway, let's pick a more descriptive name connected with lids.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from now on it's lids_loaded

frac/fraction.go Outdated
Namespace: "seq_db_store",
Subsystem: "search",
Name: "tracer_fraction_agg_search_sec",
Name: "tracer_fraction_agg_search_seconds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove "tracer_" prefix? Here and below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

metric/metric.go Outdated
Namespace: "seq_db",
Name: "version",
Help: "",
Name: "version_info_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be using a counter here unnecessarily? Maybe we should have used a gauge instead?

Or rename the metric to version_starts_total.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thing is, I'd avoid changing type since we need to maintain both old and new in a single dashboard. I'm not sure if we can do that. Maybe in a different PR

metric/metric.go Outdated
Name: "version",
Help: "",
Name: "version_info_total",
Help: "Number of seq-db instances with a particular version",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And accordingly, the description is something like: always 1 for the current version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated a desc

metric/store.go Outdated
Name: "meta_bytes_total",
Help: "",
Name: "meta_bytes",
Help: "Byte size of metadata in bulk requests in bytes",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rabbit is right about this.

image

Subsystem: "bulk",
Name: "in_flight_queries_total",
Help: "",
Name: "in_flight_queries",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with rabbit here as well. And probably we can call it just in_flight.

Namespace: "seq_db_store",
Subsystem: "bulk",
Name: "in_flight_queries_total",
Name: "in_flight_queries",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can just call it in_flight (since we're changing it anyway)

and code rabbit also suggested a fix in Help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate metrics to OpenMetrics standard

5 participants