Time series databases care about cardinality for performance and scale, so many of them provide tools for imposing limits on the cardinality of stored time series. These tools are generally pretty crude from the end user’s perspective though.
For example, consider [Victoria Metrics](https://docs.victoriametrics.com/vmagent.html#cardinality-limiter). Their cardinality limiter has some great features:
- You can configure an hourly and daily limit on new time series.
- The limiter itself produces metrics describing its operation, so you can tell when metrics are being dropped and you can fire an alert when usage is approaching the limits.
- You can configure separate limits for each thing that produces metrics (a “scrape target” in Prometheus parlance) so you can have a low limit for your database server and a higher limit for your Kubernetes control plane.
But if you’re not an observability infrastructure engineer, you get a pretty bad experience when the limits are hit: all your new time series are dropped on the floor. In practice, that might mean dashboards dropping to zero, alerts misfiring, and general mayhem. Not really what you want from your monitoring system. Can we do better?
Note: This document describes metric samples in the OpenMetrics format and uses PromQL to describe calculations you might do on those metric samples.
Why it’s hard
Pretend you’re a time series database and someone sends you a new time series above your cardinality limit. What might you do with it? There are a few options:
- Drop it.
- Find some other time series that hasn’t been used in a while and evict that one instead, making room for the new one.
- Aggregate the new time series with some sufficiently similar existing one so that cardinality doesn’t change but no data is completely dropped.
Let’s consider the user impact of each of these. Suppose you are monitoring a web backend app running on Kubernetes, you’ve got 3 pods in your deployment, and each pod produces these metrics:
http_requests_total{endpoint="/login", status="400", pod="app-1234"} 2
http_requests_total{endpoint="/login", status="500", pod="app-1234"} 3
http_requests_total{endpoint="/details", status="200", pod="app-1234"} 100
http_requests_total{endpoint="/details", status="400", pod="app-1234"} 2
http_requests_total{endpoint="/details", status="500", pod="app-1234"} 3
Let’s say you’ve also got a dashboard showing the total request rate in your app:
sum(rate(http_requests_total[5m]))
And also an alert based on error rate:
sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m]))
That’s 6 time series per pod, and you’ve got 3 pods, so 18 time series total. Say that you configure your TSDB for a max of 25 time series. Now let’s deploy a new build and see what happens to your dashboard.
If your TSDB drops new time series…
When the new build goes out, your Kubernetes cluster temporarily runs 6 pods: 3 old pods that will soon stop getting traffic and be killed, and 3 new pods. The 3 old pods and 1 of the new pods can report their metrics just fine, but the other 2 new pods run into trouble: they want to write a total of 12 new time series, but the TSDB already has 24 active time series and can only accept 1 more. So one of those pods gets to report a single time series while the rest are dropped, and the last pod can’t report anything at all.
In your dashboard, the total number of requests drops significantly since almost two thirds of your data is discarded. If you try to drill down by endpoint or status, you get misleading results, and according to your monitoring system, only 2 pods are running. It looks like a major outage even though everything is fine, and it persists until you raise the limit on your TSDB. If you do the reasonable thing and revert your deploy, now none of the third batch of pods can report metrics and it looks like a total outage!
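A per-endpoint drill-down panel, for example, might use something along these lines (an illustrative query, not one defined earlier in this document):
sum by (endpoint) (rate(http_requests_total[5m]))
Since most of the new pods can’t report, this query only sees a fraction of your real traffic, so every endpoint appears to have dropped sharply.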
If you’re lucky, your monitoring system can tell you it’s overloaded and you know not to trust it. But many systems don’t, or rely on you having configured a warning ahead of time.
If your TSDB evicts old time series…
In practice, this is done with a time window. A time series is considered old if it hasn’t been reported in the last hour or something like that.
For the first hour after your deployment, your monitoring system gives misleading results much like in the first example. After an hour, the time series from the old pods fall outside the window, leaving enough room for the new ones to get in, and everything is back to normal. But an hour is a long time.
If your TSDB aggregates similar time series…
Surely we can do better than the poor user experience in the previous examples. Let’s imagine that our TSDB can aggregate similar data on the fly when it’s overloaded, and that we’ve already configured it to be wary of the pod dimension and to prefer aggregating that dimension away if need be.
With our smarter system, let’s try that deploy again. The time series limit is hit again, but this time the TSDB aggregates the data to prevent dropping anything. You end up with something like this:
http_requests_total{endpoint="/login", status="200", pod="TOO_MUCH_DATA"} 300
http_requests_total{endpoint="/login", status="400", pod="TOO_MUCH_DATA"} 6
http_requests_total{endpoint="/login", status="500", pod="TOO_MUCH_DATA"} 9
http_requests_total{endpoint="/details", status="200", pod="TOO_MUCH_DATA"} 300
http_requests_total{endpoint="/details", status="400", pod="TOO_MUCH_DATA"} 6
http_requests_total{endpoint="/details", status="500", pod="TOO_MUCH_DATA"} 9
Your dashboard looks fine, because it was already aggregating away the pod dimension. Any drilldowns that rely on that dimension will stop working though.
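For example, a per-pod breakdown like the following (again, just an illustrative query) would now return a single series labeled pod="TOO_MUCH_DATA" instead of one series per pod:
sum by (pod) (rate(http_requests_total[5m]))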
This system degrades gracefully, but it needs some configuration to work in a reasonable way. If we were to remove that configuration and let the system decide what to do on its own, it might decide to aggregate away the status dimension instead, which wouldn’t be great; a monitoring system that can’t tell you whether or not every request is failing is not useful.
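Concretely, the error-rate alert from earlier depends entirely on the status label:
sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m]))
If status only ever takes an overflow value, the numerator matches nothing, the expression returns no data, and the alert can never fire.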
Confident that you’ve solved your monitoring problems, you move on to the next task in your backlog: refactoring your APIs. Your endpoints are now of the form /<version>/<name>, there’s a v1 and a v2, and in addition to login and details you now have a summary as well. You deploy, and your TSDB has to start dropping time series again because now the explosion is happening on the endpoint dimension, not pod. Oops…
Takeaways
- It’s really hard to present a consistent view of the system being monitored when cardinality limits are being hit, so it’s very important to tell the user when the monitoring system itself is in a degraded state to minimize confusion.
- For the same reason, the tried-and-true solution to this problem is to avoid hitting cardinality limits in the first place.
Things we can try
Now that you understand why no one has found a general-purpose solution to this problem, and why the approach Victoria Metrics takes is more or less the industry standard, here are some improvements we could explore:
- Specific Values: There’s some dimension that’s known to be high cardinality but some values are more interesting than others, so keep time series for a small number of specific values and aggregate the rest into an “everything else” bucket. Example: database queries by project_id, but separate out interesting projects like project_id:1 and put the rest in project_id:other.
- Top N: Instead of (or in addition to) selecting specific values, select the top N by value over some time window (see the sketch after this list). Example: database queries by project_id for the top 100 projects over the past hour, with the rest in project_id:other. Note that you can’t trust a query like sum(database_query_total{project_id="1"}) to give you anything other than a lower bound, because some datapoints may be missing if that project moves in and out of the top 100 over time.
- Dimension priority: Enforce a maximum number of time series for a given metric name by defining an ordering of dimensions from most important to least important, and then aggregating away the least important dimensions first. Example: in the example from the first half of this document, status is the most important, endpoint is the next most important, and pod is the least important. Even in the degenerate case where you have only 2 time series for http_requests_total, you could imagine using one to track 5XX statuses and the other to track everything else. Such a scheme would give you decent visibility under extreme pressure.
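As a rough sketch of the Top N idea in PromQL, something like this could pick out the top 100 projects by query volume over the past hour (database_query_total and project_id are just the hypothetical names from the example above):
topk(100, sum by (project_id) (increase(database_query_total[1h])))
Everything outside that set would then be folded into project_id:other before storage.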
The harder problems
Limit by monitored entity
Almost everything you monitor exposes more than one metric name. For example, a single web backend pod might have these metrics:
go_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
http_requests_total
http_response_size_bytes_count
http_response_time_seconds_bucket
http_response_time_seconds_count
http_response_time_seconds_sum
It gets difficult to reason about the system’s behavior if some of these metrics are recorded for a given pod and others are dropped due to cardinality limits. For example, if only half of your pods are successfully reporting http_response_time_seconds_count but all of them are reporting http_response_time_seconds_sum, it would be easy to come to an incorrect conclusion about your app’s performance.
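For instance, the usual way to chart average response time from a _sum/_count pair is something along these lines (an illustrative query, not one defined in this document):
sum(rate(http_response_time_seconds_sum[5m])) / sum(rate(http_response_time_seconds_count[5m]))
If every pod reports the _sum series but only half report the _count series, the denominator is undercounted and your app can look nearly twice as slow as it really is (assuming traffic is spread evenly).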
You could avoid this problem if your monitoring system could guarantee that for each pod, its metrics are either 1) all stored successfully or 2) all aggregated into an “everything else” bucket. Suppose you could tell your monitoring system that you’d like to see metrics for up to 10 pods, and beyond that, the metrics should all be aggregated, effectively dropping the pod dimension:
limit_by_unique_values:
  key_label: "pod"
  max_unique_values: 10
  overflow_value: "OTHER_POD"
This would effectively defend against cardinality explosions in the pod dimension, but you’d need to apply similar controls to every other dimension that could grow too. In the real world, where you’ve got multiple clusters, deployments, pods, endpoints, and so forth, it gets complex quickly.
Graceful degradation at query time
Say I add this query to my dashboard:
rate(http_requests_total{endpoint="/login", status="401"}[5m])
And then a cardinality explosion occurs and your monitoring system reacts by aggregating away the endpoint dimension, i.e. the only value that label takes is OTHER_ENDPOINT or something. That’s going to break this query, so from the perspective of my dashboard, it’s as if all the data is being dropped even though your monitoring system is trying to prevent exactly that outcome.
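For comparison, a query that only touches labels that survive the aggregation, like the illustrative one below, would keep returning data; the queries that break are exactly the ones that drill into the aggregated dimension:
sum(rate(http_requests_total{status="401"}[5m]))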
How could you do better? There are a few options:
- Once a label value is successfully recorded once, never aggregate away that label value in the future. This prevents breaking the query above, but it could lead to nonsensical outcomes. Imagine you totally refactor your app’s endpoints: the traffic on /login might drop to zero and all of your traffic gets aggregated into endpoint:OTHER_ENDPOINT forever; that’s pretty unfortunate.
- Do the same as above, but also apply a time window. That query would still break eventually, just not immediately.
- Recognize that not all time series are equal. Usually time series queries are saved for one of two use cases: dashboards and alerts. Expose this information to your cardinality limiting mechanism so that such saved queries are never broken unless absolutely necessary.
- Have some way of pushing back when the user issues a now-broken query. This could involve giving them feedback on which dimension is no longer available, directing them to event storage to answer their question differently, or something else.
The even harder problem
Monitoring systems often have two separate components: a time series database and an events database. Metrics are good at showing what is happening with the system being monitored, while events are good at explaining why things are happening.
In a hypothetical world with no overhead and no costs, you’d just store every event. Then you could answer both the “what” and the “why” with the same data. But we do have to worry about both overhead and costs, so by being cheap to record, transmit, and store, metrics give you useful visibility at a reasonable cost, as long as you can accept cardinality limitations.
What if you can’t accept those limitations? Are there any options other than storing every event? One approach is to sample events, record the sampling rate in each event, and then weight the events by the sampling rate at query time. For example, suppose you have these HTTP request events from your web backend:
[{"endpoint": "/login",
"response_time": 123,
"status": 200,
"weight": 10},
{"endpoint": "/login",
"response_time": 41,
"status": 401,
"weight": 5},
{"endpoint": "/login",
"repsonse_time": 10354,
"status": 503,
"weight": 1}]
In this example, 5xx server errors are sampled at 100%, 4xx client errors are sampled at 20%, and everything else is sampled at 10%. From these events, you know that the backend handled approximately 16 requests during this time period and you can also calculate an approximate error rate and latency distribution. Fidelity is lost if the sampling rate is too low, but a system like this isn’t sensitive to cardinality like a time series database.
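To spell out the arithmetic: the weights sum to 10 + 5 + 1 = 16 requests, and only the weight-1 event is a 5xx, so the approximate server error rate for this window is 1 / 16 ≈ 6%; weighting the response times the same way gives you an approximate latency distribution.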
If you had both a time series database and an event database like this, your cardinality limiter has a better option than dropping data: offload it to the events database. In order for such a system to work, you’d have to solve a series of hard problems:
- Somewhere in your system, you have to decide whether to represent a given measurement as metrics or as an event. Where should that be? Once you decide, there’s no going back, because aggregation into metrics is lossy. Note that many monitoring systems do a first round of metric extraction and aggregation within or very close to the app being monitored to keep overhead low, so this is a significant challenge.
- You need some way of expressing queries that can be answered by either your time series database or your events database. When you write the query, you don’t know which system is going to answer your query.
- You need some mechanism for promoting events into metrics and demoting metrics into events.
- If data transitions between systems — e.g. some data that used to be represented as metrics proves to have too much cardinality and gets demoted into events — the thing that executes queries needs to stitch results from both systems together. The events portion of the results is approximate, and that somehow needs to be exposed to the user in a way that doesn’t mislead them into incorrect conclusions.