With the adoption of AI, usage attribution and chargeback use cases are on the rise. Modern businesses are also eager to gather usage data and attribute it to their customers and internal departments to drive billing, sales, product development, and cloud cost analysis.
The FinOps Foundation also recently unveiled the initial draft of FOCUS (FinOps Open Cost & Usage Specification). So why is usage data complex, and what differentiates event metering from time-series metrics?
Before delving into the intricacies of billing, analytics, and monitoring use cases, let's define what we mean by usage data. Usage describes someone consuming a good in a time period. For example, between 1 PM and 2 PM, Alice sent 100 SMS via the Twilio API.
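To make the definition concrete, a usage record can be sketched as a small data structure: who consumed what, how much, and over which period. The class name, field names, and the date below are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical usage record: a subject consuming a good in a time period."""

    subject: str             # who consumed, e.g. a customer or department
    resource: str            # what was consumed
    quantity: float          # how much was consumed
    period_start: datetime   # inclusive start of the period
    period_end: datetime     # exclusive end of the period

    @property
    def duration(self) -> timedelta:
        """Length of the period the usage is reported for."""
        return self.period_end - self.period_start


# "Between 1 PM and 2 PM, Alice sent 100 SMS" (the date itself is arbitrary).
record = UsageRecord(
    subject="alice",
    resource="sms",
    quantity=100,
    period_start=datetime(2023, 7, 1, 13, 0),
    period_end=datetime(2023, 7, 1, 14, 0),
)
print(record.subject, record.resource, record.quantity, record.duration)
```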
Usage is usually described for a time period rather than a single point in time because computers are fast, but humans are slow: individual events occur far faster than people can reason about them. Let's look at some of the common use cases requiring usage data:
Billing: This necessitates accurate usage data as customers are charged based on legally binding contract terms. While the data dimensions are often limited, the cardinality is high as usage data must be tracked for every customer.
Real-time data is optional, but prompt notifications are required when a user reaches a billing threshold. Data retention is crucial for validating bills, although this becomes less important once the invoice is settled.
Monitoring: This requires real-time usage data for alerting purposes. Accuracy is important, but the requirements are more flexible than for billing. Monitoring systems often struggle with high cardinality.
Data retention is usually short due to the costs of storing large volumes of monitoring data, which is rarely utilized after a few weeks.
Analytics: Typical use cases such as cloud cost, margin analysis, and pricing necessitate accurate historical data from the past three to five years to train models and identify trends effectively. Analytics is rarely real-time.
Summarized as a table:
| Use Case | Accuracy | Cardinality | Real-Time | Retention |
|---|---|---|---|---|
| Billing | High | High | Moderate | 1-2 Years |
| Monitoring | Moderate | Low | High | Weeks |
| Analytics | High | Moderate | Low | 3+ Years |
As you can see, each use case has different needs, which can be confusing when discussing usage data.
The concept of classifying data as auditable or operational was first brought to my attention in 2018 through a tweet by Charity Majors, cofounder of Honeycomb.io.
Auditable data is classified as such when the loss of any data record is intolerable, and full retention of records is necessary. When utilizing an auditable dataset, it is expected to be comprehensive and complete.
Examples of auditable data include transaction logs, replication logs, and billing/finance events.
Operational data, conversely, doesn't require strict completeness. To maintain manageable costs, sampling is often employed, and some degree of data loss is acceptable.
Tools designed to manage operational data often favor best-effort delivery, skipping retries and costly exactly-once delivery guarantees. Examples of operational data include telemetry, metrics, and contextual data that describe each request and system component.
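The difference can be illustrated with a small sketch: an auditable pipeline counts every record, while an operational one may keep only a sample and scale the result back up. The systematic 1-in-10 sampling rule and the synthetic request sizes are assumptions made purely for illustration.

```python
# Synthetic per-request payload sizes in bytes (illustrative data only).
request_sizes = [100 + (i * 37) % 900 for i in range(1_000)]

# Auditable approach: every record is kept, so the total is exact.
exact_total = sum(request_sizes)

# Operational approach: keep every 10th record and scale the sum back up.
SAMPLE_EVERY = 10
sampled = request_sizes[::SAMPLE_EVERY]
estimated_total = sum(sampled) * SAMPLE_EVERY

relative_error = abs(estimated_total - exact_total) / exact_total
print(f"exact={exact_total} estimated={estimated_total} error={relative_error:.2%}")
```

The estimate is cheap to compute and close enough for dashboards and alerts, but an error of even a percent or so would not be acceptable on an invoice.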
Before deciding on the methodology for collecting, processing, and storing your usage data, it is important to determine whether your data needs to be auditable or operational.
In the following section, we will compare two data collection strategies: event-driven metering, generally better suited to auditable use cases, and time-series monitoring, the preferred method for collecting operational usage data.
There are two main ways to collect usage data: event-driven metering and time-series monitoring systems.
Here's how they compare:
Event-driven metering: Usage-based billing companies favor this approach as it is auditable due to its inherent consistency in handling unique events. Events can be delivered more than once in distributed systems, so they are deduplicated using unique identifiers to prevent over- or under-billing.
Metering deals well with high cardinality, which is necessary to track every customer's usage. The challenge, however, lies in data collection. The industry has robust infrastructure collectors for monitoring, but these were not designed with events in mind.
Most vendors provide a POST API for event submission, leaving the collection process up to the user.
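As a minimal sketch of the event-driven approach, the snippet below deduplicates events by their unique ID and aggregates usage per customer and hour. The event field names are hypothetical rather than any vendor's schema, and the in-memory set stands in for the durable deduplication store a real metering system would need.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical usage events; field names are illustrative, not a vendor schema.
events = [
    {"id": "evt-1", "customer": "alice", "time": "2023-07-01T13:05:00+00:00", "sms": 60},
    {"id": "evt-2", "customer": "alice", "time": "2023-07-01T13:40:00+00:00", "sms": 40},
    # Same ID as above: a redelivery caused by a client retry.
    {"id": "evt-2", "customer": "alice", "time": "2023-07-01T13:40:00+00:00", "sms": 40},
    {"id": "evt-3", "customer": "bob", "time": "2023-07-01T14:10:00+00:00", "sms": 5},
]


def aggregate_hourly(events):
    """Sum usage per (customer, hour), counting each event ID only once."""
    seen_ids = set()
    totals = defaultdict(int)
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery: skip it to avoid over-billing
        seen_ids.add(event["id"])
        ts = datetime.fromisoformat(event["time"]).astimezone(timezone.utc)
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        totals[(event["customer"], window_start)] += event["sms"]
    return dict(totals)


usage = aggregate_hourly(events)
for (customer, window_start), quantity in sorted(usage.items()):
    print(customer, window_start.isoformat(), quantity)
```

Because every event carries its own identifier, the producer can safely retry on failure: a retry can only create a duplicate, which the consumer discards, rather than a gap.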
Time-series monitoring: Monitoring systems like Prometheus scrape counters and histograms to store and supply metrics as time-series operational data.
Keeping cardinality low is advised, making it difficult to track individual user resource consumption on a large scale. Metrics collection is a well-paved path in the industry, with out-of-the-box metrics extractors available for most infrastructure components.
APM vendors have invested significantly in standards like OpenTelemetry to streamline data collection. The challenge lies in metrics collectors' limited guarantees around delivery and deduplication, since they were designed with operational data use cases in mind.
Prometheus contributors share some thoughts about accuracy here. If you want to dig deeper, you can also find some debate about adjusting scraping to increase counter accuracy here.
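To see why scraped counters are harder to audit, consider a simplified model of how an increase is derived from periodic samples. This is an illustrative sketch, not Prometheus's actual implementation (its `increase()` function also extrapolates to the edges of the query window); the function name and sample values are assumptions.

```python
def counter_increase(samples):
    """Total increase across scraped counter values, treating a drop as a reset.

    `samples` is a list of counter values in scrape order.
    """
    increase = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset (e.g. a process restart): counting began again from 0.
            increase += current
    return increase


# True usage: 100 requests before a restart, then 30 more, so 130 in total.
# The counter was scraped at 0, 40, and 90; the process restarted after
# reaching 100 (a value that was never scraped), then was scraped at 10 and 30.
scraped = [0, 40, 90, 10, 30]
print(counter_increase(scraped))  # 120
```

The 10 requests served between the last scrape and the restart are never observed, so the derived total undercounts real usage. That is acceptable for alerting, but it is the kind of silent loss an auditable billing pipeline cannot tolerate.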
Summarized as a table:
| Collecting Usage | Auditable | Consistency | Collectors and Standards |
|---|---|---|---|
| Event Metering | Yes | High | Low |
| Time-Series Metrics | No | Moderate | High |
The current challenge lies in collecting and integrating usage data. Collection is complex because it must balance accuracy, cardinality, and real-time requirements differently per use case (as illustrated in the events vs. metrics comparison), while integration is time-consuming because there is no standard usage specification.
Just think about all the custom vendor APIs or the generic PromQL interface. This lack of consolidation makes it difficult to integrate usage data into billing, chargeback, and cost analysis use cases, and often results in separate systems each collecting their own usage data rather than sharing it.
FOCUS (FinOps Open Cost & Usage Specification) by the FinOps Foundation aims to address the integration challenges of usage data. FOCUS outlines a specification for cloud providers and SaaS vendors to produce, and for their customers to consume, normalized usage and billing data.
FOCUS is intended to let you integrate usage data between vendors for billing and cloud cost analysis use cases.
The FOCUS specification is still under development: the 0.5 preview version was released at the end of June 2023, and the specification currently focuses more on billing than on usage data.
You can follow or join the FOCUS working group here.
I don't anticipate a convergence of event metering and metrics systems as they each balance distinct business and engineering trade-offs to cater to their use cases. Just think about the differences between auditable and operational data.
But I do expect convergence on standards for integrating usage data between vendors, such as the FinOps Foundation's FOCUS.
We need your input. Should OpenMeter ingest metrics and integrate with Prometheus to streamline billing and chargeback use cases?
Let us know in our open-source repository: https://github.com/openmeterio/openmeter