As you hopefully know from reading some of my recent articles, I really enjoy writing Kubernetes operators! I've written a bit about them on this blog and also just recently presented a talk about how to build them at the inaugural FOSSY conference in Portland, Oregon. So here goes another operator-related article, although this one is about one of the later stages of operator development: performance tuning.

Since I've been working on an operator that includes multiple controllers, each with four reconcilers running concurrently that make API calls to resources outside of the cluster, I wanted to start profiling my reconciliation loop performance to ensure that I could handle real-world throughput requirements.

Prometheus Metrics

Kubebuilder offers the ability to export operator-specific Prometheus metrics as part of its default scaffolding. This provides a great way to profile and monitor your operator's performance under both testing and live conditions.

Some of these metrics (many on a per-controller basis) include:

- A histogram of reconcile times
- Number of active workers
- Workqueue depth
- Total reconcile time
- Total reconcile errors
- API server request method & response counts
- Process memory usage & GC performance

I wanted a way to monitor these metrics for controllers that are under development on my local machine. This means that they are running as processes outside of any cluster context, which makes it difficult to scrape their exposed metrics using a standard Prometheus/Grafana setup.

operator-prom-metrics-viewer

So I thought, why not craft a little GUI to display this information? Inspired by the absolutely incredible k9s project, I decided to make a terminal-driven GUI using the tview library.

Since I'm not (yet) storing this information for further analysis, I decided to only display the latest scraped data point for each metric.
But I also didn't want to implement my own Prometheus metric parser to do this, so I imported pieces of the Prometheus scrape library itself to handle the metric parsing. All I had to do was implement a few interfaces for storing and querying my metrics to interop with the scrape library.

Prometheus in-memory, transient storage

The Prometheus storage and query interfaces are relatively simple, and it helps that we actually don't need to implement all of their methods!

Storage

The underlying storage mechanism of my InMemoryMetricStorage type is a simple map along with a mutex for locking it during concurrent reads and writes. In this map, we are only storing the latest value for each metric, so there's no problem overwriting a key's value if it already exists. And we also don't need to worry about out-of-order writes or any other time-series database problems.

    type InMemoryAppender struct {
        data map[uint64]DataPoint // latest data point per label set
        mu   *sync.Mutex          // guards data
    }

The DataPoint struct maps directly to a Prometheus metric's structure. If you've worked with Prometheus metrics before, this should look pretty familiar to you.

    type DataPoint struct {
        Labels    labels.Labels
        Timestamp int64
        Value     float64
    }

To populate each DataPoint's key in the map, we can simply use the Prometheus builtin labels.Hash() function to generate a uint64 key for each DataPoint's label set, without having to do any extra work on our end.

To demonstrate, here's an example of a Prometheus metric:

    # HELP http_requests_total Total number of http api requests
    # TYPE http_requests_total counter
    http_requests_total{api="add_product"} 4633433

And how it would be represented by the DataPoint struct:

    d := DataPoint{
        // The metric name travels as the reserved "__name__" label.
        Labels: labels.FromStrings(
            "__name__", "http_requests_total",
            "api", "add_product",
        ),
        Value: 4633433,
        // Prometheus timestamps are int64 milliseconds since the epoch.
        Timestamp: time.Now().UnixMilli(),
    }

Now that we have our data model, we need a way to store DataPoints for further retrieval.
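
To make the "latest value wins" behavior concrete, here is a minimal standalone sketch. It deliberately does not import Prometheus: `Label`, `hashLabels`, and `latestStore` are hypothetical stand-ins, and an FNV-1a hash takes the place of labels.Hash(), so treat it as an illustration of the idea rather than the tool's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// Label is a hypothetical stand-in for a Prometheus label pair.
type Label struct {
	Name, Value string
}

// DataPoint mirrors the struct from the article, using the stand-in Label type.
type DataPoint struct {
	Labels    []Label
	Timestamp int64
	Value     float64
}

// hashLabels is an assumed substitute for labels.Hash(): it sorts a copy of
// the labels by name and feeds them through FNV-1a, so the same label set
// produces the same key regardless of input order.
func hashLabels(ls []Label) uint64 {
	sorted := append([]Label(nil), ls...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := fnv.New64a()
	for _, l := range sorted {
		h.Write([]byte(l.Name))
		h.Write([]byte{0xff}) // separator between fields
		h.Write([]byte(l.Value))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// latestStore keeps only the most recent DataPoint per label set.
type latestStore struct {
	mu   sync.Mutex
	data map[uint64]DataPoint
}

func newLatestStore() *latestStore {
	return &latestStore{data: make(map[uint64]DataPoint)}
}

// put overwrites whatever was previously stored for the same label set.
func (s *latestStore) put(d DataPoint) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[hashLabels(d.Labels)] = d
}

func main() {
	s := newLatestStore()
	ls := []Label{{"__name__", "http_requests_total"}, {"api", "add_product"}}

	// Two scrapes of the same series: the second replaces the first.
	s.put(DataPoint{Labels: ls, Timestamp: 1000, Value: 4633433})
	s.put(DataPoint{Labels: ls, Timestamp: 2000, Value: 4633440})

	fmt.Printf("%d series stored, latest value %.0f\n",
		len(s.data), s.data[hashLabels(ls)].Value)
}
```

Because the key depends only on the label set, a second scrape of the same series replaces the earlier entry instead of growing the map, which is all a "show me the current value" tool needs.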

Appender

Here's the Appender interface of the Prometheus storage backend:

    type Appender interface {
        Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error)
        Commit() error
        Rollback() error
        ExemplarAppender
        HistogramAppender
        MetadataUpdater
    }

Because we're building a simple in-memory tool (with no persistence or availability guarantees), we can ignore the Commit() and Rollback() methods (by turning them into no-ops). Furthermore, we can avoid implementing the bottom three interfaces since the metrics that are exposed by Kubebuilder are not exemplars, metadata, or histograms (the framework actually uses gauges to represent the histogram values; more on this below!).

So this only leaves the Append() function to implement, which just writes to the InMemoryAppender's underlying map storage in a threadsafe manner.

    func (a *InMemoryAppender) Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
        a.mu.Lock()
        defer a.mu.Unlock()

        // Keyed by label-set hash, so a newer sample replaces the older one.
        a.data[l.Hash()] = DataPoint{Labels: l, Timestamp: t, Value: v}
        return ref, nil
    }

Querier

The querier implementation is also fairly straightforward. We just need to implement a single Query method against our datastore to extract metrics from our store.

My solution is far from an optimized inner loop: every call scans all stored data points and compares each requested label, so the cost grows with the number of data points times the number of labels, and there is no index to short-circuit the search. But for looping through a few metrics at a rate of at most once per second, I'm not overly concerned with the time complexity of this function.

    func (a *InMemoryAppender) Query(metric string, l labels.Labels) []DataPoint {
        a.mu.Lock()
        defer a.mu.Unlock()

        dataToReturn := []DataPoint{}
        for _, d := range a.data {
            // Skip series that belong to a different metric.
            if d.Labels.Get("__name__") != metric {
                continue
            }

            // Every requested label must be present with the same value.
            isMatch := true
            for _, label := range l {
                if d.Labels.Get(label.Name) != label.Value {
                    isMatch = false
                    break
                }
            }

            if isMatch {
                dataToReturn = append(dataToReturn, d)
            }
        }
        return dataToReturn
    }

Histogram implementation

Since the controller reconcile time histogram data is stored in Prometheus gauges, I did need to write some additional code to transform these into a workable schema to actually generate a histogram.

To aid in the construction of the histogram buckets, I created 2 structs, one to represent the histogram itself, and the other to represent each of its buckets.

    type HistogramData struct {
        buckets []Bucket
        curIdx  int
    }

    type Bucket struct {
        Label string
        Value int
    }

For each controller, we can run a query against our storage backend for the workqueue_queue_duration_seconds_bucket metric with the name=<controller-name> label. This will return all histogram buckets for a specific controller, each with a separate label (le) that denotes the bucket's limit. We can iterate through these metrics to create Bucket objects, append them to the HistogramData.buckets slice, and eventually sort them before rendering the histogram to keep the display consistent and in the correct order.

The curIdx field on the HistogramData struct is only used internally when rendering the histogram, to keep track of the current index of the buckets slice that we're on. This way, we can consume buckets using a for loop and exit when we return the final bucket. It's a bit clunky, but it works!

v1.0

After all of this hacking, I ended up with something that looks like this:

[screenshot]

It's far from perfect; since I don't manipulate the histogram values (yet) or have any display scrolling, it's very likely that the histogram will spill off the right of the screen.
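
One possible way to tame that overflow, sketched here under my own assumptions rather than taken from the tool: sort the buckets by their numeric le bound, then scale each count to a fixed bar width. The Bucket struct matches the one shown earlier; `leBound`, `sortBuckets`, `renderBar`, the 40-column width, and the sample counts are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
)

// Bucket matches the struct from the article.
type Bucket struct {
	Label string
	Value int
}

// leBound parses an le label into a float. strconv.ParseFloat accepts
// "+Inf"; labels that fail to parse are treated as +Inf so they sort last.
func leBound(label string) float64 {
	f, err := strconv.ParseFloat(label, 64)
	if err != nil {
		return math.Inf(1)
	}
	return f
}

// sortBuckets orders buckets by ascending upper bound.
func sortBuckets(bs []Bucket) {
	sort.Slice(bs, func(i, j int) bool {
		return leBound(bs[i].Label) < leBound(bs[j].Label)
	})
}

// renderBar scales value against peak so a bar never exceeds width columns.
func renderBar(value, peak, width int) string {
	if peak <= 0 || value <= 0 {
		return ""
	}
	n := value * width / peak
	if n == 0 {
		n = 1 // keep small but non-empty buckets visible
	}
	if n > width {
		n = width
	}
	bar := make([]byte, n)
	for i := range bar {
		bar[i] = '#'
	}
	return string(bar)
}

func main() {
	// Made-up cumulative counts, deliberately out of order.
	buckets := []Bucket{{"+Inf", 120}, {"0.1", 40}, {"1", 95}, {"0.01", 3}}
	sortBuckets(buckets)

	peak := 0
	for _, b := range buckets {
		if b.Value > peak {
			peak = b.Value
		}
	}
	for _, b := range buckets {
		fmt.Printf("%6s | %s %d\n", b.Label, renderBar(b.Value, peak, 40), b.Value)
	}
}
```

Since Prometheus bucket counts are cumulative, the +Inf bucket always holds the largest count, so scaling against it keeps every bar within the chosen width.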

The UX can also be improved since you can only switch the controller that is being displayed by pressing the Up and Down keys.

But it's been a fantastic tool for me when debugging slow reconciles due to API rate limiting, and it has also highlighted errors for me when I miss them in the logs.

Give it a whirl, and let me know what you think!

Link to the GitHub: https://github.com/sklarsa/operator-prom-metrics-viewer

Also published here.