A freshly provisioned Kubernetes cluster is usually the bare minimum. Whether you set it up yourself on a cluster of Raspberry Pis, use Amazon EKS, or provision it any other way, it is never production-ready out of the box. The keyword here is production-ready. What do we mean when we say production-ready?

In very simple terms, it means the cluster is ready to face the world: the good, the bad, and the ugly traffic that will come to it, and still stand strong. With that in mind, it is important to have a benchmark for determining whether a cluster is production-ready. The following checklist is my best recommendation for that: https://learnk8s.io/production-best-practices.

In this article, I look at the tools that will make your cluster production-ready. I shall use the five pillars of the AWS Well-Architected Framework as a guide to categorize the different types of tools/services that you must have in the cluster to deem it fit for production. Take note that this list is the bare minimum; the number of services you can deploy in your cluster is endless, but you do not want to deploy too many.

The Categories

The categories for the different tools are as follows:

Reliability and Availability
Security
Network, Monitoring & Observability
Backup/Recovery
Cost Optimization
Cluster Visualization

Reliability and Availability

These two terms actually mean different things: reliability means that a system does what it is designed to do effectively, while availability means that the system is accessible to the user. Both are necessary for the applications in your Kubernetes cluster.

There are tools that can help your applications to be highly available.

Tools

Horizontal Pod Autoscaler (HPA)
Karpenter (https://github.com/aws/karpenter).
Cluster-Autoscaler (https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler).
Goldilocks (https://github.com/FairwindsOps/goldilocks).

HPA scales pods based on CPU and memory metrics, while Karpenter and Cluster-Autoscaler scale nodes based on whether the existing nodes have capacity for the pods that need to be scheduled. Goldilocks recommends appropriate CPU and memory requests and limits for your pods based on how much they have consumed over a period of time.
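
To make the pod-scaling part concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes HPA documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the min/max parameters are illustrative choices of mine, and the real controller also handles details such as missing metrics and pod readiness that are left out here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation (illustrative, not the controller's code)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Round up so the workload is never under-provisioned.
    wanted = math.ceil(current_replicas * ratio)

    # Keep the result within the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

With four pods at 90% average CPU and a 60% target, the ratio is 1.5, so the sketch asks for six pods.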

Security

The security of your cluster is an essential part of running workloads in Kubernetes. Security in Kubernetes is twofold. First is security against external attacks on your API server, which is usually secured with certificates; putting the API server in a private network adds an extra layer, reducing the attack surface because the server is not exposed to the public internet.

The other side of security has to do with scanning your cluster for misconfigurations to ensure that the running pods follow proper security standards. Here are some security tools that can help scan container-based workloads and Kubernetes clusters.

Tools

Snyk (https://snyk.io)
KubeLinter (https://github.com/stackrox/kube-linter)
kube-hunter (https://github.com/aquasecurity/kube-hunter)
Checkov (https://checkov.io)
Kube-Scan (https://github.com/octarinesec/kube-scan)
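
To give a feel for what configuration scanning means in practice, here is a small Python sketch that inspects a pod spec (as a plain dictionary) for a few common issues: privileged containers, containers not forced to run as non-root, missing resource limits, and unpinned image tags. The field names (securityContext, privileged, runAsNonRoot, resources.limits) are real Kubernetes pod spec fields, but the function, the selection of checks, and the wording of the findings are my own illustration and not how any of the tools above is implemented.

```python
from typing import Any, Dict, List


def audit_pod_spec(pod_spec: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for a pod spec (illustrative checks only)."""
    findings: List[str] = []
    pod_sc = pod_spec.get("securityContext", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})

        if sc.get("privileged", False):
            findings.append(f"{name}: runs in privileged mode")

        # A container-level setting overrides the pod-level one.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not enforced")

        if not container.get("resources", {}).get("limits"):
            findings.append(f"{name}: no resource limits set")

        # Look only at the last path segment so a registry port is not
        # mistaken for a tag; images pinned by digest are accepted.
        last = container.get("image", "").rsplit("/", 1)[-1]
        if "@" not in last and (":" not in last or last.endswith(":latest")):
            findings.append(f"{name}: image tag is missing or 'latest'")

    return findings


if __name__ == "__main__":
    risky = {
        "containers": [
            {"name": "web", "image": "nginx", "securityContext": {"privileged": True}}
        ]
    }
    for finding in audit_pod_spec(risky):
        print(finding)
```

Real scanners cover far more rules than this, but the principle is the same: read the declared configuration and flag anything that deviates from a security baseline.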

Network, Monitoring & Observability

What is a Kubernetes cluster without its network? The network is used to send and receive traffic within the cluster and outside it. It is important to be able to map the flow of traffic within the cluster, because that makes it easy to identify bottlenecks such as slow or unresponsive microservices. This is where a service mesh comes into play.

Service Mesh Tools

Istio (https://istio.io)
Linkerd (https://linkerd.io)
AWS App Mesh (https://aws.amazon.com/app-mesh)
Consul (https://www.consul.io/)

Monitoring & Observability

The availability and Mean Time to Recover (MTTR) of your system depend heavily on the depth of monitoring and observability in the system. Monitoring collects logs and metrics from the worker nodes and from the pods/applications running within the cluster. To know more about logs and metrics, get my book on Amazon CloudWatch. There are many monitoring tools out there, but I will pick a few that stand out in different scenarios. Observability, on the other hand, is a combination of Application Performance Management (APM), bug tracing, and application tracing, with the focus on knowing what is going on within the application.
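
The link between MTTR and availability can be shown with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The Python sketch below uses made-up numbers purely for illustration; the function names are my own.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 24 * 365


if __name__ == "__main__":
    # Hypothetical system that fails about every 99 hours.
    for mttr in (1.0, 0.5):
        a = availability(99, mttr)
        print(f"MTTR={mttr}h availability={a:.4f} "
              f"downtime/year={downtime_hours_per_year(a):.1f}h")
```

Better monitoring does not stop failures from happening, but by shortening the time to detect and diagnose them it lowers MTTR, and the formula shows how that translates directly into higher availability.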

Monitoring Tools/Services

Prometheus (prometheus.io) — for collecting metrics.
Grafana (grafana.net) — for data visualization of data from different sources.
Amazon CloudWatch Container Insights.
Netdata (https://netdata.cloud).

Observability Tools

AWS X-Ray (https://aws.amazon.com/xray/).
Datadog (https://datadog.com).
AWS Cloud Map (https://aws.amazon.com/cloud-map/).
Jaeger (https://www.jaegertracing.io/).

Backup/Recovery

Backing up all the components of your cluster can be a lifesaver in times of disaster or during a migration. This covers Deployments, Pods, Secrets, ConfigMaps, ClusterRoles, ClusterRoleBindings, and all other resources in Kubernetes. The backup is stored as a compressed archive in any of the popular object storage services, such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. It can be restored at any time, to the same cluster or to an identical one, with a single command, and all the resources that were backed up are restored automatically. The backup tool can also be configured to run periodic backups of the cluster resources.

Tool(s)

Velero (https://velero.io)
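
As a rough sketch of the backup, restore, and periodic-backup workflow with Velero, the Python helpers below build the corresponding Velero CLI invocations as argument lists (velero backup create, velero restore create --from-backup, and velero schedule create --schedule). The helper names, the backup names, and the cron expression are hypothetical examples; the script only prints the commands and does not run them, and it assumes Velero is already installed and pointed at an object storage location.

```python
import shlex
from typing import List, Optional


def backup_command(name: str, namespaces: Optional[List[str]] = None) -> List[str]:
    """Arguments for a one-off backup, optionally limited to some namespaces."""
    cmd = ["velero", "backup", "create", name]
    if namespaces:
        cmd += ["--include-namespaces", ",".join(namespaces)]
    return cmd


def restore_command(backup_name: str) -> List[str]:
    """Arguments for restoring everything captured in an existing backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


def schedule_command(name: str, cron: str) -> List[str]:
    """Arguments for a recurring backup driven by a cron expression."""
    return ["velero", "schedule", "create", name, "--schedule", cron]


if __name__ == "__main__":
    commands = [
        backup_command("pre-migration", ["default", "payments"]),
        restore_command("pre-migration"),
        schedule_command("nightly", "0 2 * * *"),
    ]
    for cmd in commands:
        # Print shell-quoted commands for review instead of executing them.
        print(shlex.join(cmd))
```

Building the commands as lists keeps them easy to review or to hand to subprocess.run later without shell-quoting problems.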

Cost Optimization

Most of the time, the tangible cost of a Kubernetes cluster in a cloud environment is the cost of the control plane (master nodes), the worker nodes, and the underlying infrastructure they run on. These costs are clearly displayed in the billing dashboard of the cloud provider, and they can be optimized by choosing a better pricing option for the nodes within the cluster. But you can also break the cost down to the level of individual pods, based on what each pod consumes, and translate that into a dollar value. This can give very interesting insights into which applications cost more to run.

Tool(s)

Kubecost (https://www.kubecost.com/)
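
The idea of translating pod consumption into a dollar value can be sketched in a few lines of Python. This is a deliberately simplified model and an assumption of mine, not Kubecost's actual method: each pod is charged a share of the node's hourly price in proportion to the CPU and memory it uses, with the two resources weighted equally by default. The class name, function name, and prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PodUsage:
    name: str
    cpu_cores: float
    memory_gib: float


def allocate_node_cost(
    node_hourly_cost: float,
    node_cpu_cores: float,
    node_memory_gib: float,
    pods: List[PodUsage],
    cpu_weight: float = 0.5,
) -> Dict[str, float]:
    """Split a node's hourly cost across pods by their blended CPU/memory share."""
    costs: Dict[str, float] = {}
    for pod in pods:
        cpu_share = pod.cpu_cores / node_cpu_cores
        mem_share = pod.memory_gib / node_memory_gib
        blended = cpu_weight * cpu_share + (1.0 - cpu_weight) * mem_share
        costs[pod.name] = node_hourly_cost * blended
    return costs


if __name__ == "__main__":
    # Hypothetical node: 4 cores, 16 GiB, 0.20 dollars per hour.
    pods = [PodUsage("api", 1.0, 4.0), PodUsage("worker", 2.0, 4.0)]
    costs = allocate_node_cost(0.20, 4, 16, pods)
    for name, cost in costs.items():
        print(f"{name}: ${cost:.3f}/hour")
    print(f"idle: ${0.20 - sum(costs.values()):.3f}/hour")
```

Whatever is left after all pods are charged is idle capacity that you still pay for, which is often the first place to look when optimizing cost.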

Cluster Visualization

Kubernetes has its own dashboard, although it is not deployed by default. Most of the time this dashboard is barely used, and its limited features force most Kubernetes administrators to learn kubectl commands to get what they want. But it can be time-consuming to find the right command to check certain resources, make a comparison, or get quick feedback on the behavior of pods or services within the cluster. Some tools make managing a Kubernetes cluster more fun.

Tool(s)

Lens (https://k8slens.dev/)
Rancher (https://rancher.com)

Conclusion

There is no perfect Kubernetes cluster. The tools mentioned here are suggestions for what you should have in your cluster to get the best out of it. You need at least one tool from each of the sections above to have proper visibility and governance of your Kubernetes cluster.

