Introduction to Kubernetes Observability
Kubernetes has become one of the most popular ways to deploy applications in cloud or bare-metal environments. By leveraging its features, developers, DevOps engineers, Site Reliability Engineers (SREs), and closely related disciplines can deploy containerized applications that scale with demand, are easy to develop and release continuously, and deliver a better user experience. However, with all these perks, the challenge of maintaining a flexible infrastructure remains. This task often falls on engineers who receive notifications about outages, need to troubleshoot systems they didn’t build themselves, and are responsible for keeping the application available to the end user.
Observability is a general concept built on three pillars - metrics, logs, and traces. Engineers rely on these three signals to confirm that a system is running smoothly and to diagnose it when there’s an outage.
The official Kubernetes documentation covers three sets of metrics available to the user out of the box - stable metrics, beta metrics, and alpha metrics. This classification was introduced as of Kubernetes release 1.26 and labels metrics as follows:
- Stable Metrics - Guaranteed to be maintained by the team; end-users can rely on these to remain in place as new versions of K8S are released.
- Beta Metrics - These metrics are tied to features that are currently being developed. This implies that they can be changed or removed altogether from future releases. In other words, be cautious when writing code and building applications that rely on these metrics.
- Alpha Metrics - As the name suggests, the alpha metrics are in the early development stages. They’re available for use but are likely to change with future releases.
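As a rough illustration of the three stability levels, the sketch below groups metric names by level. It assumes the component exposes metrics in the Prometheus text format and prefixes each HELP string with its level in square brackets, as Kubernetes components commonly do. The sample text, including the two `example_*` metric names, is made up for the example and is not output from a real cluster.

```python
import re
from collections import defaultdict

# Illustrative sample only: the help strings and the example_* metrics are
# assumptions for this sketch, not a dump from a real cluster.
SAMPLE_METRICS = """\
# HELP apiserver_request_total [STABLE] Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{verb="GET"} 42
# HELP example_beta_metric [BETA] A hypothetical beta-level metric.
# TYPE example_beta_metric gauge
example_beta_metric 1
# HELP example_alpha_metric [ALPHA] A hypothetical alpha-level metric.
# TYPE example_alpha_metric gauge
example_alpha_metric 7
"""

# Matches "# HELP <metric_name> [<LEVEL>] ..." and captures name and level.
HELP_LINE = re.compile(r"^# HELP (\S+) \[(STABLE|BETA|ALPHA)\]")


def group_by_stability(metrics_text: str) -> dict[str, list[str]]:
    """Return a mapping of stability level to the metric names at that level."""
    groups = defaultdict(list)
    for line in metrics_text.splitlines():
        match = HELP_LINE.match(line)
        if match:
            name, level = match.groups()
            groups[level].append(name)
    return dict(groups)


for level, names in group_by_stability(SAMPLE_METRICS).items():
    print(level, names)
```

A check like this can help flag dashboards or alerts that depend on alpha or beta metrics before a cluster upgrade.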
The easiest way to start collecting basic Kubernetes metrics is with a metrics scraper. Kubernetes provides an official implementation of such a tool: the Kubernetes Metrics Server collects resource metrics from nodes and pods and exposes them through the Metrics API, where tools such as kubectl top and the Kubernetes Dashboard can read them.
Application logs allow engineers to understand exactly what’s happening at every step of an application. Although they’re highly descriptive and accurate, their volume makes it hard to tell what’s relevant and what isn’t. In other words, logs are notoriously difficult to work through without automated scripts or tools.
When it comes to Kubernetes, logging is different from traditional servers or virtual machines, primarily because of the cluster, node, and pod management practices native to K8S. The Kubernetes control plane manages the lifecycle of pods, which can be scheduled, evicted, crashed, or deleted at any time. The logs associated with those pods are cleaned up shortly after the system deems the pods obsolete. In other words, since pods are ephemeral by design, you have to manage their logs differently than you would in systems without this behavior. This becomes a greater challenge as your production systems spread workloads across different domains, machines, nodes, and clusters.
Basic Logging in Kubernetes Using Stdout and Stderr
In traditional systems, you’ll often see a logging mechanism that writes events to a single file (Ex: app.log). The end-user can view this file using a text editor or a parser that filters the events. As described above, this becomes troublesome for Kubernetes, as we’d need a set of logs for each pod across different clusters, namespaces, etc. Kubernetes therefore captures the standard output (stdout) and standard error (stderr) streams of each container and stores them in a log file on the node. You can view these logs with the kubectl logs command followed by the pod name.
As we discussed in a separate article, it’s possible to modify the base command to apply different filters. For example, you may want to retrieve the logs for the last 3 hours, or 5 minutes. You may want to stream logs continuously in your command line to see what’s happening in real time. You may also want to store a section of your logs in a file. By modifying the kubectl logs command, you can get more granularity.
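The filters described above map onto flags of `kubectl logs`. The following is a minimal sketch of a helper that assembles the command as an argument list. The function name `build_logs_command` and the pod name `my-pod` are hypothetical; the flags themselves (`--since`, `--tail`, `--follow`, `--previous`, `--container`, `--namespace`) are standard `kubectl logs` options.

```python
from typing import Optional


def build_logs_command(
    pod: str,
    namespace: Optional[str] = None,
    container: Optional[str] = None,
    since: Optional[str] = None,
    tail: Optional[int] = None,
    follow: bool = False,
    previous: bool = False,
) -> list[str]:
    """Assemble a `kubectl logs` invocation as an argv list (hypothetical helper)."""
    command = ["kubectl", "logs", pod]
    if namespace:
        command += ["--namespace", namespace]
    if container:
        command += ["--container", container]
    if since:
        # Relative duration such as "5m" or "3h".
        command.append(f"--since={since}")
    if tail is not None:
        # Only the last N lines.
        command.append(f"--tail={tail}")
    if follow:
        # Stream new log lines as they arrive.
        command.append("--follow")
    if previous:
        # Logs of the previous container instance, useful after a crash.
        command.append("--previous")
    return command


# "my-pod" is a placeholder pod name.
print(" ".join(build_logs_command("my-pod", since="3h")))
print(" ".join(build_logs_command("my-pod", since="5m", follow=True)))
```

To store a section of the logs in a file, the resulting list can be passed to `subprocess.run` with `stdout` pointed at an open file, or the printed command can be redirected in a shell.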
Kubernetes runs containers through a container runtime such as containerd or CRI-O; Docker Engine filled this role in older clusters. The runtime captures each container’s stdout and stderr streams and writes them to log files on the host. In practice, this means that logs must be aggregated from every node to get a view of the entire cluster. Depending on your operating system, services, applications, and settings, you should also have access to different kernel and systemd logs. For a pod that runs more than one container, you can retrieve the logs of a specific container by passing its name to kubectl logs with the -c flag.
Here’s a further breakdown of the scopes you should know about when retrieving logs:
- All pods within a particular namespace
- A single pod within a particular namespace
- A distinct container in a pod
- A specific node
- The kube-apiserver
- The kube-controller-manager
- The kube-scheduler
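The retrieval scopes listed above can be sketched as small command builders. This is one possible mapping, resting on several assumptions that may not match your cluster: pods are selected by a label selector (`kubectl logs` needs a pod name or a selector, not just a namespace), control plane components run as static pods named `<component>-<node-name>` in the `kube-system` namespace (typical of kubeadm-style clusters), and node-level logs come from the kubelet's systemd unit, read on the node itself. All function names and argument values are hypothetical.

```python
CONTROL_PLANE_COMPONENTS = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


def logs_for_labelled_pods(namespace: str, selector: str) -> list[str]:
    """Logs for every pod in a namespace that matches a label selector."""
    return ["kubectl", "logs", "--namespace", namespace,
            "--selector", selector, "--all-containers=true", "--prefix"]


def logs_for_pod(namespace: str, pod: str) -> list[str]:
    """Logs for a single pod within a namespace."""
    return ["kubectl", "logs", "--namespace", namespace, pod]


def logs_for_container(namespace: str, pod: str, container: str) -> list[str]:
    """Logs for one container of a multi-container pod."""
    return logs_for_pod(namespace, pod) + ["--container", container]


def logs_for_kubelet() -> list[str]:
    """Kubelet logs; assumes a systemd host and must be run on the node itself."""
    return ["journalctl", "--unit", "kubelet"]


def logs_for_control_plane(component: str, node_name: str) -> list[str]:
    """Control plane logs; assumes static pods named <component>-<node-name>."""
    if component not in CONTROL_PLANE_COMPONENTS:
        raise ValueError(f"unknown control plane component: {component}")
    return logs_for_pod("kube-system", f"{component}-{node_name}")


# "cp-node-1" is a placeholder node name.
print(" ".join(logs_for_control_plane("kube-scheduler", "cp-node-1")))
```

Note that when a selector is used, `kubectl logs` returns only the last 10 lines per container unless `--tail` is set explicitly. On managed Kubernetes services the control plane is usually not visible as pods, so its logs have to come from the provider's own tooling.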
Distributed tracing is an entirely different beast. Tracing is the practice of following the path a specific request takes through the system. As you may already know, this is much easier said than done.
If you’re unfamiliar with the concept, let’s briefly walk through what happens when a user visits an e-commerce website.
Step 1 - The user is served the UI from which they can browse various products.
Step 2 - The user is prompted to create an account. A registration / authentication service is executed.
Step 3 - The user is redirected to the selection page, where they are served different products stored in a database.
Step 4 - The user chooses products and adds them to their shopping cart, which is handled by a back-end service that manages the shopping functionality.
Step 5 - The user goes through a checkout process that includes validation of their payment method, shipping information, encryption of identifying information, etc.
Step 6 - The user leaves the store. At this point, the database is updated with the order, the ERP system is updated with the quantity of items to release, and the shipping department is notified.
The example above is a fairly simple walkthrough of the services that are invoked during a normal user interaction.
Distributed Tracing in Kubernetes Observability
Logs offer information about a specific node, pod, container, or application. Tracing allows engineers to understand the relationships and interactions between those services. As systems scale in complexity, it becomes difficult to troubleshoot with logs alone; traces provide an additional perspective on what happened during an outage and allow engineers to pinpoint not only where the problem occurred, but where it originated. Furthermore, traces can be used to improve overall system performance. In the example we discussed above, you may recall that the user was sent to the payment processing gateway in Step 5. While a payment gateway crash may indicate an issue with that service, the root cause may lie elsewhere. For example, the authentication service may have sent a payload that wasn’t correctly parsed, causing an error. The log of the payment gateway would simply indicate that there was a parsing error within the service. The trace would reveal that the originating service of that message was the authentication service.
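The payment gateway example can be sketched with a simplified span model. This is not the API of any tracing library; `Span`, `upstream_services`, and the service names are hypothetical and only illustrate how parent links in a trace lead from the failing service back to where the request came from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work in a trace; parent_id links it to its caller."""
    span_id: str
    parent_id: Optional[str]
    service: str
    error: Optional[str] = None


def upstream_services(spans: list[Span], failing_span_id: str) -> list[str]:
    """Walk parent links from the failing span back to the root of the trace."""
    by_id = {span.span_id: span for span in spans}
    chain = []
    visited = set()  # guards against malformed traces with cyclic parent links
    current = by_id.get(failing_span_id)
    while current is not None and current.span_id not in visited:
        visited.add(current.span_id)
        chain.append(current.service)
        current = by_id.get(current.parent_id)
    return chain


# Hypothetical trace for the e-commerce example.
TRACE = [
    Span("a", None, "storefront-ui"),
    Span("b", "a", "authentication"),
    Span("c", "b", "payment-gateway", error="payload parsing error"),
]

failing = next(span for span in TRACE if span.error)
print(" <- ".join(upstream_services(TRACE, failing.span_id)))
```

The payment gateway's log would only show the parsing error; the chain of parent spans shows that the request reached it from the authentication service.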
The Challenges of Kubernetes Observability
As we’ve briefly discussed in the previous sections, Kubernetes environments bring their own challenges. Here’s a brief overview of what engineers encounter in production when it comes to Kubernetes Observability.
Modern DevOps and CI/CD best practices allow developers to release their builds far more frequently than ever before. It’s not uncommon to see companies push hundreds of code deployments every single week. Each change needs to be monitored not only for its impact on the application it was deployed to, but also for its impact on neighboring ones. In other words, a constant release cycle makes it difficult for SRE and DevOps personnel to pinpoint which services are creating issues.
Large organizations have adopted a distributed model and operate on a variety of providers and physical hardware. The orchestration tools that deploy and manage these applications, including Kubernetes, are going to “do their best” at ensuring that the hardware is completely abstracted. This typically means that pods, nodes, and containers will be continuously monitored, killed, re-deployed, etc. This process creates a level of complexity that isn’t present in simpler or non-distributed systems.
The Key Opportunities of Kubernetes Observability
Kubernetes Observability is a journey rather than a destination. You may choose to pick one, or multiple approaches to implement at your organization based on system complexity, needs, and business objectives.
Data Collection & Dashboards
Users can collect metrics, logs, and traces using various native and third-party tools designed for Kubernetes environments. When adding a collection and presentation system to your clusters, consider the following areas:
- Deployment & Customization - Depending on the provider, you may need to have your engineering teams deploy and customize the type of data the application collects and presents. It’s important to note that, in general, the more flexibility a platform offers, the more customization and upfront investment it’s going to take to set up and deploy.
- Management - Once deployed, metrics and log dashboards provide a way to manage and improve the cluster's performance. In most instances, you’ll be able to set different alert thresholds, notifications for your team, and ways to collect information about the general state of your cluster.
- Debugging - The biggest opportunity in getting a data collection and dashboarding service in place is its impact on debugging your system during an outage. It’s no secret that site reliability engineers and DevOps teams spend a great deal of their time tracing and eliminating issues in the infrastructure. This process can be fairly complex and seeks to answer questions such as “what’s the nature of the problem?”, “where has the problem originated?”, “who is most directly in charge of the code that caused the problem?”, “who needs to be contacted to help solve the issue?”, etc.
- Monitoring - You can get a ton of information about your cluster, nodes, pods, and applications by issuing various commands from the CLI. However, most of us can’t process a large amount of information in written format. Properly configured dashboards allow you to tell which areas of your infrastructure are running properly and which aren’t. They should also convey a starting point or information about the next steps in understanding an issue.
Data Correlation & Business-Specific Data
Data correlation is the action of creating links or context between different sets of data. This complex endeavor will typically require system / business knowledge and a specific use case the teams can tackle. From our conversations, technical leaders often struggle with getting metrics and data out of their Kubernetes clusters that go beyond the information about the technical aspects of the infrastructure.
Let’s discuss a few use cases.
- Business Impact - by enriching the data you’re collecting from your cluster with business intelligence (often added by an operator / engineer), you can understand an outage's impact on your business. For example, if you’re monitoring a service that runs user authentication, you may want to label it as such and prioritize keeping it up. Alternatively, you may have a service whose failure doesn’t disrupt users or breach SLAs, which you can thus label as less critical for your Kubernetes Observability practice.
- Root Cause Analysis - the metrics collected from your systems will allow you to understand the state of your system. By correlating this data for your specific use case, you should be able to better identify the root cause of certain issues. For example, as developers instrument their code, you’ll not only be able to tell when a pod or node goes down, but you’ll also be able to map the data from the instrumented functions and understand what brought down that specific node or pod.
- Detecting Anomalies - in many complex systems, it’s very difficult to find the cause of an intermittent issue. If it can’t be reproduced, chances are engineers won’t figure out where the problem originated. By creating “logic traps” that would trigger as certain system thresholds are met, it’s possible to better understand what caused a certain anomaly.
- Monitoring & Understanding Trends - by collecting data over a long period of time, you should see patterns emerge. From those patterns you and your team will be able to understand the system better, optimize certain areas and ensure that they are running better, and benchmark your components against each other.
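The “logic trap” idea from the anomaly detection use case above can be sketched as follows. This is one possible interpretation, not a standard tool: the `LogicTrap` class, its parameters, and the sample readings are all hypothetical. The trap keeps a short rolling window of recent readings and, when a threshold is crossed, saves that window so engineers can see what led up to the anomaly.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogicTrap:
    """Fires when a reading exceeds `threshold` and keeps the readings before it."""
    threshold: float
    context_size: int = 5
    recent: deque = field(init=False)
    captured: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Rolling window holding the most recent `context_size` readings.
        self.recent = deque(maxlen=self.context_size)

    def observe(self, value: float) -> bool:
        """Record a reading; return True if the trap fired on it."""
        self.recent.append(value)
        if value > self.threshold:
            # Snapshot the window (including the offending reading) for later analysis.
            self.captured.append(list(self.recent))
            return True
        return False


# Hypothetical CPU utilisation readings as fractions of capacity.
trap = LogicTrap(threshold=0.9, context_size=3)
fired = [trap.observe(reading) for reading in [0.2, 0.3, 0.5, 0.95, 0.4]]
print(fired)
print(trap.captured)
```

In a real deployment the same pattern is usually expressed as an alerting rule in the monitoring system, with the captured context coming from the stored time series rather than an in-memory window.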
Selecting Kubernetes Observability Tools
There’s a variety of options when it comes to Kubernetes Observability. As you’d expect, there’s a wide range of costs, complexity, and scalability among those options. For our clients that run smaller infrastructures, we always recommend looking into open-source solutions first. Deploying something as simple as Prometheus and Grafana to get baseline metrics of your operation is a big benefit at a low cost / effort. As you start to scale your team and require additional metrics, alerting mechanisms, and a detailed view of your metrics / logs / traces, perhaps it’s time to look into an enterprise-grade solution - Ex: Datadog, New Relic, Dynatrace, etc. Furthermore, there are many startups in this ecosystem building tools that aim to disrupt the observability market using various technologies - Ex: OpenTelemetry, eBPF.
Community & Support
Chances are, your team of engineers and developers will prefer tools that have a robust community behind them. There’s nothing more frustrating than trying to get something to work without the ability to reach out to your colleagues and peers. Verify that the solution you’re committing to has a network of customers and developers who use it, are familiar with it, and are able to lend a hand when you experience difficulties.
UI / UX / Ease of Operation
There’s something to be said about complex software and hardware systems. You’re looking for a solution that would simplify the troubleshooting activity, and shorten the learning curve of yet another platform / tool. Get the solution in the hands of your engineering teams and let them “play around with it.” What are their thoughts on the UI / UX? Does it play well with other tools?
There’s been a lot of development in the Kubernetes Observability space over the last few years. The reality is that some of the earlier integrations require a substantial amount of effort from developers as they deploy their applications. In other words, you have a continuous cost of scaling these tools / solutions. On the other hand, various protocols have matured and allow for automated discovery of newly deployed systems thus making it easier on those that don’t deal with infrastructure all day long. Understand what it will take for you to scale beyond what you have today as you choose a specific solution, and understand the costs / effort it will require for you to migrate to another platform in the future.
Conclusion on Kubernetes Observability
Observability is the practice of understanding a complex software / hardware system via metrics, logs, and traces. Each of these three pillars provides a different viewpoint that allows engineers and developers (typically in SRE and DevOps teams) to understand the current state of a system. With this information, they can troubleshoot issues, optimize their infrastructure / applications, mitigate costs, schedule software deployments, etc.