
What I Learned about Observability and Troubleshooting from 60+ Hours of Engineering Interviews

Here is what I learned about monitoring and troubleshooting production issues in cloud applications from 60+ hours of interviews.

January 18, 2023

Introduction

Since starting Kerno, our ambition has been to create a product that delights users. Unfortunately, as clichéd as it might sound, many SaaS products don't - they struggle to gain healthy adoption and instead begin to clutter the metaphorical shelf.

Many SaaS start-ups sacrifice user interviews and opt instead for a "ship code and figure it out later" approach. Yet during the early days of product development, you can mitigate a lot of downstream pain by investing in comprehensive user interviews upfront.

Some of the downstream pains I'm referring to are vendors' "scramble" tactics to artificially tick up the DAU count. Don't worry; I am all for lunch-and-learn sessions, product newsletters, and webinars. Still, you can smell desperation - swag, coffee vouchers, and cinema tickets aren't a realistic or sustainable way to improve the user experience of your product. You lose credibility as a vendor, not to mention the $$$ spent satisfying caffeine addicts (I am one of them).

Over the last six months, we have conducted more than 80 interviews with engineering teams scattered around the globe and across all types of industries. These interviews have helped us prove some hypotheses and disprove others, ultimately shaping Kerno v0.1.

In this article, I want to share with you some of the key lessons I have learned in the hope that they may help you in your research, career, or personal project.

PS: I have only listed some of the learnings this time around - as you can imagine, during 80 interviews, a lot pops up!

Note: The interviews focused on observability and troubleshooting production issues, with teams adopting or already running microservice architectures.

Lesson 1: Current Tooling is Making it Unnecessarily Harder

OK, let me caveat this: current tooling is great. It truly is - when you want to fulfill a request for an isolated service or primary system. So, for example: please give me the performance logs of my API gateway… here you go… great.

Now, please give me the cache hit ratio of the MongoDB database… and there it is, at your fingertips.
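Just to illustrate how cheap that isolated question is, here is a minimal sketch in Python using pymongo; the connection string is hypothetical, and the WiredTiger counters are only one reasonable way to approximate a cache hit ratio:

```python
# Rough sketch: approximate the WiredTiger cache hit ratio from serverStatus.
# Assumes a MongoDB instance on localhost:27017 and the pymongo driver installed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
status = client.admin.command("serverStatus")

cache = status["wiredTiger"]["cache"]
requested = cache["pages requested from the cache"]
read_in = cache["pages read into cache"]

# Pages served straight from the cache vs. pages that had to be read from disk.
hit_ratio = 1 - (read_in / requested) if requested else 0.0
print(f"cache hit ratio: {hit_ratio:.2%}")
```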

However, in a microservice architecture, when things go wrong - as in any interconnected system - it tends to be incredibly difficult to navigate the landscape and figure out the root cause.

Suppose you are lucky enough to have a platform team. In that case, they could spend months implementing the necessary logging, monitoring, and tracing tools, and then more time stitching the data together to get it into an interoperable and usable state. From then on, they need to maintain this single source of truth - juggling multiple vendor contracts, licenses, and costs on top of their bespoke dashboards.
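To give a flavour of what that stitching work looks like on the ground, here is a minimal tracing-instrumentation sketch using the OpenTelemetry Python SDK; the service and span names are invented, and a real setup would export to a shared backend (and be repeated across every service) rather than print spans to the console:

```python
# Minimal tracing setup for a single service using the OpenTelemetry SDK.
# A real deployment would swap ConsoleSpanExporter for an OTLP exporter
# pointing at a shared collector/backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Each unit of work gets a span; attributes make it searchable later.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        # ... call payment, inventory, shipping services here ...

handle_checkout("order-123")
```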

Again, that's if you are lucky enough to have a dedicated platform team… the majority of interviewees aren't.

Instead, the current modus operandi is to hop between multiple tools, datasets, and spreadsheets, coupled with numerous Slack chats and Zoom calls, to figure out what the f**ck is going on! As a result, time and resources are burned, and the episode leaves a long-lasting scar on the engineering team.

One example that springs to mind involves a utility company.

🚨 A recent code change meant new users couldn't access the service. This went undetected because the number of affected users didn't meet the alert threshold in their error monitoring tool. A day later, the alarm was raised when a customer logged a ticket.

😤 The first-line response was an ops engineer, who used their specific systems to try to figure out the root cause. Having no luck, they called in the DevOps engineer, who used their tools - no joy. Another 24 hours passed before they contacted the lead engineer, who spent 8 hours piecing information together from different tools and systems before narrowing down the root cause.

📉 All in all, the issue from inception to root cause took 32 hours, not to mention the resources required to carry out the investigation.

💡 The interviewee outlined that if they had had a service providing a unified view of services and dependencies, not only would the issue have been resolved more quickly, it wouldn't have been an issue in the first place.
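The detection gap in that story is worth spelling out. Here is a toy sketch of how a global, count-based alert stays silent on exactly this kind of failure; the threshold and error counts are invented for illustration:

```python
# Toy illustration of a count-based alert missing a low-volume failure.
# New-user signups are a small fraction of traffic, so even a 100% failure
# rate for them can sit below a global error-count threshold.
ERROR_THRESHOLD = 50  # hypothetical "alert if more than 50 errors/hour"

errors_last_hour = {
    "existing_user_login": 3,   # background noise
    "new_user_signup": 12,      # every single new user failed
}

total_errors = sum(errors_last_hour.values())
if total_errors > ERROR_THRESHOLD:
    print("alert fired")
else:
    # 15 < 50, so nobody is paged even though signups are completely broken.
    print("no alert - failure goes unnoticed until a customer complains")
```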

AGAIN, I think modern APM and observability (o11y) tools are great; we need them… but teams are crying out for a complement to these tools - something lightweight that stitches it all together.

Lesson 2: Distributing Knowledge is Expensive

❓ Q: How did you know that service A contained the issue?

💡 A: We rang X; they know how this system works.

You wouldn't believe the number of interviews that produced this answer word for word. Of course, I'd expect this in a small start-up with fewer than ten engineers, but the multinational organizations surprised me (how naive)!

I was initially perplexed, but after dissecting it further, the cause is pretty simple.

Distributing engineering knowledge is hard - it requires significant resources and developer buy-in.

Pressing deadlines and keeping the lights on feel much more critical than the mild inconvenience of waiting on John to tell you what is happening with the system. But what if John is on holiday or, worse, leaves the organization? Burying heads in the sand won't find the answer.

Some people we've interviewed have learned this the hard way. The system goes down; no one seems to know why, so a war room is called. The best minds gather, and hours or days later the root cause is found. (Notably, in some interviews people said that management saw this as a victory and moved on, and the same thing happened repeatedly before any action was taken.)

CTOs and engineering managers are trying to mitigate this by introducing a service ownership model, where developers are accountable for the services they run, coupled with continuous knowledge sharing in the form of presentations, better documentation, and post-event memorandums.

Although this is a great start, interviewees admitted it takes a lot of investment to maintain and manage. Engineering teams need real discipline to keep documentation up to scratch, made harder by the manual upkeep current documentation solutions require. One key takeaway from this portion of the interviews is that organizations need to take the first step: helping teams get a real-time view of their services and who owns them. That level of visibility is a great foundation to build on.
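Even something tiny and machine-readable beats tribal knowledge here. A sketch of what a minimal ownership map might look like - all service names, teams, channels, and URLs below are hypothetical:

```python
# A minimal service-ownership catalogue: the bare facts you want at 2 a.m.
# Service names, teams, channels, and URLs are invented for illustration.
from dataclasses import dataclass

@dataclass
class ServiceOwnership:
    service: str
    owning_team: str
    on_call_channel: str
    runbook_url: str

CATALOGUE = {
    "payments-api": ServiceOwnership(
        service="payments-api",
        owning_team="payments",
        on_call_channel="#payments-oncall",
        runbook_url="https://wiki.example.com/runbooks/payments-api",
    ),
    "user-signup": ServiceOwnership(
        service="user-signup",
        owning_team="growth",
        on_call_channel="#growth-oncall",
        runbook_url="https://wiki.example.com/runbooks/user-signup",
    ),
}

def who_owns(service: str) -> str:
    entry = CATALOGUE.get(service)
    return entry.on_call_channel if entry else "unknown - time to ring John"

print(who_owns("user-signup"))  # -> #growth-oncall
```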

Lesson 3: Dependencies Help Piece it all Together

Our interviews covered engineering teams that have fully adopted a microservice-oriented architecture (MSOA) or are transitioning away from a monolith. The promised land is quicker releases, more effective use of resources, and more bang for the buck.

However, it comes with tradeoffs.

From a delivery point of view, MSOA is untouchable compared to a monolith: a feature release that historically took two months now takes less than three minutes. But the operational side is a different story (see Lesson 1).

A distributed ecosystem means it's harder to understand the current state of services and dependencies. Having hundreds or thousands of services requires engineering teams to invest in multiple logging and error-monitoring tools that support various runtimes and versions, so that alerts are raised when something goes wrong. But the real problem starts beyond that - with root cause analysis. Place your bets.

Solving infrastructure-related issues, although time-consuming, generally takes a fraction of the time needed to triage and fix code-related issues.

From war rooms to well-documented processes involving developers, development leads, DevOps, and SREs, the outcome is usually the same - it's a bloody challenging, time-consuming, and cumbersome experience for all involved.

As I shared in Lesson 2, a clear view of service ownership would be a massive step in the right direction and would speed up coordination when finding the right developer. That aside, the most significant chunk of time is still spent tracing distributed services. Distributed tracing tools can speed up the process, but their effectiveness relies on instrumentation, spans, and coverage (the price tag impacts the latter two). As a result, most people we interviewed opted to trust an engineer-led approach and accept the consequence of being a lot slower.

People agreed that having a way to quickly view services with their first- and second-degree dependencies would speed up the process and even shift them towards a preventive way of working - made possible by a single pane of glass.
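For the curious, "first- and second-degree dependencies" boils down to a bounded walk over the service call graph. A minimal sketch, assuming you already have an adjacency map of who calls whom (the graph below is invented):

```python
# Compute 1st- and 2nd-degree downstream dependencies of a service
# via a breadth-first walk bounded at depth 2. The call graph is hypothetical.
from collections import deque

CALLS = {
    "checkout": ["payments-api", "inventory"],
    "payments-api": ["fraud-check", "ledger"],
    "inventory": ["warehouse-db"],
    "fraud-check": ["ledger"],
}

def dependencies(service: str, max_depth: int = 2) -> dict[str, int]:
    """Return each reachable dependency mapped to its minimum depth (1 or 2)."""
    seen: dict[str, int] = {}
    queue = deque([(service, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue
        for dep in CALLS.get(current, []):
            if dep not in seen:
                seen[dep] = depth + 1
                queue.append((dep, depth + 1))
    return seen

print(dependencies("checkout"))
# -> {'payments-api': 1, 'inventory': 1, 'fraud-check': 2, 'ledger': 2, 'warehouse-db': 2}
```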

Conclusion

Microservice architecture brings a lot of value to both the overall business and the engineering team. Still, this value is only sustainable if teams invest in understanding their service landscape and dependencies - essentially, ensuring that troubleshooting production issues doesn't give back the ground claimed by lightning-fast feature delivery.

A few years ago, I heard Todd McKinnon, CEO of Okta, say, "focus on obvious value for your customers". This has stuck with me ever since, especially in a world where we lose sight of pragmatism quite quickly. Right now, the most obvious value, and the most significant challenge I see, is helping engineering teams connect the dots between their services without the hassle of stitching together and maintaining data sources. Doing this will unlock a hive of productivity, less time spent fixing, and happier engineering teams.

Side Note:

At Kerno, we are trying to turn complicated into simple, aiming to bring developers closer to production endpoints and customers. Our beta product focuses on helping engineers/teams unify their understanding of their microservices, dependencies, and critical performance metrics, enabling better decision-making and prompt troubleshooting. 
