Everyone is shoulder to the wheel
Teamwork: because none of us is as smart as all of us!
High-five for synergy
Any other crass collaboration memos seen on company boards?
Before we go any further, I am not about to spread anarchy against every collaboration belief. I am generally pro-collaboration when it makes sense and is not counterproductive. It can have a significant impact when done in the right place and format, but in this piece, I will delve into areas where it's doing more harm than good.
So, I lied. I am about to spread some anarchy.
The tale of troubleshooting collaboration while troubleshooting
As Mike sits at his desk, working on cool new features that would save massive compute cycles for metadata extraction services, he receives an email from the CTO addressed to the entire technology team:
TLDR: we are moving to a microservice-orientated architecture! The transition will take 12 months and commence in the new year.
It's the announcement Mike has been waiting anxiously for. At last! Truly autonomous workflows. No waiting around to ship. More agility. More freedom to be creative with code and work on dynamic team projects. At least that is what all the experts are tweeting about (or X'ing about... I'm still not sure what woke is these days). Today is a good day for Mike! He and his colleagues clink beers at the bar: "To microservices!"
Fast forward 18 months, and Mike's hopes are slowly fading. As a backend engineer, the work is mounting, and he spends well over 80% of his time troubleshooting application performance issues instead of designing systems and writing beautiful code.
He is learning the harsh reality that there are always trade-offs. The benefits of faster shipping (which applies only to the frontend teams at the moment), reliability, and scale come at a hefty price: complexity.
Systems are ever-changing, and the visibility lines between services, applications, and ownership are blurred. Meanwhile, the previous processes, coupled with monitoring and logging tools designed for a monolithic architecture, are only adding to the woes.
He is spending more time on calls with DevOps, SREs, and Security to resolve issues than actually doing the job he was hired to do. He and some of his other backend colleagues reckon that running a monolith was more enjoyable… things were more controlled, and yes, shipping was slower, and issues would still arise. Still, at least in that scenario, it was expected.
Mike doesn't mind the challenge of working out how things went wrong, but the processes add unnecessary stress to the situation. One incident caused a severe IT outage: the Kafka service was not feeding one of the authentication services, making it inaccessible to users.
It was a relatively straightforward fix once it was discovered, but the stress of the situation was magnified by the collaboration-as-default epidemic ripping through the IT department.
New IT policy enforces that a collaborative resolution process is the only way forward when something goes wrong. A war room starts each time, and depending on the severity, the CTO and various VPs join the meeting along with Security, DevOps, Senior Developers, and Team Leaders.
Everyone piles into a physical or virtual room, where a myriad of dashboards showing individual system health forms the backdrop to the investigation. A process of elimination starts, and once Infrastructure or Security issues are dismissed as a possible root cause, attention turns to the application layer in the form of internally built services and third-party services.
This is where the real guessing begins, and much speculation is done before Mike is called upon to save the day. After all, he has some knowledge about the inner workings of the app layer.
In the meantime, customers are becoming more agitated, resulting in the CEO breathing down the neck of the CTO! The neck breathing continues down the chain.
All eyes are on Mike. It's like trying to defuse a bomb with 100 eyes on you, all competing to hand you the only set of pliers and hold the torch. They add little value and only add to the stress of the situation.
Mike continues to stitch logs and numerous data sources together before inspecting code and running traces. Four hours later, he finds the issue, and it takes 30 minutes to fix.
That was a close one! Good job, everyone. Mike rolls his eyes.
After the day's stress, Mike heads to the bar to meet some colleagues and dissect the day's events. His colleagues have had similar days in the past few weeks, and all conclude that the existing process is overwhelming and unsuitable for their new microservices-orientated way of doing things.
The real enemy of the day is collaboration-as-default. Unanimously, the group feels that having more cooks in the kitchen results in a more stressful situation. They liken it to having a plumbing issue in your house and calling the builder, electrician, architect, and carpenter to assemble, look at the leak, and give their opinion, even though the plumber is the only one with the skills to find and fix the issue.
They all leave the bar with the same conclusion when it comes to troubleshooting application issues: collaboration is an abused term compensating for poor internal processes and disjointed tooling.
Although collaboration is always well-intentioned, in some instances it's created to fill a gap, buying management time to come up with better ideas, when empowering autonomous processes would be far more efficient.
Definition of collaboration
So, I am not anti-collaboration. In knowledge economies, collaboration is necessary when designing highly advanced and complex systems where a diverse mix of heterogeneous skills is required.
And, so we are all on the same planet, let's anchor the definition of collaboration.
According to the Oxford Dictionary, it is:
the act of working with another person or group of people to create or produce something
In the case above, collaboration made sense during the process of elimination, while narrowing down where the problem might be. But once it was understood that the problem existed in the application layer, heterogeneous collaboration became unnecessary and expensive.
In this example, that 4-hour war room could have resulted in $1,000 to $5,000 of opportunity costs.
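How might a figure like that come about? Here is a back-of-the-envelope sketch. The headcounts and blended hourly rates are hypothetical numbers I picked purely to illustrate the range; they are not measurements, and the function name is just for illustration.

```python
def war_room_cost(attendees: int, hours: float, hourly_rate: float) -> float:
    """Rough opportunity cost: everyone's time in the room at a blended hourly rate."""
    return attendees * hours * hourly_rate


# Hypothetical inputs, chosen only to illustrate the range quoted above.
low = war_room_cost(attendees=5, hours=4, hourly_rate=50)     # small war room
high = war_room_cost(attendees=10, hours=4, hourly_rate=125)  # larger, more senior room

print(f"${low:,.0f} to ${high:,.0f}")
```

Note that this only counts the time of the people in the room. It ignores the work they were pulled away from and the cost of the outage itself, so treat it as a floor rather than an estimate.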
Our lessons about collaboration
When we set out to build Kerno, a collaboration feature was paramount to include in our MVP, following trailblazers like Miro, Figma, and Notion. Or so we thought...
However, after speaking to 100+ engineering leads and developers, that beautifully designed collaboration feature was scrapped.
During our interviews, everyone said that collaboration is essential. However, digging deeper, we concluded that this was only a surface-level request. What people want is a shared understanding and visibility.
People weren't looking for another way to collaborate; they were perfectly happy with Slack. What they were missing was a shared understanding of a situation. Essentially, engineers want to know two things: "Is my service affected?" and "Is my service the root cause of the issue?" If the answer to both is no, they want to continue coding. If the answer to either is yes, they want to know where to go next and who to inform so they can limit exposure and get to a resolution quickly.
Instead, more collaboration-based tools were implemented, forcing people to stop and collaborate to comply with policy when all that was missing was a way to communicate impact and blast zone.
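To make those two questions concrete, here is a minimal sketch of the decision an engineer is really trying to make. Everything in it (the `ServiceStatus` type, the `next_step` function, the wording of the actions) is hypothetical and only illustrates the idea; it is not how any particular tool works.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str
    affected: bool    # "Is my service affected?"
    root_cause: bool  # "Is my service the root cause?"


def next_step(status: ServiceStatus) -> str:
    """Map the two questions to what the engineer should do next."""
    if status.root_cause:
        return "own the fix and inform affected teams"
    if status.affected:
        return "limit exposure and follow the incident owner"
    return "keep coding"


print(next_step(ServiceStatus("billing", affected=False, root_cause=False)))
print(next_step(ServiceStatus("auth", affected=True, root_cause=False)))
```

The point of the sketch is the last branch: for most people, most of the time, the right answer is to carry on with their work.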
Microservices and autonomous teams/engineers need to go hand in hand. This pursuit is to drive better productivity and meet customer demand by shipping more often. The downside is that errors are more frequent, and so is downtime.
It raises the question: why, in these situations, do we opt for a congregation and turn an undesired situation into an even more expensive, stressful, and massively unproductive one?
Just as in a symphony orchestra, you don't witness all the musicians converging on the conductor every time a note needs to be played. Instead, a harmonious arrangement exists where each member knows their part perfectly, guided by the conductor's direction. The orchestra's success results from meticulous preparation, with every musician empowered by their role and the collective process carefully designed by the conductor and composers.
Relating this to incident response, the key is clear incident ownership and empowering the owner to resolve the incident without additional noise.
What does the 9/10 troubleshooting experience look like for the microservices era?
A notification appears from your tray: something is off with one of your team's services. It indicates that a specific service route is having latency issues due to a database query taking longer than usual to respond — but nothing has changed on your side!
After clicking on it, you are taken to a hyper-contextualized, highly-interactive screen of everything that works in coordination with that service and… look at that! There has recently been a change in a seemingly unrelated service that uses the same database cluster, and the cluster is now under stress.
You are also informed that the team leads and the contributors involved in that change have already been notified, and a collaboration space with all the contextual information needed to resolve this issue jointly has already been put in place.
A fix is released twenty minutes later, and all systems normalize.
Our focus should be on creating a seamless way for people to gain a shared understanding, and after that, if contribution is needed, people can be brought up to speed quickly.
An investment in the right direction, I think, is spending time streamlining and standardizing how event and incident information is correlated and presented. Essentially, this means creating a single pane of glass, so everyone is on the same page, looking at the same data structures and going through the same flow.
When engineers need to figure out which logs to look at, switch between multiple tools, and manually correlate information, they risk missing key information, and they prevent other engineers from contributing effectively because those engineers lack a familiar context.
You wouldn't expect two chefs working on the same dish but using different recipes to be aligned. It's no different when engineers are trying to troubleshoot.
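As a rough illustration of what "the same data structures" could mean in practice, the sketch below normalizes records from two sources into one shared `Event` shape and merges them into a single timeline. The input formats, field names, and helper functions are all assumptions made up for this example, not the format of any real logging or deployment tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    service: str
    source: str
    message: str


def from_log(record: dict) -> Event:
    # Hypothetical log record: {"ts": ISO-8601 string, "svc": ..., "msg": ...}
    return Event(datetime.fromisoformat(record["ts"]), record["svc"], "logs", record["msg"])


def from_deploy(record: dict) -> Event:
    # Hypothetical deploy record: {"deployed_at": epoch seconds, "service": ..., "version": ...}
    ts = datetime.fromtimestamp(record["deployed_at"], tz=timezone.utc)
    return Event(ts, record["service"], "deploys", f"deployed {record['version']}")


def timeline(*streams):
    """Merge any number of event streams into one list ordered by time."""
    return sorted((e for stream in streams for e in stream), key=lambda e: e.timestamp)


logs = [from_log({"ts": "2024-01-01T10:05:00+00:00", "svc": "orders", "msg": "slow query"})]
deploys = [from_deploy({"deployed_at": 1704103200, "service": "reports", "version": "v42"})]

events = timeline(logs, deploys)
for e in events:
    print(e.timestamp.isoformat(), e.source, e.service, e.message)
```

Once everything shares one shape, the deploy to the seemingly unrelated service shows up right before the slow query, without anyone stitching two tools together by hand.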
Finally, creating a shared understanding and notifications are not mutually exclusive. Relevance should be baked into notifications to prevent unnecessary notifications, especially if errors are becoming more frequent.
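One way to bake relevance in, sketched under assumptions: if you know which components depend on which, you can walk that graph from the component that changed and notify only the owners inside the blast zone. The dependency map, owner names, and function names below are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical map: component -> components that depend on it directly.
DEPENDENTS = {
    "db-cluster": ["orders", "reports"],
    "orders": ["checkout"],
    "reports": [],
    "checkout": [],
    "search": [],
}

# Hypothetical ownership map: service -> owning team.
OWNERS = {
    "orders": "team-a",
    "reports": "team-b",
    "checkout": "team-a",
    "search": "team-c",
}


def blast_zone(origin: str, dependents: dict) -> set:
    """Everything that depends on `origin`, directly or transitively (breadth-first)."""
    seen = set()
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def who_to_notify(origin: str, dependents: dict, owners: dict) -> set:
    """Only the owners of services inside the blast zone get a notification."""
    return {owners[s] for s in blast_zone(origin, dependents) if s in owners}


print(sorted(blast_zone("db-cluster", DEPENDENTS)))
print(sorted(who_to_notify("db-cluster", DEPENDENTS, OWNERS)))
```

In this toy example the team that owns `search` never hears about the incident, which is exactly the outcome you want: they sit tight and keep working.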
Collaboration-as-default for troubleshooting in microservices causes more harm than good. Although well intended, the hidden costs add up.
Engineering leaders should focus on creating a dynamic and relevant shared understanding that tells people when action is needed and when to sit tight.
This approach would save valuable time that could be spent elsewhere and, more importantly, foster trust across the team while reducing stress levels during IT downtime.
Also, be wary of collaboration warriors - ask them why all the time :)