DevOps vs. SRE

Learn the difference between a DevOps Engineer and a Site Reliability Engineer (SRE).

January 18, 2023

Are DevOps and SRE the same?

The short answer is no. These are two distinct functions that focus on different parts of the DevOps lifecycle. 

It can be confusing to think of DevOps as a function within DevOps, but for the purpose of this article, unless called out specifically, "DevOps" refers to the function or role rather than the wider way of working.

DevOps (Development and Operations) is oriented around code builds and pipelines, essentially owning the scheduling of new feature deliveries. SRE (Site Reliability Engineering), by contrast, is oriented around supporting a product or service in a live production environment, and SREs are generally the first responders when issues arise with a service.

A practitioner of DevOps is referred to as a DevOps engineer. DevOps engineers usually come from a software development background and have a strong knowledge of code.

A practitioner of SRE is referred to as a site reliability engineer and typically comes from either a coding background or an operations background, such as a CloudOps engineer. SREs are typically more experienced in infrastructure but also have a firm grasp of code.

The role of DevOps is well documented, so this article focuses more on site reliability engineers.

The role of SREs in DevOps

The DevOps way of working has been widely accepted as the de facto culture for achieving a high-performance software development lifecycle. While developers and DevOps engineers alike focus on and own the design, build, and deploy stages of the lifecycle, SREs work in the operation and monitoring stages, baking in resiliency to keep services running.

SREs, when utilized correctly, can play a pivotal role in keeping the lights on - they have a deep understanding of services, how these services run, and the code that makes up these services. This makes them a respected ally of developers, whom they typically coordinate with in the event of a service issue resulting in service downtime.

What are the critical roles of a Site Reliability Engineer?

Google introduced the practice of SRE in 2003 to help its operations teams keep pace with increasing product velocity, ensuring that services meet their required availability targets by continuously building resilience into systems and services. Since then, SRE has become a common practice in organizations globally, especially those operating with a microservices architecture or transitioning to one.

The critical areas for the SRE teams include:

  • Availability
  • Performance
  • Monitoring and Reporting
  • Incident Response 

Availability

Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) form a common language between product owners, operations, and SREs. Together they describe the performance expected of a service and how that service is actually performing. For example, a business-critical service, such as a payments service, would have a high availability expectation (99.9%), so SREs would need to deploy plans, tools, and contingency measures to ensure that meeting this target is business as usual.
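
To make the 99.9% figure concrete, the following is a minimal sketch of how an availability target translates into a downtime allowance (often called an error budget). The function names and the 30-day window are assumptions chosen for illustration, not part of any particular tool.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO permits over the window (assumed 30 days)."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def availability_percent(downtime_minutes: float, window_days: int = 30) -> float:
    """Measured availability for the window, given the observed downtime."""
    total_minutes = window_days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)
    print(f"Downtime allowed at 99.9% over 30 days: {budget:.1f} minutes")
    print(f"Availability after 20 minutes down: {availability_percent(20):.3f}%")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30-day window, which shows how little room a business-critical service has for incidents.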

Performance

Taking the example of the payment service, SREs would continually monitor the SLI and other performance indicators like latency, ETLs, etc., and keep tabs on performance metrics that feed this service.
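
As an illustration of what monitoring an SLI can mean in practice, here is a small sketch that computes one possible SLI for the payment service: the share of requests that succeeded within a latency threshold. The `Request` type, the 300 ms threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def good_request_ratio(requests: List[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that succeeded within the latency threshold."""
    if not requests:
        # Assumption: no traffic means nothing failed, so report a perfect SLI.
        return 1.0
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


if __name__ == "__main__":
    sample = [
        Request(120, True),
        Request(450, True),   # succeeded, but too slow to count as good
        Request(90, False),   # fast, but failed
        Request(200, True),
    ]
    print(f"SLI: {good_request_ratio(sample):.0%}")
```

Comparing this ratio against the SLO over a rolling window is one common way to decide whether the service is within its target.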

Monitoring and Reporting

This is an essential part of developing a strong SRE practice. Services and dependencies generate high volumes of data, and it can be impossible to track relevant changes in service performance in isolation. Creating a unified view is therefore a key focus area for the SRE. In addition, the better the monitoring experience is, the better the SRE can perform, so data integrity and availability are essential considerations for SREs when choosing their data and monitoring partner. Once a solid monitoring practice is established, the SRE's role is to keep tabs on business-critical services while informing product owners, developers, DevOps engineers, and other relevant parties of any disruption.

Incident Response 

From our experience, this is the most challenging and stressful part of being a Site Reliability Engineer. Depending on what type of service architecture is used, when a non-business-critical system or service goes down, it can be a mild inconvenience for the business. However, this is rarely the case in microservice-orientated architecture (MSOA). In MSOA, the interconnectedness of services can lead to a very complex dependency tree, meaning that if a non-business-critical service goes down, it could also impact a business-critical system because the two are related through a proxy service. As a result, SREs spend a lot of time continuously logging and testing for dependencies.
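
The proxy scenario described above can be sketched as a walk over a dependency map. This is a minimal illustration rather than how any particular tool builds its service map; the service names and the `impacted_services` helper are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def impacted_services(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: for each service, record which services rely on it.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Breadth-first walk outward from the failed service.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service != failed and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    # Hypothetical map: payments reaches recommendations only through a proxy.
    service_map = {
        "payments": ["proxy"],
        "proxy": ["recommendations"],
        "recommendations": [],
    }
    print(sorted(impacted_services(service_map, "recommendations")))
```

In this example the failure of the non-critical recommendations service reaches payments through the proxy, which is the kind of hidden relationship that dependency testing aims to surface before an incident.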

If the service experiences production issues, SREs are usually the first responders, given their intimate knowledge of the service map and their ability to digest and understand code.

Given their access to a wide breadth of information and their understanding of systems, it is no wonder that SREs are great sources of knowledge for development, architecture, and operations teams.

SRE Tooling

An SRE's toolkit can be extensive, but I have summarized the key tools using the following categories:

  • Error monitoring 
  • Incident Response
  • Communication 
  • Observability 
  • Logging tools
  • Tracing 
  • Monitoring

| Category | Used for | Author’s choice | Review (G2) |
| --- | --- | --- | --- |
| Error monitoring | Track and record deltas in performance to help pinpoint where a production issue is located. | Sentry | 4.5 (65) |
| Incident Response | Automate incident response tasks, such as identifying, containing, and eradicating incidents. | PagerDuty | 4.5 (793) |
| Communication | Real-time communication tools help the SRE provide information to stakeholders as events unfold. | Slack | 4.5 (30,852) |
| Logging Management | Store event logs to help trace and reconstruct the timeline of when issues first appeared, supporting the remediation process. | Splunk | 4.6 (346) |
| APM | Application performance monitoring. View the health metrics of applications and set alerts to detect any drop in performance. | Grafana | 4.6 (86) |

The tooling in the table above offers the SRE the ability to analyze various components of a system or service in real time. The cost varies per tool, and enterprise-grade APM tools can run into hundreds of thousands of dollars. The good news is that many of these tools offer generous freemium packages.

FYCK combines the main functionality of all these tools into one dedicated tool for SREs. This means SRE teams can be mobilized without a high cost or the tooling overhead, making for a quicker troubleshooting experience. 

DevOps Engineer vs. SRE Salary

As I have explored throughout this article, the roles of a DevOps Engineer and an SRE differ. Although their accountabilities and responsibilities are different, this author wants to stress that both are equally important to the overall performance of the software development lifecycle. Therefore, comparing salaries does not indicate which role is more critical, but it does provide insight into the scarcity of each role.

According to the job website Indeed.com, the average salary for an SRE is $161,000 per annum (1), while the average salary for a DevOps Engineer is $119,000 per annum. Another insight is that the DevOps engineer role is becoming more widely practiced, resulting in more knowledge and training around the topic, which slightly lowers the barriers to entry. However, as SRE is a relatively new discipline, there may be less support for people looking for a career as a site reliability engineer.

Worth investing in an SRE function?

If Cloud Native is your game, then the answer is yes. As you grow and scale, the responsibility for site reliability will have to shift from individual development leads and cloud ops to SREs, who know this space better than anyone.

The ROI can be calculated in several ways, but the preferred measure is MTTR (mean time to resolve). Reducing MTTR can have a significant impact on time saved and reduce AFK (time away from the keyboard). Do not wait until something goes wrong; teams are already overwhelmed at that stage.
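
As a rough sketch of how MTTR might be tracked, the example below averages the time between detection and resolution across a set of incidents. The incident timestamps and the `mean_time_to_resolve` helper are invented for illustration; real figures would come from an incident tracker.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolve(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents lasting 30, 90, and 60 minutes.
    incidents = [
        (datetime(2023, 1, 3, 9, 0), datetime(2023, 1, 3, 9, 30)),
        (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 15, 30)),
        (datetime(2023, 1, 15, 22, 0), datetime(2023, 1, 15, 23, 0)),
    ]
    print(f"MTTR: {mean_time_to_resolve(incidents)}")
```

Comparing this figure before and after introducing an SRE function gives one simple, if partial, view of the return on the investment.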

Conclusion

DevOps is a way of optimizing overall software delivery and performance. Site Reliability Engineering (SRE) is a function that contributes to the operability and monitoring of production software in the form of services and systems. 

DevOps engineers and site reliability engineers play two distinct roles in the DevOps cycle, complementing each other and making each other successful.

SRE is a relatively new discipline, but it is already (or will soon become) one of the most important functions in the product development lifecycle. SREs' access to knowledge, capacity to process large volumes of data, and intimate understanding of the overall IT ecosystem make them well positioned to be valuable partners for all stakeholders. This author is a true SRE fanboy.
