There’s an increasing need for process automation in IT Operations (ITOps) as a result of organizations’ digital transformation initiatives to meet customer and employee demands, as well as remote and hybrid work policies brought on by the pandemic, according to a Transposit study.
Now, ITOps and software engineering teams including DevOps and site reliability engineering (SRE) face increasing complexity in their work, leading to significantly more strain and downtime. The report contains findings regarding the impact of remote work and digital transformation on service incidents and remediation. Findings also span the adoption of automation and SRE practices within ITOps and software engineering teams, including:
- 94% of respondents increased focus on SRE practices in their organization in the past 12 months
- 42% plan to expand SRE efforts in 2021
- 86% of organizations are planning to hire SREs in the next 12 months.
The survey covered 528 IT Operations and software engineering professionals in the United States at organizations with over 300 employees. The research reveals how ITOps, DevOps, and SRE teams are equipped to deal with the increased demands of modern stacks, service incidents, and issue resolution. It also assesses the actual cost of modern DevOps and the challenges in making it affordable for the mainstream. Additionally, the research reveals which barriers to automation keep companies from achieving modern and efficient operations.
Although the vast majority of organizations have incorporated remote and hybrid work policies and have increased digital transformation initiatives since the start of the pandemic, organizations have also been hampered by longer incident resolution, inefficient processes, and lack of automation.
“The shift to remote work, combined with significantly increased demand for cloud and digital initiatives, has stretched the resources of engineering and operations teams to their limits,” said James Governor, RedMonk co-founder and analyst. “Investments in SRE automation are a natural reaction to the situation.”
“Our study aligns with what we’ve been hearing from customers. Organizations have many manual DevOps processes that cause unnecessary toil. And, they are investing too many of their resources – including talent – on building custom in-house tools to automate an incident response process that pulls together all the parts of their software stack,” said Tina Huang, CTO at Transposit.
“Those resources could be put to better use by investing in initiatives that drive companies forward, such as product innovation or customer service, especially during a time of economic uncertainty and, for some industries, instability.”
Impact of remote work and DX on service incidents and remediation
During the pandemic, DevOps, SRE, and IT teams have become overburdened by the sudden acceleration in digital transformation, which has led to more service incidents that affect customers. The following survey results show the impact of remote work and digital transformation on an organization’s ability to remediate service incidents:
- 9 out of 10 organizations experienced an increase in service incidents that have affected their customers since the start of the pandemic, with nearly 60% of respondents observing a 20% increase in service incidents or more
- 93% said that incidents were taking longer to resolve while working remotely, with over half reporting that incidents took 11% to 30% longer to resolve than average
- Nearly 70% saw an increase in the cost of downtime since the pandemic started.
When asked how they will improve their incident management process in the next 12 months to decrease mean time to resolution (MTTR), respondents indicated that they are motivated to get the right tools, processes, and reliable automation in place to keep pace with innovation.
Almost all respondents believed that systematically mining insights from human data (such as archived Slack communications, postmortem interviews, and group feedback) could improve future incident response and operational excellence.
Currently, nearly 60% of respondents say it’s hard to piece together human actions and communications that took place during an incident response.
ITOps adopts site reliability engineering
SREs are essential to any organization for solving infrastructure and operational problems. The study revealed that the acceleration of digital transformation and changing work policies fueled by the pandemic have forced organizations to prioritize SRE as a critical business function.
Even if organizations do not have formal SRE roles, ITOps teams are adopting SRE practices. The survey illustrates that SRE is going mainstream:
- 98% of respondents with the “VP/Director/Manager IT Operations” role increased focus on SRE practices in their organization in the past 12 months
- 62.4% of IT Operations respondents plan to expand SRE efforts in 2021.
SREs are critical contributors to incident resolution and help teams work with complex distributed systems at scale. However, nearly 80% of respondents said individuals responsible for reliability engineering face challenges while trying to resolve incidents as they occur.
More than half of respondents reported that the most common challenge while taking action to resolve an incident was a lack of automation.
Key drivers of automation
The survey showed that automation would be a highly valuable tool for incident management. Organizations are still spending a significant amount of resources, time, and money on manual tasks while responding to incidents. In an attempt to solve the problem, many have invested in building custom tools or bots for automation. Forty percent of organizations have one or more full-time engineers working on custom in-house tools or bots for automating incident response.
Custom development is often required because most commercially available automation platforms do not allow for human-in-the-loop automation. The research revealed that 9 out of 10 respondents believe automation is more reliable and effective when it lets humans use their judgment at critical decision points.
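To make the idea of human-in-the-loop automation concrete, here is a minimal Python sketch of a runbook that runs routine steps automatically but pauses for a person's decision before a risky one. It is an illustration only, not taken from the study or from any vendor's product: the names `Step`, `run_runbook`, and `approve`, and the example steps themselves, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure, for illustration only)."""
    name: str
    action: Callable[[], str]      # performs the step and returns a short result
    needs_approval: bool = False   # True marks a critical decision point


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order, asking a human before any step that needs approval.

    `approve` stands in for whatever channel reaches the on-call engineer
    (a chat prompt, a web form, a CLI question). If the human declines,
    the runbook halts and the remaining steps are not executed.
    """
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append(f"{step.name}: halted by operator")
            break
        log.append(f"{step.name}: {step.action()}")
    return log


# Assumed example runbook: two safe steps around one risky step.
steps = [
    Step("collect_diagnostics", lambda: "diagnostics gathered"),
    Step("restart_service", lambda: "service restarted", needs_approval=True),
    Step("post_summary", lambda: "summary posted"),
]

# Simulated operators: one approves everything, one declines everything.
approved_log = run_runbook(steps, approve=lambda name: True)
declined_log = run_runbook(steps, approve=lambda name: False)
```

In this sketch the diagnostic step always runs unattended, while the restart happens only if the operator agrees. Halting on a declined step, rather than skipping it and continuing, is a design choice made here for simplicity; a real tool might offer either behavior.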
ITOps and software engineering teams are embracing automation to reduce manual processes and eliminate the toil of software development with today’s modern stack. Even so, nearly half of respondents reported that their engineering operations are only 26-50% automated. The study revealed these top barriers to automation:
- Inadequate documentation of institutional knowledge and existing processes: 51.9%
- Lack of clarity about what to automate: 47.3%
- Insufficient knowledge sharing: 43.8%
Documentation is a critical component of DevOps that could move automation forward, yet it is often overlooked. When asked how better documentation, process, and availability of data during incidents impact business, respondents pointed to improved MTTR, enhanced service reliability, streamlined operations, and lower cost of downtime.
“Since the onset of the pandemic, organizations have experienced operational inefficiencies and observed an increase in MTTR and downtime which can be devastating and costly if they’re already experiencing business challenges during the pandemic,” continued Huang.
“Investing in reliable automation tools will help streamline traditionally manual processes and tasks and eliminate inefficiencies to improve operations and deliver value to customers. As companies ramp up their digital transformation initiatives and implement long-term remote work, having reliable automation in place can make every engineer as capable as the best engineer on the team and increase speed to resolution by providing repeatable and reliable processes.”