Guest Post: Incident Management Meets IT Operations

In 1970, a series of devastating wildfires swept across Southern California, destroying over 700 homes across 775 square miles in 13 days and causing more than $233 million in losses (over $1 billion in today’s dollars, adjusted for inflation). Thousands of firefighters from around the state and beyond responded, but found it very difficult to work together. They certainly knew how to fight fires, but lacked a common management framework that could scale up or down based on the needs of the incident. They also lacked a standardized approach to incident leadership that extended beyond each individual fire department. Shortly thereafter, fire service leaders came together and created a new and, at that time, revolutionary system for managing incidents, capable of handling everything from everyday fire and medical incidents to the large-scale incidents that make the national news. A new way of managing incidents was born!

The Incident Management System (IMS) is an all-hazard, all-risk framework for managing incidents. Over 1.5 million fires per year in the U.S. are managed using IMS. It’s been battle-tested under the most extreme conditions. It works for fires and definitely works in IT.

In the 1970s, IMS was revolutionary because it was able to cut across the geographical, jurisdictional and cultural boundaries that separate fire departments. It offered a nondenominational framework that, in essence, served as an operating system for people, whether within a single fire department or across multiple fire departments that needed to work together. And even though the core concepts of IMS seem simple and intuitive, adopting them wasn’t easy, because change isn’t easy regardless of the company, profession, or industry.

The fire service has a strong culture, strong personalities and strong opinions, and agreeing on a way of doing anything, much less adopting IMS as “the way” of managing incidents, was a herculean feat. As a testament to how efficient and useful IMS is, it overcame all the political inertia and resistance to change that could have killed it off as a fad or “something that just won’t work for us,” as many fire departments are prone to say. IT is much like the fire service. It also has a strong culture, strong personalities and strong opinions. Any company can argue that their environment, company size or complexity is so unique as to preclude them from adopting IMS as a way to manage incidents. We know this is not true.

The Blackrock 3 Partners have a unique viewpoint into the domain of Incident Management. Collectively we bring over 100 years of Fire Service and Critical Infrastructure experience (we have literally published books on the subjects), and we pioneered the adaptation of the Incident Management System (IMS) from public safety into corporate IT environments.

We have helped our clients successfully implement IMS in enterprises, service providers, and DevOps shops, across multiple industries and around the world! Each adopted IMS concepts and methods, aligned with common sense and intuition, in order to build excellent incident response programs.

You may not think that a building on fire and an IT incident have much in common, but from an IMS perspective, they are fundamentally the same. Both fire and IT incidents occur without warning, are dynamic (i.e., both are in progress and not under control), create a negative impact of some type, and require a coordinated effort of the right people performing the right tasks at the right time to return systems to normal (i.e., a building that is not on fire or an IT environment that is not in a degraded state). The burning building and the IT incident both create downtime, and the incident responders are there to bring the environment back to uptime.

When working as an organized team under strong leadership, incident responders with technical skills can assess a dynamic and evolving situation, develop plans to resolve the issue, communicate those plans, and work together to return to uptime in the shortest amount of time possible. To that end, there is a difference between responding to an incident and reacting to it. Responders are trained, organized, and disciplined in their approach to resolving an incident. They bring their experience and skill to the incident with focus and direction. Reactors, on the other hand, tend to be emotional and undisciplined, whether as individuals or as a team. Each reactor generally has a different viewpoint on what’s important to resolving the situation. There is likely no coordination among reactors, no recognition of the importance of a team, no delegation of tasks, no organized sharing of ideas or development of solutions, and no focused effort by the group as a whole.

Responders are calm, cool, and collected and can think clearly under pressure. They arrive and direct the events that ultimately resolve an incident. Reactors get emotional and irrational and cannot stay focused or organized. They arrive and see an emergency, not an incident. Which one are you?

Clearly, incident response is best accomplished by responders. Perhaps a good way to get your head around being one is by adopting this viewpoint from the fire service: “Fire is not an emergency to the fire department. It’s what we do.” When you dial 911 for the fire department, you expect a rapid response from a group of professionals, skilled in the art of solving whatever issue you are having on your “bad day.” IT responders, regardless of whether they are using DevOps practices, ITIL, or homegrown systems, are similar to firefighters and should think of themselves in the same way. IT responders reduce the impact of an IT issue and restore the environment to uptime.

Incident response is a people-to-people activity. The attitudes and demeanor of the people, and how they work together as a team, are vital.

To set the stage for anyone tasked with resolving technology incidents, it is important to understand a fundamental concept: when an incident occurs, all individuals responsible for resolving it must shift their thinking and decision-making from a Peacetime posture to a Wartime posture, and immediately transition from being a day-to-day technical resource working for the company to being an incident responder tasked with defending the business. Make no mistake, downtime is an attack on your very livelihood, and the livelihood of everyone else in the company!

Peacetime is the steady-state environment of continuing operations that exists in non-incident mode. It’s simple. Peacetime is uptime.

Wartime is an urgent, degraded mode of operation that occurs when any application or infrastructure element experiences an issue outside the normal course of business. Wartime is downtime.

We’ve thought quite a lot about this Peacetime/Wartime analogy, and we realize it might be viewed as excessive or that its connotations may make some readers uneasy. The important thing to focus on, however, is that companies build the business in Peacetime and defend the business in Wartime. In Wartime, the company is in downtime, and its reputation, trust and financial performance are at risk. Therefore, the people who respond to the IT incident must make a rapid change from their day-to-day Peacetime job to being a Wartime incident responder.

This doesn’t mean, however, that responders are frantic or hysterical. It means that the group understands the need to assemble quickly, get organized, stay on task, and get on with the business of resolution with urgency (not emergency!) and intensity. If you come from an agile development environment, think of incident response as a really fast and compressed sprint!

Having a responder (Wartime) mentality, however, is just the tip of the iceberg when it comes to resolving IT incidents. An excellent group of technical experts without a strong leader and a framework to organize themselves cannot resolve incidents with maximum efficiency in minimum time. Conversely, strong leadership and a framework to organize people, without the right technical expertise, will not solve any issue quickly or efficiently. To that end, there must be the right mix of expertise and leadership when it comes to resolving incidents.

Also important to incident response is the use of monitoring and alerting tools for the technology stack. These tools provide the initial information for the responders and help them size up the incident and assign a severity (SEV) level.
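To make the size-up step a little more concrete, here is a minimal Python sketch of how an alert from a monitoring tool might be mapped to a SEV level. Everything in it is an assumption made for illustration: the `Alert` fields, the three-level `Severity` scale, and the thresholds in `size_up` are hypothetical and are not taken from IMS or from any particular monitoring product. Your own SEV definitions should come from your organization's incident response program.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical SEV scale: a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Alert:
    """Hypothetical alert payload; real monitoring tools expose different fields."""
    service: str
    customer_facing: bool
    percent_users_affected: float  # 0.0 to 100.0


def size_up(alert: Alert) -> Severity:
    """Assign an initial SEV level from the alert data.

    The thresholds below are illustrative assumptions, not a standard.
    """
    # Customer-facing service with most users affected: treat as most severe.
    if alert.customer_facing and alert.percent_users_affected >= 50.0:
        return Severity.SEV1
    # Any customer-facing impact, or a sizeable share of users affected.
    if alert.customer_facing or alert.percent_users_affected >= 10.0:
        return Severity.SEV2
    # Everything else starts at the lowest severity.
    return Severity.SEV3


if __name__ == "__main__":
    alert = Alert(service="checkout", customer_facing=True, percent_users_affected=80.0)
    print(size_up(alert).name)
```

The point of a sketch like this is only that the initial SEV level can be derived from data the tools already provide, so responders start from a shared assessment. The responders can still raise or lower the level as the incident evolves.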


About the authors

The Blackrock 3 Partners work directly with IT organizations in corporations that produce $300B of revenue and created $800B of market cap, while employing nearly 2.5 million people globally. These companies rank in the top 10% of the Fortune 500 and PwC Global 100 Software Leaders, operating globally in the Industrial, Financial Services, Consumer Products, Telecommunications and Software sectors, serving markets in North America, Europe, Middle East, Africa, Asia and the Pacific. We have delivered our Incident Management programs in nine countries across three continents.

The Blackrock 3 Partners have trained, evaluated and exercised thousands of Incident Commanders (IC) and Subject Matter Experts (SME) working in Site Reliability (SR) teams, Global Command Centers, Network Operation Centers (NOC), Emergency Operations Centers, Regional Operations Centers and War Rooms. Those incident responders staff functional teams including Site Reliability, Computer Security Incident Response Teams, Mission Critical Support, Unified Command, Operations and Engineering/Technology (Network, Database, SAN/Storage, Server, Automation, Applications).