OpsGenie’s Incident Response Orchestration - the Whys

by Emel Dogrusoz Aug 9, 2017


OpsGenie is extremely proud to announce our new Incident Response Orchestration platform and to tell you why we created it.

Why OpsGenie is Introducing this Platform

For years, we had this vision. We analyzed market trends and needs, and we listened to our customers to understand how they work and how they can work even more efficiently. We have also been doing our own incident response management for many years, and we learned a great deal from those experiences. We knew that we needed a powerful, effective incident response orchestration platform as much as our customers did! All this helped us tremendously while we were defining our goals and designing our new system to provide a best-in-class solution.

Why Incident Response Management is critical in a Twitter World

The pervasiveness of social media in our world means that almost any incident has the potential to seriously damage any organization: impacting revenue, branding, image, and reputation.    

Let’s face it, we all use social media to complain about disappointing goods or services, and about bad experiences with vendors! We all follow stories about the failures of companies and individuals, and we see how even the smallest problems can spin out of control when such stories go viral.

Customer experience and social media management are more important than ever, which puts pressure on everyone in an organization and makes communication and collaboration between business units essential. The days when incidents could occur without impacting everyone in an organization are over. Any incident may involve many internal stakeholders who need information: marketing, communications, and executives, even those at an organization’s highest levels, must be notified and updated about incidents and resolution progress.

This is why serious businesses (regardless of size) have been re-examining their incident management and response processes. This also explains the proliferation of on-call scheduling and related tools to support organizations as they encounter problems involving complex incident scenarios. Organizations and their response teams know that they must do more in less time.

At this point, we need to emphasize that incidents will continue to happen —  no matter how much effort you put into preventing them from happening. That doesn’t mean that we don’t believe in incident prevention measures —  of course we do! It’s just that IT systems are becoming increasingly complex: various applications, services, and systems are being connected to each other, often in unique configurations. Sometimes there are not-so-obvious dependencies among these tools and systems.

You cannot anticipate and prevent all the problems that might possibly arise from the relationships and interactions of sophisticated IT systems, so incidents are inevitable. How you respond to such incidents is what matters! The keys to successfully responding to incidents are effective planning, automation, and communications.

Why Effective Planning, Automation, and Communications are the Keys to Successful Incident Management

Incident responders have a lot to do during incidents. They are under pressure to resolve incidents with minimal downtime. At the same time, stakeholders are demanding detailed updates about the incident and the responders’ progress in resolving it. When responders have to handle such tasks manually, in the middle of a crisis, it takes longer to resolve problems, and the delay stresses both internal stakeholders and external customers.

Giving responders the right tools and automated processes gives them back the time they need to respond to incidents more quickly and allows your organization to maintain effective communications. Generally, it is better to automate incident communications. At the least, however, an organization needs to consider and plan for the following in advance of incidents:

  • Classifying incidents based on impact, urgency, and priority levels in order to guide incident resolution workflows

This is the first step of any incident response process. With proper incident classification, you can:

    • Identify the best specialists (responders) to handle an incident.
    • Specify how to route and escalate incidents.
    • Determine reporting matrices (for management information).
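
As a rough illustration of the classification step, the sketch below maps impact and urgency to a priority and then looks up a route for it. The three-level scale, the P1 to P5 labels, the team names, and the escalation timings are all hypothetical values chosen for the example; they do not describe how OpsGenie itself classifies or routes incidents.

```python
from dataclasses import dataclass

# Hypothetical three-level scale used for both impact and urgency.
LEVELS = ("high", "medium", "low")

# Hypothetical impact/urgency matrix, keyed by (impact, urgency).
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}

# Hypothetical routing table: which team responds first and how many
# minutes may pass before the incident is escalated.
ROUTING = {
    "P1": {"team": "on-call-primary", "escalate_after_min": 5},
    "P2": {"team": "on-call-primary", "escalate_after_min": 15},
    "P3": {"team": "service-owners", "escalate_after_min": 60},
    "P4": {"team": "service-owners", "escalate_after_min": 240},
    "P5": {"team": "backlog-triage", "escalate_after_min": 1440},
}


@dataclass(frozen=True)
class Classification:
    priority: str
    team: str
    escalate_after_min: int


def classify(impact: str, urgency: str) -> Classification:
    """Derive a priority from impact and urgency, then look up its route."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError(f"impact and urgency must each be one of {LEVELS}")
    priority = PRIORITY_MATRIX[(impact, urgency)]
    route = ROUTING[priority]
    return Classification(priority, route["team"], route["escalate_after_min"])
```

Agreeing on a table like this before an incident means responders look the answer up instead of debating it mid-crisis.
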
  • Using incident templates for common issues

Templates can significantly reduce the time needed to adequately record incident details and to properly communicate them. You can automate these tasks simply by applying the correct template for a specific type of incident.

  • Defining roles: who needs to be involved in incident resolution (and how), based on the priority of the incident

Incidents may require the involvement of an incident commander, subject matter experts, stakeholders, public relations (PR) officers, and others. These people may be involved in different ways, requiring different powers and different access to information. Defining their roles in advance means that your responders don’t waste time figuring out what people need and getting them access to it in the middle of responding to an incident.

  • Defining communication channels and collaboration methods for responders

Defining communication and collaboration methods in advance makes it easier to notify and assemble responders, and to include them (and others) in any virtual war room needed during an incident. Modern incident communication and collaboration methods rely on constantly evolving technologies, such as chat tools, conference calls, and web conferencing tools. With the help of such tools, it no longer matters where response team members are located, even if they are in different time zones.

  • Providing runbooks that let responders know what to do and whom to involve

Runbooks for a specific type of incident can accelerate incident resolution by providing easy access to supporting data, such as alerts, logs, configuration changes, metrics, and dependencies.

  • Defining communication channels for stakeholders

In a major incident, everybody in an organization needs to know what is happening —  not just responder teams. At the same time, responders should not be spending valuable time and energy updating internal stakeholders, so automating communications with stakeholders is essential. Giving stakeholders a simple way to check on incident status can save the sanity of both stakeholders and incident response teams.

  • Templatizing communications

Updates are important, because even if an update just says that nothing has changed, delivering such an update can still mean a lot, in terms of maintaining trust. Decisions about how often to provide updates about incidents should be made before incidents occur. Using templates automates communications for different types of incidents —  and saves time during an incident.
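
To make the idea of templated updates concrete, here is a minimal sketch using Python's standard `string.Template`. The incident types, the message wording, the field names, and the update intervals are illustrative assumptions, not part of any OpsGenie feature. The default status shows how a "nothing has changed" update can still be sent on schedule.

```python
from string import Template

# Hypothetical message templates, keyed by incident type.
UPDATE_TEMPLATES = {
    "outage": Template(
        "[$priority] $service outage: $status. "
        "Next update in $interval_min minutes."
    ),
    "degradation": Template(
        "[$priority] $service is degraded: $status. "
        "Next update in $interval_min minutes."
    ),
}

# Hypothetical update cadence per priority, agreed before any incident occurs.
UPDATE_INTERVAL_MIN = {"P1": 15, "P2": 30, "P3": 60}


def render_update(
    incident_type: str,
    priority: str,
    service: str,
    status: str = "No change since the last update",
) -> str:
    """Fill in the template for this incident type with the current status."""
    template = UPDATE_TEMPLATES[incident_type]
    return template.substitute(
        priority=priority,
        service=service,
        status=status,
        interval_min=UPDATE_INTERVAL_MIN[priority],
    )


print(render_update("outage", "P1", "Checkout API"))
```

Because the wording and cadence are fixed in advance, a responder only supplies the current status, or nothing at all when there is no change.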

In a Nutshell…

No matter how good your systems, technologies, and employees are, you will still experience outages and other IT incidents. Good advance planning, automation, and communications, however, can make the difference in how quickly and successfully you respond to and resolve such incidents. Our new platform gives you what you need to succeed.

Check out OpsGenie’s new Incident Response Orchestration platform!