Incident response is the process of identifying, investigating, and responding to issues and events that disrupt, or could disrupt, normal service operation. Almost every incident response team struggles with a handful of universal challenges. Addressing these common problems can help organizations reduce incident resolution times, minimize cost, and protect their company’s reputation. In this post, we look at five of the most common.
Problem #1: Lack of context about the incident
Perhaps the most obvious problem incident responders face is a lack of context about the incident. When an incident lacks contextual information, your response team struggles to understand the full scale of the problem, make an initial diagnosis, assess the priority, and communicate with other responders, management, and customers.
With automation in place, the alerts that make up the incident can be enriched with both internal and external information. Runbooks, images, logs, and graphs of the relevant metrics can be retrieved from external tools and attached to the alerts as files. This gives responders the context of the incident, saves them from spending time gathering data, and leads to faster resolution.
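As a rough illustration of alert enrichment, the sketch below runs a list of "enricher" functions over an alert and attaches whatever each one returns. The `Alert` class, the enricher names, and the `wiki.example.com` URL are all hypothetical; a real setup would call your own monitoring and documentation tools instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    source: str                      # service or system that raised the alert
    message: str
    attachments: Dict[str, str] = field(default_factory=dict)


# An enricher takes an alert and returns extra context to attach to it.
Enricher = Callable[[Alert], Dict[str, str]]


def runbook_enricher(alert: Alert) -> Dict[str, str]:
    # Hypothetical lookup table; in practice this might query a wiki.
    runbooks = {"db": "https://wiki.example.com/runbooks/db"}
    url = runbooks.get(alert.source)
    return {"runbook": url} if url else {}


def recent_logs_enricher(alert: Alert) -> Dict[str, str]:
    # Placeholder text; a real enricher would fetch logs from a log store.
    return {"logs": f"(recent log lines for {alert.source} would go here)"}


def enrich(alert: Alert, enrichers: List[Enricher]) -> Alert:
    for enricher in enrichers:
        try:
            alert.attachments.update(enricher(alert))
        except Exception as exc:
            # A failing tool should not block the alert; record the failure.
            alert.attachments[f"error:{enricher.__name__}"] = str(exc)
    return alert
```

Recording enricher failures on the alert, rather than raising, is one possible design choice: the responder still gets notified even when an external tool is unavailable.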
Problem #2: Lack of prioritization
You never want to miss critical incidents. On the other hand, too many notifications can cause alert fatigue. Your organization has limited resources, and without a prioritization scheme your response teams can spend most of their time on low-priority alerts that pose no threat. Responders can easily become overwhelmed.
Establish a prioritization system, including the priority coding, and then automate it as much as possible. A sound prioritization system lets your incident response teams focus on the high-priority incidents that require the most attention and set aside the low-priority ones that can wait.
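To make the idea of automated priority coding concrete, here is a minimal sketch. The P1 to P5 codes and the rules (severity plus whether customers are affected) are assumptions chosen for illustration, not a recommended scheme; your own rules will depend on your services.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    message: str
    customer_facing: bool
    severity: str  # assumed values: "critical", "warning", "info"


def assign_priority(alert: Alert) -> str:
    """Map an alert to a priority code using simple, explicit rules."""
    if alert.severity == "critical":
        return "P1" if alert.customer_facing else "P2"
    if alert.severity == "warning":
        return "P3" if alert.customer_facing else "P4"
    return "P5"


def triage(alerts: List[Alert]) -> List[Alert]:
    # "P1" < "P2" < ... sorts correctly as plain strings.
    return sorted(alerts, key=assign_priority)
```

With rules like these, `triage` puts a customer-facing critical alert at the front of the queue and leaves informational alerts at the back.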
Problem #3: Lack of tools for communicating and escalating
When a major incident occurs in your company, you need to communicate quickly and effectively. Success depends on getting the involvement of the right people and picking the right communication channels to notify them.
This is not trivial. You need to find the teams involved, work out who is on call, and choose the best channels to reach them: calling their phones, sending SMS, or messaging them in a chat tool. If they do not acknowledge the alert quickly, you need to decide who should be notified next. Without a structure in place, this is time-consuming and ineffective.
The right technology provides built-in on-call schedules, escalations, and reliable notifications across different channels, such as email, mobile push, voice, and SMS. It can help produce an automated process that does not require manual communication and results in faster resolution times.
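An escalation policy of this kind can be modeled as an ordered list of steps, each saying who to notify, over which channels, and after how long without an acknowledgement. The sketch below is an assumption-laden illustration: the `EscalationStep` type, the timings, and the responder names are hypothetical and do not reflect any particular product's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int          # minutes without acknowledgement before this step fires
    responder: str
    channels: Tuple[str, ...]   # e.g. "push", "sms", "voice", "email"


# Hypothetical policy: primary first, then secondary, then the team lead.
EXAMPLE_POLICY = [
    EscalationStep(0, "primary on-call", ("push", "sms")),
    EscalationStep(10, "secondary on-call", ("voice",)),
    EscalationStep(30, "team lead", ("voice", "email")),
]


def due_notifications(policy: List[EscalationStep],
                      minutes_unacknowledged: int) -> List[EscalationStep]:
    """Return every step that should have fired by now, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s for s in ordered if s.after_minutes <= minutes_unacknowledged]
```

For example, if an alert has gone 12 minutes without acknowledgement, `due_notifications(EXAMPLE_POLICY, 12)` returns the primary and secondary steps but not yet the team lead.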
Problem #4: Lack of efficient ways to collaborate
The previous problem was finding and notifying the right people. Once you have notified them, the challenge is to give them a place to come together and begin the resolution process, whether it’s a conference bridge, a Skype room, or a Slack channel.
Many companies still rely on emails, spreadsheets, and other makeshift methods for collaboration. Without an efficient way to collaborate, resolution takes longer.
The right solution lets you easily share the conference bridge details with responders, even those on different teams. Getting this information out to multiple responders gives them a common virtual room in which to talk about the incident and collaborate on a faster resolution.
A proper solution will also include the ability to log activities and add notes. This enables responders to view the full context of the incident and understand what’s going on before joining the discussion.
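One simple way to picture this is an incident record that carries the bridge details and a running timeline of notes, so a late joiner can read a short briefing first. The `Incident` class and the `meet.example.com` URL below are hypothetical, shown only as a sketch of the idea.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    title: str
    bridge_url: str                          # conference bridge or chat room link
    timeline: List[str] = field(default_factory=list)

    def add_note(self, author: str, text: str,
                 when: Optional[datetime] = None) -> None:
        # Default to the current time and normalize to UTC for a consistent log.
        when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
        self.timeline.append(f"{when:%Y-%m-%d %H:%M} UTC | {author}: {text}")

    def briefing(self) -> str:
        """Text a responder can read before joining the discussion."""
        header = [f"Incident: {self.title}", f"Join: {self.bridge_url}"]
        return "\n".join(header + self.timeline)
```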
Problem #5: Lack of visibility for key stakeholders
Major incidents require more than just the resolution process. Internal stakeholders, executives, board members, customers, partners, and the community must also be informed about the status of an ongoing incident. You need to let your stakeholders know that you are aware of the issue and working on a solution.
Without specific tools for this, you may resort to manually emailing distribution lists or publishing updates in a chat room. However, these manual processes are time-consuming and error-prone.
By building an automated process to keep stakeholders informed, you keep them aligned and free up more of your responders’ time for incident resolution. Note that not every incident requires public disclosure: recommended solutions support the decision either to notify only the people who need to know about the issue or to be more proactive and post updates publicly. It’s also best practice for a solution to support a status page that gives your stakeholders a central location to check on progress.
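The choice between a limited and a public update can be captured as a small mapping from a visibility level to the audiences that receive the message. The audience names and visibility levels below are assumptions made for this sketch; actual delivery (email, status page API, and so on) is deliberately left out.

```python
from typing import Dict

# Hypothetical audiences per visibility level.
AUDIENCES = {
    "internal": ("executives",),
    "public": ("executives", "customers", "status_page"),
}


def plan_update(incident_title: str, status: str, visibility: str) -> Dict[str, str]:
    """Return the message each audience should receive for this update."""
    if visibility not in AUDIENCES:
        raise ValueError(f"unknown visibility: {visibility!r}")
    message = f"[{status.upper()}] {incident_title}"
    return {audience: message for audience in AUDIENCES[visibility]}
```

Calling `plan_update("Checkout errors", "investigating", "internal")` yields a single entry for executives, while the same call with `"public"` also covers customers and the status page.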
OpsGenie is a cloud-based alerting and incident management service that aggregates alerts from multiple IT monitoring systems and ensures the right people are notified at the right time, using multiple notification methods.