Previously on our blog, we discussed why a powerful incident response orchestration platform is crucial, and why effective planning, automation, and communications are the keys to successful incident management. In this post, we’ll explain how OpsGenie’s Incident Response Platform enables you to achieve these goals.
Fitting the Pieces Together:
Step 1 - Defining the incident content:
Because IT incidents and the resulting downtime are intolerable, organizations employ more and more monitoring tools to identify issues before they turn into disasters. When an organization uses multiple monitoring systems, each anomaly they detect can trigger a separate alert in OpsGenie (via Integrations). This can result in floods of alerts for the same or related issues, which wastes the time of IT responders who must deal with each alert individually. We introduced the concept of incidents to consolidate related alerts so that responders can focus on them as a single incident; the related alerts are then managed automatically in the course of managing the incident.
Of course, not every incident has the same importance or urgency. Some might impact your entire customer base (resulting in serious business loss) whereas others might be minor issues with no impact on functionality (creating only minor annoyance). To distinguish these, we added support for classifying incidents based on Priority: you can define different response workflows for incidents with different priorities.
OpsGenie simplifies the process of manually creating incidents and selecting alerts to associate with an incident. You can even use our predefined Incident Templates. Our templates can save you time entering the same information — and help you standardize response flows for specific kinds of incidents.
As an alternative to manually creating incidents, you can define incident criteria to have OpsGenie automatically create incidents based on your choices. When alerts match your predefined criteria, OpsGenie aggregates them into an incident. This approach enables you to plan ahead for incidents. Customers who have tried the new platform already appreciate the speed and convenience of this feature.
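To make the idea of criteria-based aggregation concrete, here is a minimal sketch in Python. It is not OpsGenie's implementation or API: the `Alert`, `IncidentRule`, and `Incident` classes, the service-plus-keyword matching, and the `aggregate` function are all hypothetical names we chose for illustration. It only shows the general pattern of grouping alerts from several monitoring tools into one incident per matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str   # monitoring tool that raised the alert
    service: str  # affected service
    message: str


@dataclass
class IncidentRule:
    # Hypothetical criteria: same service plus a keyword in the alert message.
    name: str
    service: str
    keyword: str
    priority: str  # illustrative label such as "P1"

    def matches(self, alert: Alert) -> bool:
        return (alert.service == self.service
                and self.keyword in alert.message.lower())


@dataclass
class Incident:
    name: str
    priority: str
    alerts: list = field(default_factory=list)


def aggregate(alerts, rules):
    """Group alerts into one incident per matching rule.

    Returns (incidents, unmatched_alerts). The first matching rule wins.
    """
    incidents = {}
    unmatched = []
    for alert in alerts:
        rule = next((r for r in rules if r.matches(alert)), None)
        if rule is None:
            unmatched.append(alert)
            continue
        if rule.name not in incidents:
            incidents[rule.name] = Incident(rule.name, rule.priority)
        incidents[rule.name].alerts.append(alert)
    return list(incidents.values()), unmatched
```

With a single rule for a "checkout" service and the keyword "timeout", two timeout alerts from different tools would land in one incident, while an unrelated disk alert would stay a standalone alert.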
Step 2 - Defining the responders:
During an incident, your main goal is to return your systems to normal as soon as possible. To do this, you need to quickly involve responders with the right skills and responsibilities. OpsGenie helps you build responder teams based on skills and responsibilities, and assign them ownership of the systems and processes that require those skills during an incident. You can create and assign responder teams manually (when an incident arises) or you can define these teams in advance and use service-level responder templates to automate the process (as incidents are triggered).
In the case of a major incident, you may need to involve additional experts for a quick, coordinated resolution. In such cases, you can use our Responder Teams feature to assign additional teams. Each team can have its own resolution workflow, with separate alerts dedicated to that team, and can still collaborate with the others (under the umbrella of the incident).
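The relationship between an incident, its responder teams, and the team-specific alerts can be sketched roughly as follows. This is an illustration under our own assumptions rather than the platform's data model: `RESPONDER_TEMPLATES`, `Incident`, `add_responder_team`, and `open_incident` are hypothetical names, and the team names are made up.

```python
from dataclasses import dataclass, field

# Hypothetical service-level responder templates:
# service name -> teams that respond by default.
RESPONDER_TEMPLATES = {
    "checkout": ["payments-oncall", "platform-sre"],
    "search": ["search-oncall"],
}


@dataclass
class Incident:
    title: str
    service: str
    # team name -> message of the alert dedicated to that team
    team_alerts: dict = field(default_factory=dict)

    def add_responder_team(self, team: str) -> None:
        """Attach a team and give it its own alert under this incident."""
        if team not in self.team_alerts:
            self.team_alerts[team] = f"[{team}] {self.title}"


def open_incident(title, service, templates=RESPONDER_TEMPLATES):
    """Create an incident and attach the teams from the service's template."""
    incident = Incident(title, service)
    for team in templates.get(service, []):
        incident.add_responder_team(team)
    return incident
```

Opening an incident for "checkout" attaches the two template teams automatically; calling `add_responder_team("database-experts")` afterwards models pulling in extra experts for a major incident, each with a separate alert under the same incident.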
Step 3 - Defining collaboration methods for responders:
Our new incident response platform provides the automations and integrations you need to orchestrate your response efforts. The new Conference Bridge feature automates the process of setting up audio or video bridge calls, so you can create virtual war rooms as soon as an incident occurs. Responders can access war room information from our applications and notifications and begin collaborating with team members immediately.
You can also use your integrated chat tools to collaborate during incidents, just as you do for alerts. OpsGenie’s powerful bi-directional integrations with chat tools make it possible for your responders to share information and perform a wide variety of incident-related actions in team chat channels.
Step 4 - Defining the stakeholders:
Any incident may need to involve people from across your organization, such as marketing, communications, and other managers and executives. These people need information about incidents and resolution progress, even if they aren’t working on restoring systems.
You can add such stakeholders to an incident to have OpsGenie inform them about it. You can add them manually (when an incident arises) or you can define them, along with the messages to be sent to them, in advance by using service-level stakeholder templates.
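As a rough illustration of what a service-level stakeholder template might hold, the Python sketch below pairs a predefined recipient list with a predefined message and merges in any stakeholders added manually. Everything here is an assumption made for the example: `STAKEHOLDER_TEMPLATES`, `build_notifications`, the addresses, and the message wording are hypothetical and do not reflect OpsGenie's actual configuration format.

```python
from string import Template

# Hypothetical service-level stakeholder templates: who to inform and what to say.
STAKEHOLDER_TEMPLATES = {
    "checkout": {
        "stakeholders": ["marketing@example.com", "cto@example.com"],
        "message": Template(
            "Incident on $service ($priority): $summary. Responders are engaged."
        ),
    },
}


def build_notifications(service, priority, summary,
                        extra_stakeholders=(),
                        templates=STAKEHOLDER_TEMPLATES):
    """Return (recipient, message) pairs for an incident on the given service.

    Stakeholders from the template come first; manually added ones follow,
    skipping duplicates. Services without a template produce no notifications.
    """
    entry = templates.get(service)
    if entry is None:
        return []
    text = entry["message"].substitute(
        service=service, priority=priority, summary=summary
    )
    recipients = list(entry["stakeholders"])
    for person in extra_stakeholders:
        if person not in recipients:
            recipients.append(person)
    return [(recipient, text) for recipient in recipients]
```

Defining the message once per service keeps stakeholder updates consistent from one incident to the next, while `extra_stakeholders` covers the manual case described above.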
You can also provide stakeholders with information about incidents on a Service Status Page, a new feature of the platform. Responders can add updates about an incident to the service status page to make sure that stakeholders continue to have timely access to information about the status and progress of incidents.
Step 5 - Enabling effective incident postmortems:
The keys to optimizing how you respond to your next incident are recognizing that there will be a next one and learning from the one you just resolved. We can help you with that. OpsGenie gives you the tools to facilitate effective incident postmortems. Easy access to data such as chat history, logs, configuration changes, metrics, and dependencies supports your team’s postmortem assessments and decision-making. As a result, your team can learn from past incidents, maintain runbooks with useful and to-the-point information, assess your incident response practices, and improve your effectiveness for future incidents.
Want to learn more about how that all fits together? Contact us to request a demo of the new platform!