You’ve heard the stories and the legends. You may even have had a long, incident-free summer, but you’re a veteran: you know that a long summer can mean a much longer winter. And you know how the cold and exposure of a long winter can burn both your products and your organization.
As the saying goes, or, as our Valyrian translator tells me the saying goes, “Incidents issi māzis.” Incidents are coming…
What can you do?
You can prepare for incidents before the next one happens. To win this game, you need to marshal your allies to fortify your House’s defenses in advance and to hasten your response when incidents arrive.
- Integrate tools that help you protect yourself and orchestrate incident response efforts.
- Design rules and templates that define and classify incidents and incident workflow.
- Automate Incident Communications.
Integrate tools that help you protect yourself and orchestrate your incident response efforts
To fortify your defenses, you need to organize your allies. You need to integrate your monitoring tools, service management and ticketing tools, teaming and collaboration tools, and a strong incident management tool.
Nowadays, every House … that is, every organization … employs tools to watch over its networks, systems, information, applications, and other IT assets. Such monitoring tools can provide vital information about system health and events. [You thought I was going to say weather, didn’t you?] These “little birds” can be valuable, but they can’t guarantee that you will recognize and resolve incidents before your products and services (or your organization) suffer incident-related losses.
Monitoring tools can send you alerts about false alarms, minor events that will resolve themselves, and other unnecessary information. Having to review and investigate every alert event from every one of your monitoring tools can bury your DevOps, Ops, or Support teams under an avalanche of unactionable alerts. This takes vital time that they need for investigating and resolving real problems. Storms of unactionable alerts can also blind your Watchers — so that they miss alerts that reflect important outages. It can even exhaust your forces: a Watch can become so long and hard that even the best of your force may be tempted to cry out, “My watch is ended.”
A good incident management tool can help you by parsing and filtering alert event information from all your monitoring tools and by combining it all into a single, actionable incident that integrates with your other tools.
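To make the idea concrete, here is a minimal Python sketch of filtering and rolling up alerts. It is an illustration only: the alert fields, the sample data, the `ACTIONABLE` rule, and the `roll_up` function are all assumptions invented for this example, not the behavior of any particular product.

```python
from collections import defaultdict

# Hypothetical alert events, as they might arrive from several monitoring tools.
ALERTS = [
    {"source": "metrics", "service": "checkout", "severity": "critical", "message": "error rate 12%"},
    {"source": "uptime", "service": "checkout", "severity": "critical", "message": "health check failed"},
    {"source": "logs", "service": "checkout", "severity": "info", "message": "cache warmed"},
    {"source": "metrics", "service": "search", "severity": "warning", "message": "latency p99 900ms"},
]

# Assumed filter rule: only these severities are worth a responder's time.
ACTIONABLE = {"warning", "critical"}


def roll_up(alerts):
    """Drop unactionable alerts, then group the rest into one incident per service."""
    grouped = defaultdict(list)
    for alert in alerts:
        if alert["severity"] in ACTIONABLE:
            grouped[alert["service"]].append(alert)

    incidents = []
    for service, related in grouped.items():
        incidents.append({
            "service": service,
            "alert_count": len(related),
            "sources": sorted({a["source"] for a in related}),
            "messages": [a["message"] for a in related],
        })
    return incidents


if __name__ == "__main__":
    for incident in roll_up(ALERTS):
        print(incident["service"], incident["alert_count"], incident["sources"])
```

In this sketch, four raw alerts become two incidents, and the informational "cache warmed" event never reaches a responder. Each incident record is the kind of single, consolidated item that could then be handed on to your other tools.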
What other tools?
Service management and ticketing tools
Most organizations also rely on some kind of service management or ticketing system (or both). Service management and ticketing systems give support and development teams access to relevant data through a convenient, persistent interface. They collect and preserve information about issues, about people handling related tasks, about actions taken, and more.
And while a service management or ticketing system may “have no name,” it may also lack the functions you need for managing incidents. Such tools aren’t usually designed to immediately identify and notify the people who need to know about incidents. They don’t typically include powerful on-call scheduling, escalation, and communication features. They cannot routinely communicate directly with other tools, analyze information from other tools to consolidate multiple alerts into single-incident tickets, or synchronize actions and information with other tools over the course of an incident. In addition, they generally require incident responders to enter and update information manually, which takes time away from investigating and resolving incidents.
Finally, it has been said … all right, this is undoubtedly the first time that anyone other than me has ever been cheesy enough to say it quite this way …
Monitoring systems may be gold and ticketing systems steel, but “two links can’t make a chain.” You also need dragons … I mean, you also need a chain of integrations and templates and collaboration tools and communications and automations and intelligent rules [that don’t drink to excess or accidentally kill their … never mind] to combine them all.
What other kinds of tools do you need?
Collaboration tools and processes
Your responders need to be able to use the weapons they know: familiar technologies and processes to help them work together as easily as possible... chat operations and bi-directional integrations, video and conference bridges, status pages, and more.
Collaboration tools and processes can accelerate and facilitate the sharing of information and ideas, but they may not fully integrate with your other tools. Again, a good incident management tool can help you realize the full potential of your collaboration tools and processes.
- You can integrate chat operations tools! Alerts will be posted in chat channels. Your responders can access and share information and easily perform a variety of actions in chat channels. A powerful chat operations integration can both synchronize these actions with your other integrated tools and post updates (from other integrated tools) in your chat channels.
- You can also integrate voice and video conference bridge tools to set up meetings with the people who need to discuss an incident — and automatically send access information to them.
You want to know more about incident management tools?
Incident management tools
Incident management tools are platforms that help you plan for, manage, and track high-priority service interruptions (and similar issues). They also include powerful alerting and communications tools that can integrate with all your tools. They make sure that the right people are notified about incidents before your organization suffers losses.
Powerful incident management tools support many additional features. They let you design custom rules for analyzing and filtering incoming alert event information, they roll up multiple alert events into a single incident, and they enrich incident notifications with custom data, runbooks, notes, and attachments. They can provide sophisticated on-call, escalation, and notification features that reach the right people to respond to incidents, and other people who need to know, wherever they are. They can also automate communications tailored to different audiences, such as responders, managers and other organizational stakeholders, and customers and other public stakeholders.
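As a rough illustration of how an escalation feature might reason, the Python sketch below walks a simple escalation policy. The responder names, the timings, and the `notified_so_far` function are hypothetical, chosen only for this example; a real tool would layer on-call schedules and notification channels on top of logic like this.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    notify: str          # hypothetical responder or team name
    after_minutes: int   # fire this many minutes after the incident opens


# Assumed policy: primary on-call first, then secondary, then a manager.
POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("engineering-manager", 30),
]


def notified_so_far(policy, minutes_open, acked_at=None):
    """List who has been notified, given how long the incident has been open.

    A step fires once its delay has elapsed, unless the incident was
    acknowledged before that step came due (acknowledgement halts escalation).
    """
    notified = []
    for step in policy:
        due = step.after_minutes <= minutes_open
        halted = acked_at is not None and step.after_minutes >= acked_at
        if due and not (halted and step.after_minutes > 0):
            notified.append(step.notify)
    return notified


if __name__ == "__main__":
    print(notified_so_far(POLICY, minutes_open=12))
    print(notified_so_far(POLICY, minutes_open=45, acked_at=5))
```

The design choice worth noting is that acknowledgement stops the chain: if the primary responder acknowledges at minute 5, nobody further up is paged, however long the incident stays open.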
A key feature of good incident management tools is the ability to fully integrate and synchronize actions with other tools and solutions. Good incident management tools can support a comprehensive set of full-feature integrations that are easy to configure and use:
- Outbound, inbound, and bi-directional integrations.
- Email-based integrations.
- API integrations.
- Heartbeat integrations.
- Action mapping and synchronization features.
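Of the integration types listed above, heartbeats may be the least familiar. The usual idea is that a monitored job or system checks in on a regular interval, and silence is what raises the alarm. The Python sketch below shows that logic in miniature; the `HeartbeatMonitor` class, its method names, and the job names are assumptions made up for this example and are not any vendor's API.

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Tracks the last check-in per system and reports the ones that went quiet."""

    def __init__(self, interval):
        self.interval = interval   # maximum allowed silence between pings
        self.last_seen = {}        # system name -> time of most recent ping

    def ping(self, name, at):
        """Record a heartbeat from a system."""
        self.last_seen[name] = at

    def expired(self, now):
        """Return the systems whose last ping is older than the allowed interval."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.interval
        )


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    monitor = HeartbeatMonitor(interval=timedelta(minutes=5))
    monitor.ping("nightly-backup", start)
    monitor.ping("billing-sync", start + timedelta(minutes=8))
    # Ten minutes in, only the backup job has been silent for too long.
    print(monitor.expired(start + timedelta(minutes=10)))
```

Heartbeats are useful precisely because they catch the failures that produce no alert at all, such as a scheduled job that simply never ran.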
Customizing rules and templates to associate, analyze, and classify incident information
Another essential part of preparing for and automating incident response tasks is making tactical decisions about your battle formations. Good incident management tools have features that simplify decisions about the conditions and filters you want applied to incoming alerts when determining whether to roll them up into a single incident. They help you classify incidents according to impact, urgency, and priority level. That classification sets resolution workflows, initiates resolution processes, and determines the tasks needed for different types of incidents. For example:
- Assigning the correct specialists to handle an incident.
- Applying appropriate routing and escalation rules and policies.
- Reporting relevant management information.
- Implementing specific incident and communication templates.
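One way to picture how a classification drives those choices is a small lookup, sketched below in Python. Everything here is an assumption for illustration: the impact/urgency matrix, the priority labels, the team names, and the template names are invented, and every organization would define its own.

```python
# Assumed impact/urgency matrix; real organizations define their own.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "low"): "P2",
    ("low", "high"): "P2",
    ("low", "low"): "P3",
}

# Hypothetical response templates, one per priority level.
TEMPLATES = {
    "P1": {"assign_to": "sre-on-call", "escalate_after_min": 5,
           "notify_management": True, "comms_template": "major-incident"},
    "P2": {"assign_to": "service-owner", "escalate_after_min": 30,
           "notify_management": False, "comms_template": "degraded-service"},
    "P3": {"assign_to": "support-queue", "escalate_after_min": 240,
           "notify_management": False, "comms_template": "minor-issue"},
}


def classify(impact, urgency):
    """Map an impact/urgency pair to a priority level."""
    return PRIORITY_MATRIX[(impact, urgency)]


def plan_response(impact, urgency):
    """Return the priority plus the routing, escalation, and comms settings it implies."""
    priority = classify(impact, urgency)
    return {"priority": priority, **TEMPLATES[priority]}


if __name__ == "__main__":
    print(plan_response("high", "high"))
    print(plan_response("low", "high"))
```

Because the decisions live in data rather than in someone's head, they can be reviewed calmly in advance and applied instantly when an incident is declared.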
Incident templates can significantly reduce the effort of preparing for incidents and the time needed to perform common incident tasks. Because time is what you need most during an incident!
“Delay, you say. Move fast, I reply. This is no longer a game for two players.”
— Lord Varys, Game of Thrones.
Automate Incident Communications
In major incidents, everyone in an organization — and many outside the organization — may need to know what is happening: not just your DevOps, Operations, or Support teams. Defining and automating incident communications policies and channels for responders, stakeholders, and the public can be critical when an incident occurs. Advance decisions and templates can help automate the communication channels that your organization will use during an incident — eliminating a major source of delay in setting up virtual incident war rooms. Similarly, they can eliminate problems related to time-zone differences between responding teams!
A good incident management tool also supports features that let you dictate the frequency of incident communications for other stakeholders — before incidents occur. Communications templates can ensure that status pages or other communications are regularly updated. Even if an update only says that nothing has changed, you maintain customer and public trust when you regularly deliver news to managers, customers, and others who rely on you.
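A communications cadence like this can be written down as data long before anything breaks. The Python sketch below assumes a made-up cadence per audience and a made-up message template; the names `CADENCE`, `audiences_due`, and `render_update` are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta

# Assumed cadence per audience, decided before any incident occurs.
CADENCE = {
    "responders": timedelta(minutes=15),
    "managers": timedelta(minutes=30),
    "status-page": timedelta(minutes=60),
}

TEMPLATE = "[{time}] {service}: {status}"
NO_CHANGE = "No change since the last update; investigation continues."


def audiences_due(last_sent, now):
    """Return the audiences whose update interval has elapsed since their last message."""
    return sorted(
        audience for audience, interval in CADENCE.items()
        if now - last_sent[audience] >= interval
    )


def render_update(service, now, status=None):
    """Fill in the template, falling back to a 'no change' message when there is no news."""
    return TEMPLATE.format(
        time=now.strftime("%H:%M"),
        service=service,
        status=status or NO_CHANGE,
    )


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0)
    last_sent = {audience: opened for audience in CADENCE}
    now = opened + timedelta(minutes=30)
    for audience in audiences_due(last_sent, now):
        print(audience, "->", render_update("checkout", now))
```

Note the fallback message: when nobody supplies a status, the update still goes out on schedule and simply says that nothing has changed.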
Winning the Game
OpsGenie provides all that you need to marshal your allies, manage your troops, and win at the game of incident response management. We give you easy, flexible, and powerful integrations, features, and intelligence to defend your Houses against the destructive forces of incidents, along with dedicated, responsive support engineers who work 24x7 to make sure that you prevail.
So… “Incidents issi māzis. Issi ao ready?” Incidents are coming. Are you ready?
You can be! See how OpsGenie can help:
- Sign up for a free 14-day OpsGenie trial.
- Visit our Community forums to engage with like-minded professionals and share experiences about supporting complex systems and other incident management topics. We welcome your participation and feedback — and your opinion about the role OpsGenie should play in the upcoming incident equivalent of Game of Thrones!
- Explore our DevOps Playground: a sandbox environment where realistic simulations let you try out OpsGenie integrations with leading tools.