You’ve heard the stories and the legends. You may even have had a long, incident-free summer, but you’re a veteran: you know that a long summer can mean a much longer winter. And you know how the cold and exposure of a long winter can burn both your products and your organization.
As the saying goes, or, as our Valyrian translator tells me the saying goes, “Incidents issi māzis.” Incidents are coming… What can you do?
You can prepare for incidents before the next one happens. To win this game, you need to marshal your allies to fortify your House’s defenses in advance and to hasten your response when incidents arrive.
To fortify your defenses, you need to organize your allies. You need to integrate your monitoring tools, service management and ticketing tools, teaming and collaboration tools, and a strong incident management tool.
Nowadays, every House… that is, every organization… employs tools to watch over its networks, systems, information, applications, and other IT assets. Such monitoring tools can provide vital information about system health and events. [You thought I was going to say weather, didn’t you?] These “little birds” can be valuable tools, but they can’t guarantee that you will recognize and resolve incidents before your products and services (or your organization) suffer incident-related losses.
Monitoring tools can send you alerts about false alarms, minor events that will resolve themselves, and other unnecessary information. Having to review and investigate every alert event from every one of your monitoring tools can bury your DevOps, Ops, or Support teams under an avalanche of unactionable alerts. This takes vital time that they need for investigating and resolving real problems. Storms of unactionable alerts can also blind your Watchers — so that they miss alerts that reflect important outages. It can even exhaust your forces: a Watch can become so long and hard that even the best of your force may be tempted to cry out, “My watch is ended.”
A good incident management tool can help you by parsing and filtering alert event information from all your monitoring tools and by combining it all into a single, actionable incident that integrates with your other tools.
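To make the idea concrete, here is a minimal Python sketch of filtering alerts and rolling them up into one incident. It is an illustration only, not how any particular product works: the `Alert` fields, the `is_actionable` rule, the tool names, and the grouping key (service plus check) are all assumptions invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real monitoring tools each use their own fields.
    source: str    # which monitoring tool sent the alert
    service: str   # affected service
    check: str     # what was being checked
    severity: str  # "info", "warning", or "critical"


def is_actionable(alert):
    # Assumed filter rule: purely informational alerts are treated as noise.
    return alert.severity in {"warning", "critical"}


def roll_up(alerts):
    """Group actionable alerts by (service, check) into incident records."""
    groups = defaultdict(list)
    for alert in alerts:
        if is_actionable(alert):
            groups[(alert.service, alert.check)].append(alert)

    incidents = []
    for (service, check), members in groups.items():
        worst = "critical" if any(a.severity == "critical" for a in members) else "warning"
        incidents.append({
            "service": service,
            "check": check,
            "alert_count": len(members),
            "sources": sorted({a.source for a in members}),
            "severity": worst,
        })
    return incidents


alerts = [
    Alert("raven-monitor", "checkout", "latency", "warning"),
    Alert("little-bird", "checkout", "latency", "critical"),
    Alert("little-bird", "search", "disk", "info"),
]
incidents = roll_up(alerts)
```

In this sketch, the two checkout latency alerts from different tools become one incident, and the informational disk alert is dropped before it reaches anyone on Watch.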
What other tools?
Most organizations also rely on some kind of service management tool or ticketing system (or both). Service management and ticketing systems allow support and development teams to access relevant data from a convenient, persistent interface. They collect and preserve information about issues, about people handling related tasks, about actions taken, and more.
And while a service management or ticketing system may “have no name,” it may also lack some of the functions you need for managing incidents. Such tools aren’t usually designed to immediately identify and notify the people who need to know about incidents. They don’t typically include powerful on-call scheduling, escalation, and communication features. They cannot routinely communicate directly with other tools, analyze information from other tools to consolidate multiple alerts into single-incident tickets, or synchronize actions and information with other tools over the course of an incident. In addition, they generally require incident responders to enter and update information manually, which takes time away from investigating and resolving incidents.
Finally, it has been said — alright… this is undoubtedly the first time that anyone other than me has ever been cheesy enough to say it quite this way ...
Monitoring systems may be gold and ticketing systems steel, but “two links can't make a chain.” You also need dragons... I mean, you also need a chain of integrations and templates and collaboration tools and communications and automations and intelligent rules [that don’t drink to excess or accidentally kill their... never mind] to combine them all.
What other kinds of tools do you need?
Your responders need to be able to use the weapons they know: familiar technologies and processes to help them work together as easily as possible... chat operations and bi-directional integrations, video and conference bridges, status pages, and more.
Collaboration tools and processes can accelerate and facilitate the sharing of information and ideas, but they may not fully integrate with your other tools. Again, a good incident management tool can help you realize the full potential of your collaboration tools and processes.
You want to know more about incident management tools?
Incident management tools are platforms that help you plan for, manage, and track high-priority service interruptions (and similar issues). They also include powerful alerting and communications tools that can integrate with all your tools. They make sure that the right people are notified about incidents before your organization suffers losses.
Powerful incident management tools support many additional features. They support the design of custom rules for analyzing and filtering incoming alert event information, roll up related alert events into a single incident, and enrich incident notifications with custom data, runbooks, notes, and attachments. They can provide sophisticated on-call, escalation, and notification features that reach the right people to respond to incidents (and other people who need to know) wherever they are. They can automate a wide variety of communications for different audiences, such as responders, managers and other organizational stakeholders, and customers and other public stakeholders.
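As a rough illustration of how escalation can work, the Python sketch below models a policy as a list of timed steps and works out who should have been notified so far. The `EscalationStep` type, the `contacts_notified` helper, the role names, and the delays are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    # Hypothetical step: who to notify, and how long after the incident opens.
    notify: str
    after_minutes: int


def contacts_notified(steps, minutes_open, acknowledged_at=None):
    """Return the contacts whose escalation step has fired so far.

    Escalation stops once someone acknowledges the incident, so steps
    scheduled after the acknowledgement time never fire.
    """
    cutoff = minutes_open if acknowledged_at is None else min(minutes_open, acknowledged_at)
    ordered = sorted(steps, key=lambda step: step.after_minutes)
    return [step.notify for step in ordered if step.after_minutes <= cutoff]


policy = [
    EscalationStep("on-call engineer", 0),
    EscalationStep("team lead", 10),
    EscalationStep("engineering manager", 30),
]

unacknowledged = contacts_notified(policy, minutes_open=12)
acknowledged_early = contacts_notified(policy, minutes_open=45, acknowledged_at=5)
```

Here an incident left unacknowledged for 12 minutes has reached the on-call engineer and the team lead, while one acknowledged after 5 minutes never escalates past the on-call engineer.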
A key feature of good incident management tools is the ability to fully integrate and synchronize actions with other tools and solutions. Good incident management tools can support a comprehensive set of full-feature integrations that are easy to configure and use.
Another essential part of preparing for and automating incident response tasks is making tactical decisions about your battle formations. Good incident management tools have features to simplify decisions about the conditions and filters you want applied to incoming alerts in determining whether to roll them up into a single incident. They help you classify incidents according to impact, urgency, and priority level, a choice that sets resolution workflows, initiates resolution processes, and determines the tasks needed for different types of incidents.
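One common way to turn impact and urgency into a priority level is a simple matrix. The Python sketch below is an assumption made for illustration, not a rule from any specific tool: the three-level scale, the P1 to P5 labels, and the workflow descriptions are all invented here.

```python
LEVELS = ("low", "medium", "high")

# Hypothetical workflows keyed by priority; a real team would define its own.
WORKFLOWS = {
    "P1": "page on-call, open war room, update status page",
    "P2": "page on-call",
    "P3": "notify team channel",
    "P4": "create ticket",
    "P5": "create ticket",
}


def priority(impact, urgency):
    """Combine impact and urgency into a priority from P1 (worst) to P5."""
    # Each level scores 0, 1, or 2, so the combined score ranges from 0 to 4.
    score = LEVELS.index(impact) + LEVELS.index(urgency)
    return f"P{5 - score}"


def classify(impact, urgency):
    """Return the priority and the workflow it triggers."""
    level = priority(impact, urgency)
    return level, WORKFLOWS[level]


worst_case = classify("high", "high")
mixed_case = classify("high", "low")
quiet_case = classify("low", "low")
```

Deciding this mapping in advance means that nobody has to debate priorities while the castle is burning; the classification picks the workflow for you.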
Incident templates can significantly reduce the effort of preparing for incidents and the time needed to perform common incident tasks. Because time is what you need most during an incident!
“Delay, you say. Move fast, I reply. This is no longer a game for two players.”
— Lord Varys, Game of Thrones.
In major incidents, everyone in an organization — and many outside the organization — may need to know what is happening: not just your DevOps, Operations, or Support teams. Defining and automating incident communications policies and channels for responders, stakeholders, and the public can be critical when an incident occurs. Advance decisions and templates can help automate the communication channels that your organization will use during an incident — eliminating a major source of delay in setting up virtual incident war rooms. Similarly, they can eliminate problems related to time-zone differences between responding teams!
A good incident management tool also supports features that let you dictate the frequency of incident communications for other stakeholders — before incidents occur. Communications templates can ensure that status pages or other communications are regularly updated. Even if an update only says that nothing has changed, you maintain customer and public trust when you regularly deliver news to managers, customers, and others who rely on you.
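A fixed update cadence can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `next_update_message` helper, the 30-minute interval, the sample timestamp, and the wording of the default message are hypothetical.

```python
from datetime import datetime, timedelta

DEFAULT_MESSAGE = "No change since the last update. We are still investigating."


def next_update_message(last_update, now, interval, latest_news=None):
    """Return the message to publish, or None if no update is due yet."""
    if now - last_update < interval:
        return None
    # Even with nothing new to report, publish something on schedule.
    return latest_news or DEFAULT_MESSAGE


last_update = datetime(2024, 1, 1, 12, 0)  # sample timestamp for illustration
interval = timedelta(minutes=30)

too_soon = next_update_message(last_update, last_update + timedelta(minutes=10), interval)
no_news = next_update_message(last_update, last_update + timedelta(minutes=30), interval)
with_news = next_update_message(
    last_update, last_update + timedelta(minutes=45), interval, "Fix deployed, monitoring."
)
```

The point of the default message is that the schedule, not the amount of news, decides when stakeholders hear from you.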
Opsgenie provides all that you need to marshal your allies, manage your troops, and win at the game of incident response management. We give you easy, flexible, and powerful integrations, features, and intelligence to defend your Houses against the destructive forces of incidents, along with dedicated, responsive support engineers who work 24x7 to make sure that you prevail.
So… “Incidents issi māzis. Issi ao ready?” Incidents are coming. Are you ready?
You can be! See how Opsgenie can help: