We’ve recently added support for “escalations” in OpsGenie. Escalation typically means notifying different users at different times until someone sees and acknowledges the alert, or until the problem is resolved and the alert is closed. If the user who is notified first resolves the problem, or determines that it is not urgent, other users don’t have to be notified. Because escalations initially notify only a subset of users, they can be quite useful in reducing “alert (notification) noise” while still ensuring alerts don’t fall through the cracks. OpsGenie supports both “rule-based” and “ad-hoc” escalations. You can create escalation rules that specify who should be notified and when; you can then use an escalation rule as the recipient of an alert instead of specifying users or groups directly. For example, the following escalation rule would notify user “fili” as soon as the alert is created, and if the alert is not acknowledged within 10 minutes, OpsGenie would notify the members of the “web_team” group.
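To make the timing concrete, here is a minimal Python sketch of the rule described above. It is only an illustration of the escalation logic: the names `EscalationStep`, `RULE`, and `recipients_to_notify` are hypothetical and are not OpsGenie's API or configuration format. The sketch also assumes that an acknowledgement at or before a step's delay stops that step from firing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    """One step of a hypothetical escalation rule."""
    delay_minutes: int  # minutes after the alert is created
    recipient: str      # user or group name


# Hypothetical model of the example rule: "fili" immediately,
# then the "web_team" group if nobody acknowledges within 10 minutes.
RULE = [
    EscalationStep(delay_minutes=0, recipient="fili"),
    EscalationStep(delay_minutes=10, recipient="web_team"),
]


def recipients_to_notify(rule, minutes_since_creation, acknowledged_at=None):
    """Return the recipients notified so far, in escalation order.

    A step fires once its delay has elapsed, unless the alert was
    acknowledged at or before that point (assumed semantics).
    """
    notified = []
    for step in sorted(rule, key=lambda s: s.delay_minutes):
        if step.delay_minutes > minutes_since_creation:
            break  # this step and all later ones are not due yet
        if acknowledged_at is not None and acknowledged_at <= step.delay_minutes:
            break  # someone took ownership, so escalation stops here
        notified.append(step.recipient)
    return notified
```

With this model, an alert acknowledged five minutes after creation only ever reaches “fili”, while an unacknowledged alert also reaches “web_team” at the ten-minute mark.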
Data generated by monitoring systems can support operational processes in different ways, and I think it’s useful to know the distinction between the two core uses:
Mathias (@roidrage) of Travis CI has an excellent blog post on operating a hosted product and the role of alerting. It’s a good read for anyone who works in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve.
At OpsGenie, our goals are highly relevant to the topics discussed in the post. We provide alert and notification management tools that enable ops teams to manage the entire alert life cycle: everything that happens from the moment an alert is generated until the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:
Nagios is an open source IT infrastructure monitoring tool that offers monitoring and alerting for servers, switches, applications, and services. OpsGenie is an alert and notification management service that is highly complementary to Nagios. The OpsGenie Nagios integration leverages the Nagios notification system to forward alerts to OpsGenie (either via email or API) and notify users via iPhone/Android push notifications, email, SMS, and phone calls. Many OpsGenie users already take advantage of the integration. So what does OpsGenie have to offer Nagios users?