We’ve recently added support for “escalations” in OpsGenie. Escalations typically refer to notifying different users at different times until the alert is seen and processed (acknowledged) by someone, or the problem is resolved and the alert is closed. If the user who gets notified first resolves the problem, or determines that the problem is not urgent, other users don’t have to be notified. Since escalations allow notifying only a subset of the users for alerts initially, they can be quite useful in reducing “alert (notification) noise” while still ensuring alerts don’t fall through the cracks. OpsGenie supports both “rules-based” and “ad-hoc” escalations. You can create escalation rules that specify who should be notified when; you can then use the escalation rule as the recipient of an alert, instead of specifying users or groups directly. For example, the following escalation rule would notify user “fili” as soon as the alert is created, and if the alert is not acknowledged within 10 minutes, OpsGenie would notify the members of the “web_team” group.
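To make the timing logic of such a rule concrete, here is a minimal sketch in Python. It is not OpsGenie's configuration format or API: the `EscalationStep` class, the `RULE` list, and the `recipients_to_notify` function are hypothetical names made up for illustration, and the sketch assumes that an acknowledgement at or before a step's delay suppresses that step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    """One step of a hypothetical escalation rule (not an OpsGenie type)."""
    delay_minutes: int  # minutes after alert creation when this step comes due
    recipient: str      # user or group name to notify


# Hypothetical rule mirroring the example in the text:
# notify "fili" immediately, then "web_team" after 10 minutes.
RULE = [
    EscalationStep(delay_minutes=0, recipient="fili"),
    EscalationStep(delay_minutes=10, recipient="web_team"),
]


def recipients_to_notify(rule, minutes_since_created, acknowledged_at=None):
    """Return the recipients whose step has come due without being suppressed.

    acknowledged_at is the number of minutes after creation at which the
    alert was acknowledged, or None if it has not been acknowledged.
    """
    due = []
    for step in rule:
        if step.delay_minutes > minutes_since_created:
            continue  # this step is not due yet
        if acknowledged_at is not None and acknowledged_at <= step.delay_minutes:
            continue  # acknowledged in time, so this step is skipped
        due.append(step.recipient)
    return due
```

With this sketch, `recipients_to_notify(RULE, 12)` includes both “fili” and “web_team”, while `recipients_to_notify(RULE, 12, acknowledged_at=4)` includes only “fili”, because the acknowledgement arrived before the second step came due.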
Data generated by monitoring systems can be used to support operational processes in different ways, and I think it’s useful to know the distinction between the two core uses:
Mathias (@roidrage) of Travis CI has an excellent blog post on the operations of a hosted product and the role of alerting. It’s a good read for anyone who is in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve.
At OpsGenie, our goals are highly relevant to the topics discussed in the post. We provide alert and notification management tools that enable ops teams to manage the entire alert life cycle: everything that happens from the moment an alert is generated until the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:
Nagios is an open source IT infrastructure monitoring tool that offers monitoring and alerting for servers, switches, applications, and services. OpsGenie is an alert and notification management service that is highly complementary to Nagios. The OpsGenie Nagios integration leverages the Nagios notification system to forward alerts to OpsGenie (either via email or API) and notify users via iPhone/Android push notifications, email, SMS, and phone calls. Many OpsGenie users are already taking advantage of the integration. So what does OpsGenie have to offer Nagios users?
Most operations teams use a number of disparate monitoring tools (and services) to monitor their technology infrastructure, network, systems, applications, etc. These monitoring tools all have some degree of alerting. They can generate alerts when they detect problems and can send alert notifications via email, etc. Yet alerting, particularly what happens after an alert is generated, differs significantly between tools.
Operations folks at Etsy said it best with “measure anything, measure everything”. Metric (aka time series) data collection, visualization, and alerting are essential operations management capabilities. We need to be able to track not only systems metrics such as CPU and memory utilization, but also (even more so) application and business metrics such as response times, number of transactions, etc.
I’ve been thinking about the impact of “cloudification” of technology infrastructure on IT operations management, and particularly on monitoring. Unfortunately, every time I want to write about something, I feel like I need to write about a lot of other things first, just to provide the context. Monitoring as a discipline covers a surprisingly vast area. What I wanted to write about was the management/monitoring capabilities needed to manage production applications running on (private or public) server instances provided as a service (aka IaaS). I’ll refer to this as “managing applications on the cloud” for brevity, and hope that it does not cause too much confusion.
IBM Tivoli Netcool is the most common event management solution used by operations, particularly in large enterprises and service providers (events are called alerts in OpsGenie terminology). Since Netcool is used to collect and consolidate events from many event sources into a central repository, it makes sense to integrate OpsGenie with Netcool to add the capability to notify users about events that are important to them.