Data generated by monitoring systems can support operational processes in different ways, and I think it’s useful to know the distinction between the two core uses:
Mathias (@roidrage) of Travis CI has an excellent blog post on operating a hosted product and the role of alerting. It’s a good read for anyone who works in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve.
At OpsGenie, our goals are highly relevant to the topics discussed in the post. We provide alert & notification management tools that enable ops teams to manage the entire alert life cycle: everything that happens from the moment an alert is generated until the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:
Nagios is an open source IT infrastructure monitoring tool that offers monitoring and alerting for servers, switches, applications, and services. OpsGenie is an alert and notification management service that is highly complementary to Nagios. The OpsGenie Nagios integration leverages the Nagios notification system to forward alerts to OpsGenie (either via email or API) and notify users via iPhone/Android push notifications, email, SMS, and phone calls. Many OpsGenie users already take advantage of the integration. So what does OpsGenie have to offer Nagios users?
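To make the forwarding step a little more concrete, here is a minimal sketch of how a Nagios notification command might turn Nagios macros into an alert payload before handing it to an API. It assumes Nagios is configured to export macros as `NAGIOS_*` environment variables. The function name `build_alert_payload` and the payload field names (`message`, `alias`, `description`, `source`) are illustrative assumptions, not the actual integration or a confirmed API schema. The sketch only builds the payload and does not send it anywhere.

```python
import os


def build_alert_payload(env=None):
    """Map Nagios notification macros to a generic alert payload.

    Hypothetical helper: the payload field names are illustrative and
    are not taken from any real API documentation.
    """
    env = os.environ if env is None else env
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    service = env.get("NAGIOS_SERVICEDESC", "")

    if service:
        # Service notification: use the service state and plugin output.
        state = env.get("NAGIOS_SERVICESTATE", "UNKNOWN")
        output = env.get("NAGIOS_SERVICEOUTPUT", "")
        target = f"{host}/{service}"
    else:
        # Host notification: no service description is set.
        state = env.get("NAGIOS_HOSTSTATE", "UNKNOWN")
        output = env.get("NAGIOS_HOSTOUTPUT", "")
        target = host

    kind = env.get("NAGIOS_NOTIFICATIONTYPE", "PROBLEM")
    return {
        "message": f"{kind}: {target} is {state}",
        "alias": target,  # stable key so repeat notifications can be grouped
        "description": output,
        "source": "nagios",
    }
```

Using the host and service name as a stable key is one possible design choice: it would let a later recovery notification be matched to the alert that the original problem opened.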
Most operations teams use a number of disparate monitoring tools (and services) to monitor their technology infrastructure, network, systems, applications, etc. These monitoring tools all have some degree of alerting: they can generate alerts when they detect problems and send alert notifications via email and other channels. Yet alerting, particularly what happens after an alert is generated, differs significantly between tools.
The operations folks at Etsy said it best with “measure anything, measure everything”. Metric (aka time series) data collection, visualization, and alerting are essential operations management capabilities. We need to be able to track not only system metrics such as CPU and memory utilization, but also, and even more importantly, application and business metrics such as response times, number of transactions, etc.
I’ve been thinking about the impact of the “cloudification” of technology infrastructure on IT operations management, and particularly on monitoring. Unfortunately, every time I want to write about something, I feel like I need to write about a lot of other things first, just to provide the context. Monitoring as a discipline covers a surprisingly vast area. What I wanted to write about was the management/monitoring capabilities needed to manage production applications running on (private or public) server instances provided as a service (aka IaaS). I’ll refer to this as “managing applications on the cloud” for brevity, and hope that it does not cause too much confusion.
IBM Tivoli Netcool is the most common event management solution (events are called alerts in OpsGenie terminology) used by operations teams, particularly in large enterprises and service providers. Since Netcool is used to collect and consolidate events from many event sources into a central repository, it makes sense to integrate OpsGenie with Netcool to add the capability to notify users of events that are important to them.
OpsGenie empowers users to control how they are notified. One of the available features is quiet hours. If a user specifies quiet hours, OpsGenie does not send notifications to that user during those hours. This feature is typically used by users who would normally like to be notified when something goes wrong but do not want to be woken up in the middle of the night unless they have to be. But what if, for some alerts, they do want to be notified at any hour?
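The basic quiet hours behavior described above comes down to a time-window check. The following is a minimal sketch of that check, not OpsGenie’s actual implementation. The function names `in_quiet_hours` and `should_notify` are hypothetical, and the sketch assumes the window is given as a start and end time of day in the user’s local time. The one subtle part is a window that wraps past midnight, such as 22:00 to 07:00.

```python
from datetime import time


def in_quiet_hours(now, start, end):
    """Return True if `now` falls inside the quiet window [start, end).

    All arguments are datetime.time values in the user's local time.
    """
    if start == end:
        # Assumption: identical bounds mean no quiet hours are configured.
        return False
    if start < end:
        # Same-day window, e.g. 12:00 to 14:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end


def should_notify(now, start, end):
    """Suppress notifications only while the user is in quiet hours."""
    return not in_quiet_hours(now, start, end)
```

For example, with quiet hours from 22:00 to 07:00, `should_notify(time(3, 0), time(22, 0), time(7, 0))` is `False`, while the same call at `time(12, 0)` is `True`.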