This post continues the series that takes a deeper look at how OpsGenie can be used to alleviate alert fatigue. One of the key capabilities of OpsGenie is letting users control how they are notified about different alerts at different times.
Just a quick announcement about the recently launched “news” site. OpsGenie gets improvements every week, and we wanted a medium to share these improvements, even the small ones.
This post continues the series that takes a deeper look at how OpsGenie can be used to alleviate alert fatigue. The mute, acknowledge all, and close all actions were specifically designed for situations where excessive alerting can hinder operations.
The concept of alert fatigue is well known in industries such as healthcare, and awareness is increasing in IT operations as well. Fighting alert fatigue has been a key design objective for OpsGenie since the beginning. The previous post summarized some of the key capabilities OpsGenie provides that can be used to alleviate alert fatigue. In a series of posts, I will discuss in more detail how these features can be used to improve the alert signal-to-noise ratio.
OpsGenie is a sponsor of Amazon’s re:Invent conference, and we’re excited to be part of it. Looking at the list of sponsors and the list of sessions, this is going to be a very high-quality event.
Alerting is largely a signal-to-noise ratio problem - catching critical problems while trying not to drown in the sea of data. Put another way, we don’t want to miss any critical problems and we don’t want too many alert notifications.
OpsGenie strives to improve the lives of the alert recipients. So, let’s take a look at how OpsGenie does its part to tackle this formidable challenge:
At the last DevOpsDC meetup, the speaker was Robert Treat (@robtreat2), COO of OmniTI, and the subject was “Less alarming alerts”. OmniTI is an interesting company as they both implement large-scale solutions and operate Circonus, a monitoring-as-a-service solution, so the presentation was bound to be interesting, and it did not disappoint.
Erik Budin of ScienceLogic has a great blog post that describes the integration of ScienceLogic with (our competitor) PagerDuty. Kudos to both parties for coming up with a well thought out, bi-directional integration that goes well beyond the alerting integration supported by many of the monitoring solutions in the market! We believe that to be able to truly enable operations teams to work effectively, monitoring and alerting integration needs to be much richer than just forwarding alerts. Hence, it’s good to see this type of effort implemented and described in detail. Erik starts the blog post with a real-world scenario that has become possible with the integrated solution:
OpsGenie client apps were long overdue for an update. The latest release (version 1.5) of the OpsGenie apps (iPhone/iPad/Android/HTML5) includes many usability improvements based on the feedback OpsGenie users have been providing. Here is a list of some of the more visible updates:
In universities around the world, teachers spend most of their time in the classroom doing what amounts to a monologue. Sure, some students may ask questions, and there may be some interaction, but most students don’t. And even when they do, the time available for questions and discussion is often very limited.
In 2013, we announced the Campfire integration via callbacks. Campfire callbacks allow OpsGenie users to push alert activity to Campfire chatrooms as messages.
A couple of weeks ago, we announced direct integration with HipChat. We’ve been continuing to work on extending OpsGenie callback capabilities.
In operations, most of the time no news is good news. If we’re not receiving alerts from monitoring systems about problems, we tend to assume that all is well with the world. But what if we’re not receiving alerts because some part of our monitoring solution has not been working for days or even weeks? If you’ve ever found out about a problem with the monitoring systems after being asked why there was no alert for a particular problem, you know what I’m talking about. If you’re supporting a web-based application or service, chances are you’re employing a monitoring service to monitor the availability of your application from the outside, preferably from multiple locations. At OpsGenie we take advantage of external services to monitor the availability of the OpsGenie web UI, as well as the API end points. External web monitoring enables us to find out quickly when there is a problem with OpsGenie. In addition, OpsGenie has supported what we call “heartbeat monitoring” since the beginning. Heartbeat monitoring enables OpsGenie users to send OpsGenie periodic heartbeat messages. Heartbeat monitoring serves multiple purposes:
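The underlying idea, treating the absence of an expected heartbeat as a problem in its own right, can be sketched in a few lines of Python. This is only an illustration: the function name, the `grace` parameter, and the five-minute interval are assumptions made for the example and do not describe how OpsGenie implements heartbeat monitoring.

```python
from datetime import datetime, timedelta


def heartbeat_expired(last_heartbeat, now, expected_interval, grace=timedelta(0)):
    """Return True when no heartbeat has arrived within the expected window.

    All parameters are hypothetical: `expected_interval` is how often the
    monitored system promises to check in, and `grace` is extra slack
    allowed before the silence is treated as a problem.
    """
    return now - last_heartbeat > expected_interval + grace


# A monitoring system that should check in every five minutes.
last_seen = datetime(2013, 1, 1, 12, 0)
interval = timedelta(minutes=5)

on_time = heartbeat_expired(last_seen, datetime(2013, 1, 1, 12, 4), interval)
overdue = heartbeat_expired(last_seen, datetime(2013, 1, 1, 12, 6), interval)
```

Here `on_time` is False and `overdue` is True, so only the second check would be a reason to raise an alert about the monitoring system itself.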
OpsGenie is fundamentally an alert router for operations teams. It receives alerts from operations management systems via email or API, and notifies the right people using the defined rules. OpsGenie also supports "callbacks", and can forward alert activity to external systems via webhooks. Every time an alert is created, acknowledged, commented on, or closed, or when an action is executed by a user, OpsGenie makes a web request to the URL specified in the webhook configuration. The web request includes a subset of the alert data in the body of the request in JSON format. The passed data includes the alert message, as well as the alertId and alias fields, which can be used to retrieve the rest of the alert data via the OpsGenie Alert API. OpsGenie users can configure callbacks to be triggered for all alert data or can define matching rules to forward only a subset of alerts. Webhooks provide a very flexible way to export the alert data that is aggregated in OpsGenie, and are used in many different ways. Some example uses we’ve seen include:
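As a rough illustration of what a receiving system might do with such a callback, the following Python sketch parses a JSON request body and pulls out the identifying fields. The `alertId` and `alias` field names come from the description above; the flat payload layout and the `action` and `message` keys are assumptions made for the example, so check the actual webhook payload before relying on them.

```python
import json


def parse_callback(body):
    """Extract the identifying fields from a callback request body.

    Assumes a flat JSON object; the "action" and "message" keys are
    hypothetical, while "alertId" and "alias" are named in the text.
    """
    payload = json.loads(body)
    return {
        "action": payload.get("action"),
        "alert_id": payload.get("alertId"),
        "alias": payload.get("alias"),
        "message": payload.get("message"),
    }


# A made-up request body, shaped the way this sketch assumes.
sample_body = json.dumps({
    "action": "Acknowledge",
    "alertId": "abc-123",
    "alias": "disk-full-web01",
    "message": "Disk usage above 90% on web01",
})

parsed = parse_callback(sample_body)
```

The `alert_id` or `alias` value in `parsed` is what a receiver would then pass to the Alert API to fetch the rest of the alert data.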
Not all alerts are created equal, nor should they be treated as such! Some alerts are critical and urgent, and we want to receive notifications immediately using any and all notification methods; others can wait till the morning, or an email may be sufficient, etc. We find that it is as important for an alert notification system NOT to wake you up unnecessarily as it is to ensure you wake up when it’s necessary. OpsGenie now puts the user in full control. Users can decide how to get notified for different alerts based on the alert data and the time of day.
Schedules and escalations are out of beta
After a two-month beta period, the on-call schedules, rotations, and escalations features have come out of beta and are available to all Pro and Enterprise level subscribers. Several usability improvements have been rolled out based on the feedback we’ve received during the beta process. Thanks for all the feedback!
It is safe to say that monitoring tools and services universally support sending email alerts. Hence, not surprisingly, creating alerts in OpsGenie via email is the most common integration method used by OpsGenie users. Based on the feedback we’ve received from OpsGenie users, we’ve enhanced email integration capabilities to make it both easier and more flexible.
We’ve recently added support for “escalations” in OpsGenie. Escalations typically refer to notifying different users at different times until the alert is seen and processed (acknowledged) by someone, or the problem is resolved and the alert is closed. If the user who gets notified first resolves the problem, or determines the problem is not urgent, etc., other users don’t have to be notified. Since escalations allow notifying only a subset of the users for alerts initially, they can be quite useful in reducing “alert (notification) noise” while still ensuring alerts don’t fall through the cracks. OpsGenie supports both “rules based” and “ad-hoc” escalations. You can create escalation rules that specify who should be notified when. You can then use the escalation rule as the recipient of an alert, instead of specifying users or groups directly. For example, the following escalation rule would notify user “fili” as soon as the alert is created, and if the alert is not acknowledged within 10 minutes, OpsGenie would notify the members of the “web_team” group.
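The behavior of such a rule can be modeled in a few lines of Python. This is a sketch of the logic only, not OpsGenie’s rule syntax or implementation: the `EscalationStep` class and the `notified_recipients` function are hypothetical names, and only the “fili” and “web_team” recipients and the 10 minute delay come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    delay_minutes: int  # minutes after alert creation
    recipient: str      # user or group to notify


# The example rule: "fili" immediately, "web_team" after 10 minutes.
RULE = [EscalationStep(0, "fili"), EscalationStep(10, "web_team")]


def notified_recipients(steps, minutes_elapsed, acknowledged_at=None):
    """List who has been notified so far under this simplified model.

    `acknowledged_at` is the minute at which the alert was acknowledged,
    or None if nobody has acknowledged it yet.
    """
    notified = []
    for step in steps:
        if step.delay_minutes > minutes_elapsed:
            continue  # this step is not due yet
        # Steps with no delay fire at creation, before anyone could
        # acknowledge; later steps are skipped once the alert is acknowledged.
        if (acknowledged_at is not None
                and step.delay_minutes > 0
                and acknowledged_at <= step.delay_minutes):
            continue
        notified.append(step.recipient)
    return notified


unacknowledged = notified_recipients(RULE, minutes_elapsed=10)
acknowledged_early = notified_recipients(RULE, minutes_elapsed=30, acknowledged_at=5)
```

With the alert still unacknowledged at the 10 minute mark, both “fili” and “web_team” are in the list; when “fili” acknowledges after 5 minutes, “web_team” is never notified, which is the noise reduction the rule is meant to provide.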
Data generated by monitoring systems can be used to support operational support processes in different ways, and I think it’s useful to know the distinction between the two core uses:
Mathias (@roidrage) of Travis CI has an excellent blog post on operations of a hosted product and the role of alerting. It’s a good read for anyone who is in operations or would like to understand operations better. In the post, he describes not only what they are currently doing but also the challenges they face, as well as his thoughts on what they will need to do to improve.
At OpsGenie, our goals are highly relevant to the topics discussed in the post. We provide alert & notification management tools to enable ops teams to manage the entire alert life cycle: what happens after an alert is generated until the problem is resolved. Since we also operate a hosted service that needs to be up and running at all times, and deal with many of the same challenges mentioned in the post, I wanted to add my 3.1415 cents as well:
Nagios is an open source IT infrastructure monitoring tool that offers monitoring and alerting for servers, switches, applications, and services. OpsGenie is an alert and notification management service that is highly complementary to Nagios. OpsGenie Nagios integration leverages the Nagios notification system to forward alerts to OpsGenie (either via email or API) and notify users via iPhone/Android push notifications, email, SMS, and phone calls. There are already many OpsGenie users taking advantage of the integration. So what does OpsGenie have to offer for Nagios users?
Most operations teams use a number of disparate monitoring tools (and services) to monitor the technology infrastructure, network, systems, applications, etc. These monitoring tools all have some degree of alerting. They can generate alerts when they detect problems and can send alert notifications via email, etc. Yet alerting, particularly what happens after an alert is generated, differs significantly between tools.
Operations folks at Etsy said it best with “measure anything, measure everything”. Metric (aka time series) data collection, visualization, and alerting are essential operations management capabilities. We need to be able to track not only systems metrics such as CPU and memory utilization, but also (even more so) application and business metrics such as response times, number of transactions, etc.
I’ve been thinking about the impact of “cloudification” of technology infrastructure on IT operations management, and particularly on monitoring. Unfortunately, every time I want to write about something, I feel like I need to write about a lot of other things first, just to provide the context. Monitoring as a discipline covers a surprisingly vast area. What I wanted to write about was the management/monitoring capabilities needed to manage production applications running on (private or public) server instances provided as a service (aka IaaS). I’ll refer to this as “managing applications on the cloud” for brevity, and hope that it does not cause too much confusion.
IBM Tivoli Netcool is the most common event (alerts in OpsGenie terminology) management solution used by operations, particularly in large enterprises and service providers. Since Netcool is used to collect and consolidate events from many event sources into a central repository, it makes sense to integrate OpsGenie with Netcool to add the capability to notify users for events that are important to them.
OpsGenie empowers users to control how they are notified. One of the available features is quiet hours. If the user specifies quiet hours, OpsGenie does not send notifications to the user during these hours. This feature is typically used by users who would normally like to be notified when something goes wrong but do not want to be woken up in the middle of the night unless they have to be. But what if, for some alerts, they do want to be notified at any time?
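To make the quiet hours idea concrete, here is a minimal Python sketch of the check involved, including a window that spans midnight. The function names and the `bypass_quiet_hours` flag are hypothetical: the flag stands in for whatever mechanism marks an alert as important enough to get through, and is not a description of OpsGenie’s actual settings.

```python
from datetime import time


def in_quiet_hours(now, start, end):
    """Return True if `now` falls inside the quiet window [start, end)."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end


def should_notify(now, start, end, bypass_quiet_hours=False):
    """Decide whether to notify; the bypass flag is an assumed override."""
    return bypass_quiet_hours or not in_quiet_hours(now, start, end)


quiet_start, quiet_end = time(22, 0), time(7, 0)

routine_at_night = should_notify(time(23, 30), quiet_start, quiet_end)
critical_at_night = should_notify(time(23, 30), quiet_start, quiet_end,
                                  bypass_quiet_hours=True)
```

In this model a routine alert at 23:30 is held back (`routine_at_night` is False), while an alert flagged to bypass quiet hours still goes out (`critical_at_night` is True).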
- Timely delivery of notifications via methods like email and SMS is not guaranteed. Carriers offer SMS delivery as “best effort” and delivery times can vary. OpsGenie allows users to use multiple methods so that they are not dependent on a single method. Note that this does not mean users will get multiple notifications, since once the user views the alert, OpsGenie stops sending notifications for that alert through other notification methods.
- A combination of these methods ensures the widest coverage, enabling OpsGenie to notify anyone who has a computer or a phone.
- Different notification methods have different strengths and weaknesses.
Amazon CloudWatch provides monitoring for Amazon Web Services (AWS) and the applications that make use of AWS. There are many alternatives for collecting resource utilization metrics from EC2 instances; however, when AWS services like ELB, RDS, DynamoDB, SQS, etc. are used, CloudWatch metrics play a critical role in monitoring the applications running on the AWS cloud. One of the key capabilities of the CloudWatch service is alarms. A CloudWatch alarm can watch a single metric over a specified time period and execute automated actions based on the value of the watched metric and a given threshold. The automated action may be sending emails, calling HTTP/S end points, etc.
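The threshold logic behind an alarm can be illustrated with a small Python model. This is a deliberate simplification written for this post, not the CloudWatch API: the `alarm_state` function and its parameters are hypothetical, it only handles a "greater than threshold" comparison, and the real service offers more options than shown here. The three state names match the ones CloudWatch reports.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a simplified 'greater than threshold' alarm.

    `datapoints` holds one metric value per period, oldest first. The
    alarm fires only when every one of the most recent
    `evaluation_periods` values breaches the threshold.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# CPU utilization (%) sampled once per period; alarm above 80% for 3 periods.
sustained_load = alarm_state([45.0, 85.0, 91.0, 88.0], threshold=80.0,
                             evaluation_periods=3)
brief_spike = alarm_state([45.0, 95.0, 50.0, 48.0], threshold=80.0,
                          evaluation_periods=3)
```

Here `sustained_load` evaluates to "ALARM" and `brief_spike` to "OK". Requiring several consecutive breaching periods is one way to keep short spikes from triggering the automated action.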
As Software as a Service (SaaS) solutions continue to make inroads into the enterprise, integration among disparate SaaS solutions is becoming necessary, as has been the case with on-premise applications. Zapier, a SaaS offering itself, is tackling this problem. Zapier provides a platform and an intuitive web-based user interface to integrate various web applications. There are already almost 90 applications that can be integrated via Zapier, and we’ve already found a number of use cases to integrate various tools such as Trello and HipChat.
IT Ops folks have been using electronic devices for notifications for decades. It started with pagers on our belts, and pagers got more sophisticated over time.
Alphanumeric pagers followed numeric ones that could only display a phone number, and two-way pagers with tiny keyboards followed them. Pagers are still used by some operations folks but have largely been replaced by mobile phones, thanks to the text messaging capabilities available on almost any mobile phone. IT operations processes largely use email as the main communication method to notify users when an action is required and rely on short text messages (SMS) when there is some urgency.
EMC Smarts (Ionix) Service Assurance Manager (SAM) “tools” enable operators to execute custom actions from the Smarts console interactively, and “escalation policies” enable implementation of automated responses to problems detected by Smarts root-cause analysis engines. OpsGenie is a cloud-based service that provides rich alert notifications and mobile response capabilities.
Leveraging Smarts tools and escalation policies, OpsGenie extends Smarts’ root-cause analysis capabilities to mobile users. When Smarts detects a critical problem that requires attention, OpsGenie notifies the users through multiple notification channels (SMS, mobile push, voice, etc.), and enables the recipients to view the alert directly from their mobile devices. Here is how it works:
Starting with version 4.2, Splunk provides alerting not only by polling and running searches on a scheduled basis but also in real time. In the previous blog post, I discussed the benefits of integrating Splunk and OpsGenie. In this post, I'll go over the use case of sending Splunk alerts to an iPhone via push notifications as an example. Here are the steps:
Splunk is fast establishing itself as one of the must-have tools for IT operations. Organizations use Splunk to consolidate machine data into a single searchable repository. Splunk provides an easy-to-use interface that allows users to analyze and correlate the collected data. And with the latest release, Splunk now has alerting capabilities where alerts can be generated for saved searches in real time.
OpsGenie leverages Splunk alerting and extends Splunk's capabilities to mobile devices, making operational insights derived from Splunk available to users even when they are mobile. When Splunk detects an incident that requires attention, OpsGenie notifies the users through multiple notification channels, and enables users to view the alert directly from their mobile devices. Here is how it works:
OpsGenie has a simple Web API for interacting with OpsGenie from any programming language that can make web requests. Today, we've released lamp, a command line utility to do the same. Lamp uses the OpsGenie Web API under the hood and provides capabilities to create & close alerts, attach files, etc. easily from shell scripts. Lamp is a Java application, and hence works on any platform that has a JVM.