Reduce Noise in Your DevOps Toolchain

In an ideal world, your DevOps toolchain would be highly automated for incident management and allow your teams to resolve issues at DevOps speed. An alert triggered by monitoring tools like Datadog or AWS CloudWatch would notify on-call engineers, kick your collaboration tools into gear (ChatOps, Statuspage, etc.), and automatically document the issue in your ITSM and ticketing tools.

The tools themselves support this type of automation, but without the right controls in place you’ll just flood your DevOps toolchain with noise.

The alerts generated by monitoring tools generally fall into four categories:

False positive - a condition in your application or infrastructure met an alert threshold, but the situation doesn’t actually require any action. This could be caused by overly sensitive alert rules or by machine learning that hasn’t yet built a good model of your environment.

Change needed - the alert is legitimate but can be handled with a change request to provision more resources or install a patch. No services are currently impacted, and normal change procedures can address the situation.

Known issue - the alert is related to a problem that was already identified and logged in your problem management tools. Again, no services are currently impacted. It simply needs to be logged for the dev team assigned to the issue.

Incident - the alert indicates a service disruption or performance degradation that is going to affect the service users.
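
To make the four categories concrete, here is a minimal sketch (in Python) of what an automated triage step could look like. The field names and inputs (service, problem_id, suggested_action, and the service_status lookup) are hypothetical placeholders for this example, not part of any particular monitoring product:

```python
from enum import Enum, auto

class AlertCategory(Enum):
    FALSE_POSITIVE = auto()
    CHANGE_NEEDED = auto()
    KNOWN_ISSUE = auto()
    INCIDENT = auto()

def categorize(alert, known_issue_ids, service_status):
    """Hypothetical triage logic mirroring the four categories above."""
    # Incident: the affected service is disrupted or degraded for its users.
    if service_status.get(alert["service"]) in ("degraded", "down"):
        return AlertCategory.INCIDENT
    # Known issue: no current impact, and the problem is already logged.
    if alert.get("problem_id") in known_issue_ids:
        return AlertCategory.KNOWN_ISSUE
    # Change needed: legitimate, but a normal change request covers it.
    if alert.get("suggested_action") in ("provision", "patch"):
        return AlertCategory.CHANGE_NEEDED
    # Anything left is treated as a false positive: tune the rule or the model.
    return AlertCategory.FALSE_POSITIVE
```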

Only alerts in the fourth category should activate the incident response toolchain. Unfortunately, many teams have a manual process for reviewing the alerts and categorizing them. This is a time-intensive process that has several negative consequences.

  • On-call engineers are expensive, and their time is wasted on alerts that don’t require immediate action.
  • Response times for real incidents suffer because it takes longer to identify and act on incidents hidden in the noise.
  • When faced with a flood of noise, human nature is to start tuning it out altogether, which can lead to missed incidents or skipped steps such as logging incidents in your ITSM tools.

The solution is to automate alert categorization before notifying responders and the rest of the DevOps toolchain. The highly flexible Opsgenie rules engine is a core feature of our Essentials and Standard plans. It allows you to tag, route, escalate, and prioritize alerts based on their source, their keywords and content, the time of day, and a number of other characteristics.
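
As a rough illustration of the kinds of conditions a rules engine evaluates, here is a plain-Python sketch. It is not Opsgenie’s rule syntax; the rule names, fields, and team queues are invented for the example:

```python
from datetime import time

# Illustrative rules only: the structure and field names are hypothetical,
# not Opsgenie's configuration format.
ROUTING_RULES = [
    {
        "name": "checkout-errors-business-hours",
        "match": lambda a: a["source"] == "Datadog"
        and "checkout" in a["message"].lower()
        and time(9, 0) <= a["received_at"].time() <= time(17, 0),
        "actions": {
            "priority": "P1",
            "tags": ["checkout", "customer-facing"],
            "route_to": "payments-oncall",
        },
    },
    {
        "name": "disk-usage-warning",
        "match": lambda a: "disk usage" in a["message"].lower(),
        "actions": {
            "priority": "P4",
            "tags": ["capacity"],
            "route_to": "infra-queue",
        },
    },
]

def apply_rules(alert):
    """Tag, prioritize, and route an alert using the first matching rule."""
    for rule in ROUTING_RULES:
        if rule["match"](alert):
            alert.update(rule["actions"])
            break
    return alert
```

In this sketch, alerts that match no rule simply pass through with their default handling, which mirrors the goal: only the alerts that matter escalate.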

Earlier this year we extended these capabilities in our Enterprise plan with the introduction of service-aware incident management features. By organizing your alerts around services, you can make direct connections between alerts and incidents that impact service delivery. This allows your team to focus on the alerts that need immediate response and keeps your collaboration tools from being flooded with noise.

We’ve long supported integrations with ITSM tools, but now Opsgenie aggregates alerts that impact services, so you can address a single consolidated incident and create just one ticket in your tracking tools.

For example, Opsgenie automatically opens incidents in Jira Ops if, and only if, alerts in Opsgenie meet the criteria of a service-impacting incident. Alerts that don’t meet your criteria will still be handled through your routing policies, but they won’t result in a Jira Ops entry. This keeps your incident tracking and collaboration tools like Slack and Statuspage from being inundated.
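
Here is a hedged sketch of that aggregation pattern, in plain Python rather than the actual Opsgenie-to-Jira Ops integration. The alert fields and the create_ticket callback are assumptions made for the example:

```python
def process_alerts(alerts, open_incidents, create_ticket):
    """Group service-impacting alerts per service so each incident yields
    a single tracking ticket instead of one ticket per alert."""
    for alert in alerts:
        # Non-impacting alerts still follow normal routing policies,
        # but they never open an incident or a ticket.
        if alert.get("priority") not in ("P1", "P2"):
            continue
        service = alert["service"]
        incident = open_incidents.get(service)
        if incident is None:
            # First service-impacting alert for this service: open one incident
            # and create exactly one ticket in the tracking tool (e.g. Jira Ops).
            incident = {
                "service": service,
                "alerts": [],
                "ticket": create_ticket(service),
            }
            open_incidents[service] = incident
        # Later alerts attach to the existing incident, keeping ITSM, Slack,
        # and Statuspage from being flooded with duplicates.
        incident["alerts"].append(alert)
    return open_incidents
```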


This new incident workflow is currently available for Jira, Jira Ops, and Zendesk, and it will roll out to our other ITSM integrations soon.