Keith Smith joined Imagine Learning as Principal Site Reliability Engineer after years in the DevOps space. He was familiar with various incident monitoring tools, including OpsGenie. Imagine Learning had many tools in place, but it lacked consolidation and effective alerting.
“[At the time] the on-call team only got alert messages via email—it was stupid, there was so much noise. I would get up each night at 1 am, look at my phone and go back to bed. I set out to say there is a better way.”
Due to all the noise, the alerts weren’t meaningful and they weren’t actionable. The process was completely reactive and teams were left without an efficient way to communicate during incidents.
“Support call volume would go up, which indicated a problem, and then the rep in support would escalate it. But that was the only chain of communication: the customer would tell us something was wrong and then we would fix it.”
Additionally, before Imagine Learning adopted OpsGenie, the on-call team was not always the team that could fix the issue, so people were woken up in the middle of the night for no reason.
With over 20 tools and applications to manage, OpsGenie’s ability to integrate with their IT stack was key to quieting the noise.
“Every time I have wanted to connect a source to OpsGenie, there has been a path — even if just using a webhook.”
Deep integrations with Slack and JIRA mean Imagine Learning now has an automated process. OpsGenie updates the status page, creates a JIRA ticket, sends a Slack notification, and wakes the right people up at the right time.
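The shape of that automated process can be illustrated with a minimal sketch. This is not Imagine Learning's configuration or the OpsGenie API: the `Alert` fields, the `OWNERS` map, the team names, and the rule that only P1 and P2 alerts page someone are all assumptions made for illustration, and each step returns a description of the action instead of calling a real service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    service: str   # system that raised the alert (hypothetical field)
    message: str   # human-readable summary
    priority: str  # assumed scale: "P1" (most urgent) through "P5"


# Hypothetical ownership map: which team can actually fix each service.
OWNERS: Dict[str, str] = {"login": "identity-team", "reports": "data-team"}
DEFAULT_TEAM = "sre-team"


def update_status_page(alert: Alert) -> str:
    return f"status page: investigating {alert.service} issue"


def create_ticket(alert: Alert) -> str:
    return f"ticket: [{alert.priority}] {alert.message}"


def notify_chat(alert: Alert) -> str:
    return f"chat: {alert.priority} alert on {alert.service}"


def page_on_call(alert: Alert) -> str:
    # Page the team that owns the service, not whoever happens to be awake.
    team = OWNERS.get(alert.service, DEFAULT_TEAM)
    return f"page: {team}"


# Assumed policy: every alert gets a ticket and a chat message; only
# urgent alerts touch the status page and wake someone up.
ALWAYS: List[Callable[[Alert], str]] = [create_ticket, notify_chat]
URGENT_ONLY: List[Callable[[Alert], str]] = [update_status_page, page_on_call]


def handle(alert: Alert) -> List[str]:
    """Run every step that applies to the alert and return what was done."""
    steps = list(ALWAYS)
    if alert.priority in ("P1", "P2"):
        steps += URGENT_ONLY
    return [step(alert) for step in steps]
```

Calling `handle(Alert("login", "login failing", "P1"))` returns four actions, ending with a page to the hypothetical `identity-team`, while a P4 alert produces only a ticket and a chat message. Keeping the routing in one table is what lets a low-priority alert be recorded without waking anyone.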
“Beyond a faster MTTR, the biggest thing we gain is the communication piece, telling our customers what’s going on and the 500 people in our offices across the country [and world] as soon as an incident hits.”
Sharing the on-call schedule and only getting woken up when necessary has enabled Keith to diversify his work and to reduce response time from 24-36 hours to 15 minutes or less.
OpsGenie enabled Keith to create an efficient incident management and on-call process that reduced MTTR and also improved his team’s quality of life. For a company providing a software product, resolving an issue quickly is vital. Within 3 months of using OpsGenie, there was a 900% reduction in incident volume.
“Now we have maybe one major incident every year. Work is becoming more fun. I’m able to sleep at night and it has freed up my time to work on more valuable and interesting projects.”