As long as our applications are in production, maximizing uptime and avoiding outages is the highest priority for developers and operations teams alike. Yet despite great care, achieving 100% uptime is a challenge for even the most rigorous DevOps teams. Imagine that one of your data centers stops responding and, in turn, your email service goes down completely, or your payment service goes offline on Black Friday. Remember the AWS outage in April 2011 that lasted four days and affected a large number of cloud services? It is a good reminder that outages happen even in the most robust environments. Now what? Are you going to comb through huge log files to find out what went wrong? Are you going to notify all of your operations teams and developers at the same time to investigate the cause? Unless you allocate substantial resources to chaos engineering, as Netflix does, you will most likely have very limited time to resolve the issue. So those aren’t realistic options for most organizations.
I’ve spent many years implementing traditional enterprise IT operations management tools, and integrations among those tools are often the Achilles’ heel of management systems. Integrating applications is frequently a high-risk endeavor for customers. Enterprise vendors typically charge tens of thousands of dollars for integration “plugins”, and the implementation requires highly skilled (and expensive) engineers. To make matters worse, enterprise vendors are often not keen on collaborating with their competitors, even when doing so would help their customers. Some vendors go as far as blocking integration efforts. I’ve witnessed a vendor refuse to sell its product to a competitor just to prevent that competitor from integrating with it (how is that for putting the customer first?).