Incident Response with AWS Systems Manager


A typical DevOps on-call engineer responds to alerts, triages them by service impact, troubleshoots high-priority incidents, and takes action to remediate issues. Automation tools like AWS Systems Manager can take over much of the repetitive work, allowing engineers to focus on the most important tasks.

With the new Automation Action feature in Opsgenie, you can now integrate directly with AWS Systems Manager so on-call responders can quickly execute automation plays in AWS without leaving the Opsgenie console or mobile app. You can even configure action policies to auto-trigger the response when new alerts meet your criteria, reducing alert fatigue and improving MTTR.

AWS Systems Manager (SSM) is a management console for AWS cloud resources. It replaced the EC2 Systems Manager in late 2017 and added the ability to manage a wider range of AWS services. SSM contains a set of tools that can be very useful for DevOps and SRE teams that respond to alerts and incidents.

This is a quick rundown of three of the SSM tools: Run Command, Automation, and Parameter Store.


Run Command

The Run Command feature provides a remote management alternative to SSH, RDP, or PowerShell Remoting. During incident response, it lets you execute commands on an AWS instance as if you were a local admin, without opening an interactive session.

The advantages over SSH include:

  • Security - Commands can be restricted by user using IAM policies. There’s no need to enable remote access, which removes a potential vulnerability. There are also no SSH keys to manage and keep secure.

  • Ease-of-use - There is no need for a bastion host or for logging into multiple systems before accessing the target instance. Run Command uses simple JSON-based documents, including those developed by AWS and the user community, so no specialized skills are required.

  • Auditability - All commands are logged in AWS CloudTrail, which can be monitored for anomalous activity and retained for compliance audits.
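To make the security point concrete, here is a minimal sketch of an IAM policy that lets a user send Run Command requests only to instances carrying a particular tag. The function name and the `Environment`/`staging` tag are assumptions chosen for illustration, and the resource ARNs use wildcards for brevity; a real policy would be scoped to your own regions and accounts.

```python
import json


def build_run_command_policy(tag_key: str, tag_value: str) -> dict:
    """Build an IAM policy that allows ssm:SendCommand only on tagged instances.

    The tag key and value are supplied by the caller; nothing here is
    specific to any one account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Only the AWS-managed shell script document may be used.
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": "arn:aws:ssm:*:*:document/AWS-RunShellScript",
            },
            {
                # Only instances whose tag matches may be targeted.
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringLike": {f"ssm:resourceTag/{tag_key}": [tag_value]}
                },
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical tag used purely as an example.
    print(json.dumps(build_run_command_policy("Environment", "staging"), indent=2))
```

The resulting JSON can be attached to an IAM user, group, or role in the usual way.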

Any troubleshooting or remediation steps that are normally performed by connecting to the box can now be performed securely and easily with Run Command. 
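As a rough sketch of what this looks like in practice, the snippet below builds the arguments for the SSM `send_command` API call using the AWS-managed `AWS-RunShellScript` document. The helper names, the instance ID, and the example commands are assumptions for illustration; the actual call needs boto3 and AWS credentials, so it is kept in a separate function.

```python
from typing import Dict, List


def build_run_command_request(
    instance_ids: List[str], commands: List[str], comment: str = ""
) -> Dict:
    """Build keyword arguments for the SSM send_command call."""
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    if not commands:
        raise ValueError("at least one command is required")
    request: Dict = {
        "InstanceIds": list(instance_ids),
        "DocumentName": "AWS-RunShellScript",  # AWS-managed document
        "Parameters": {"commands": list(commands)},
    }
    if comment:
        request["Comment"] = comment
    return request


def send(request: Dict) -> str:
    """Send the command and return its ID (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    response = boto3.client("ssm").send_command(**request)
    return response["Command"]["CommandId"]


if __name__ == "__main__":
    # Hypothetical instance ID and troubleshooting commands.
    request = build_run_command_request(
        ["i-0123456789abcdef0"],
        ["nslookup example.com", "df -h"],
        comment="On-call triage",
    )
    print(request)
```

Keeping the request builder separate from the API call makes it easy to review or log exactly what will be run before anything touches the instance.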

Automation

The Automation feature helps you automate common or repetitive tasks, freeing up DevOps resources for higher-value work like optimizing code and building more fault-tolerant infrastructure.

Workflows are written as steps in an Automation Document, a JSON- or YAML-formatted file that can invoke AWS Lambda functions or the Run Command feature described above. Built-in action types allow you to interact with AWS resources like EC2 instances and CloudFormation stacks. Automation can be quite useful for diagnosing and responding to incidents.
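To give a feel for the format, here is a minimal sketch of an Automation Document built as a Python dictionary and printed as JSON. It combines a Run Command step with two built-in instance-state steps. The step names, description, and diagnostic commands are assumptions made up for this example, and a production document would likely also specify an IAM role for the automation to assume.

```python
import json


def build_restart_document() -> dict:
    """Build a small Automation Document: collect diagnostics, then stop and start."""
    instance = ["{{ InstanceId }}"]  # resolved from the document parameter at run time
    return {
        "schemaVersion": "0.3",  # schema version used by Automation documents
        "description": "Collect basic diagnostics, then restart an EC2 instance.",
        "parameters": {
            "InstanceId": {
                "type": "String",
                "description": "ID of the instance to restart.",
            }
        },
        "mainSteps": [
            {
                # Run Command step: capture state before the restart.
                "name": "collectDiagnostics",
                "action": "aws:runCommand",
                "inputs": {
                    "DocumentName": "AWS-RunShellScript",
                    "InstanceIds": instance,
                    "Parameters": {"commands": ["uptime", "df -h"]},
                },
            },
            {
                "name": "stopInstance",
                "action": "aws:changeInstanceState",
                "inputs": {"InstanceIds": instance, "DesiredState": "stopped"},
            },
            {
                "name": "startInstance",
                "action": "aws:changeInstanceState",
                "inputs": {"InstanceIds": instance, "DesiredState": "running"},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_restart_document(), indent=2))
```

The steps run in order, so the diagnostics are captured before the instance is stopped.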

Let’s say your on-call engineer receives a CloudWatch alert indicating an issue. Using SSM Automation, there are all kinds of response plays that could be automated.

  • Stop, start, or restart an EC2 instance while applying a software update or bug fix

  • Deploy an AMI golden image if you detect configuration drift

  • Run shell commands to check DNS resolution issues or run a traceroute

  • Run the EC2Rescue tool, an all-purpose troubleshooting tool created by AWS that automatically detects a list of common configuration issues and attempts to correct them

  • Retrieve inventory data, config change records, and more for the instance that triggered the CloudWatch alert

  • And much more...
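As an illustration of the first play in the list, the sketch below prepares a request for the SSM `start_automation_execution` API using the AWS-managed `AWS-RestartEC2Instance` document. The helper names and the instance ID are assumptions for this example, and the call itself requires boto3, credentials, and suitable IAM permissions.

```python
from typing import Dict


def build_restart_execution_request(instance_id: str) -> Dict:
    """Build keyword arguments for start_automation_execution."""
    if not instance_id.startswith("i-"):
        raise ValueError(f"not an EC2 instance ID: {instance_id!r}")
    return {
        "DocumentName": "AWS-RestartEC2Instance",  # AWS-managed Automation document
        # Automation parameters map each name to a list of string values.
        "Parameters": {"InstanceId": [instance_id]},
    }


def start(request: Dict) -> str:
    """Start the automation and return its execution ID (requires boto3)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    response = boto3.client("ssm").start_automation_execution(**request)
    return response["AutomationExecutionId"]


if __name__ == "__main__":
    # Hypothetical instance ID taken from an alert.
    print(build_restart_execution_request("i-0123456789abcdef0"))
```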


Parameter Store

The Parameter Store feature is a centralized store that can be used for configuration data and passwords. This allows you to keep parameters separate from your code and share the parameters with your Lambda functions, SSM Automation documents, and other AWS services.

AWS Key Management Service (KMS) can be used to encrypt sensitive parameter values (stored as SecureString parameters) so that secret data is protected. AWS IAM can be used to restrict access to the parameters to authorized users and services.
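Here is a small sketch of storing and reading an encrypted parameter. It builds the arguments for the SSM `put_parameter` and `get_parameter` API calls; the parameter name, value, and helper names are assumptions for illustration, and the read helper needs boto3 and credentials to run.

```python
from typing import Dict, Optional


def build_put_secret_request(
    name: str, value: str, kms_key_id: Optional[str] = None
) -> Dict:
    """Build keyword arguments for put_parameter with an encrypted value."""
    request: Dict = {
        "Name": name,
        "Value": value,
        "Type": "SecureString",  # value is encrypted with KMS
        "Overwrite": True,
    }
    if kms_key_id:
        # Without KeyId, the account's default KMS key for SSM is used.
        request["KeyId"] = kms_key_id
    return request


def build_get_secret_request(name: str) -> Dict:
    """Build keyword arguments for get_parameter, asking for the decrypted value."""
    return {"Name": name, "WithDecryption": True}


def read_secret(name: str) -> str:
    """Fetch and decrypt a parameter (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3 installed

    response = boto3.client("ssm").get_parameter(**build_get_secret_request(name))
    return response["Parameter"]["Value"]


if __name__ == "__main__":
    # Hypothetical parameter name and placeholder value.
    print(build_put_secret_request("/prod/db/password", "example-only"))
    print(build_get_secret_request("/prod/db/password"))
```

A Lambda function or Automation document would read the same parameter by name, so the secret never has to live in code.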


Automated Alert Response

Many Opsgenie users are already centralizing alerts from AWS CloudWatch and other application monitoring tools, and utilizing Opsgenie schedules, escalations and routing rules to make sure the right people are alerted at the right time. Now you can automate the alert response using AWS SSM Automation documents to troubleshoot or take corrective action.

SSM Automation documents can be triggered automatically by Opsgenie when alerts match your predefined policies. This helps enable “self-healing” systems, where issues are resolved without the need to notify on-call engineers or create a flood of tickets.
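The idea of matching alerts against policies can be sketched in a few lines of Python. This is purely illustrative and is not Opsgenie's configuration format or API: the `ActionPolicy` class, its fields, and the example policy are all hypothetical, and it assumes alerts carry a priority from P1 (most severe) to P5 and a list of tags.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ActionPolicy:
    """Hypothetical policy: which alerts should trigger which SSM document."""

    name: str
    required_tags: FrozenSet[str]
    min_priority: str  # least severe priority that still matches, e.g. "P2"
    document_name: str


def priority_rank(priority: str) -> int:
    """Convert 'P1'..'P5' to 1..5, where a lower number is more severe."""
    return int(priority[1:])


def select_document(alert: Dict, policies: List[ActionPolicy]) -> Optional[str]:
    """Return the document of the first matching policy, or None."""
    tags = set(alert.get("tags", []))
    rank = priority_rank(alert.get("priority", "P3"))
    for policy in policies:
        severe_enough = rank <= priority_rank(policy.min_priority)
        if severe_enough and policy.required_tags <= tags:
            return policy.document_name
    return None


if __name__ == "__main__":
    policies = [
        ActionPolicy(
            name="restart-unresponsive-ec2",
            required_tags=frozenset({"ec2", "unresponsive"}),
            min_priority="P2",
            document_name="AWS-RestartEC2Instance",
        )
    ]
    alert = {"priority": "P1", "tags": ["ec2", "unresponsive", "prod"]}
    print(select_document(alert, policies))
```

An alert that matches no policy returns `None`, which corresponds to falling back to the normal notification and escalation flow.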

If you prefer to review the alerts before triggering automation, responders can investigate the alert details and then execute an action directly from Opsgenie, without logging into another platform and suffering from the “swivel-chair effect”.

We’re excited to announce that the beta version of these features is now available. Please click here to learn more and sign up for early access, and don’t forget to visit us at AWS re:Invent 2018!