Monitoring AWS Lambda Functions with OpsGenie

by Serhat Can, Jan 9, 2017

Monitoring is an important part of DevOps. You need to monitor your servers, containers, and now your serverless functions. Function as a Service (FaaS), also called “Serverless” architecture, is a relatively new concept. AWS Lambda, an event-driven serverless computing service, is by far the most popular FaaS/Serverless solution on the market.


AWS Lambda eliminates most of the problems you experience when you manage your own servers such as:

  • Scaling
  • Capacity utilization
  • Operating system and language runtime updates
  • Metrics or logging

In this post, we will look at the logging aspect of the AWS Lambda service at a high level. AWS Lambda uses AWS CloudWatch for logging. All log statements and exceptions that your functions produce are written to AWS CloudWatch. In CloudWatch, you can create a metric alarm that sends an SNS message whenever the metric value crosses the threshold you defined.

For AWS Lambda, there are four types of metrics: Invocations, Errors, Duration, and Throttles. You can find more information about Lambda metric alarms here.

In this post, we will focus on the “Errors” metric, which is incremented by events such as:

  • Handled exceptions (e.g., context.fail(error))
  • Unhandled exceptions causing the code to exit
  • Out of memory exceptions
  • Timeouts
  • Permissions errors

Now, let’s look at the problems we faced at OpsGenie when monitoring our own Lambda functions. The goal is to get notified with actionable information whenever an error occurs in our Lambda functions, which we monitor as carefully as our servers. However, we ran into the following issues:

Issue 1: Creation of necessary alarms. When you create a new Lambda function, no alarms are generated by default. We needed a way to create alarms automatically with the right metrics.

Issue 2: No actionable data. When an “Errors” metric alarm is triggered, a message is sent to an SNS topic. The data CloudWatch puts in that SNS message does not help much, because it does not explain the root cause of the exception. We only get the alarm’s name, description, state history, and metric information from these messages. There is no indication of the real problem, although it exists in the logs.

There are common solutions to these problems. They are reasonable, but we decided not to use them:

For Issue 1:

  a) CloudWatch alarms can be created as part of the deployment of a Lambda function. When we upload the function to AWS, we can create the necessary alarms; however, to reduce the complexity of the codebase, we decided to keep this out of the deployment process.
  b) Another solution is to use the AWS CloudTrail service, which writes logs of AWS Lambda API calls to Amazon S3. We would need to listen for those logs in S3, parse them, and act accordingly. This approach may become feasible if, in the future, we can receive the related CloudTrail logs directly in AWS Lambda. For now, it seems unnecessary.

For Issue 2:

  a) One way to solve the issue is to use our CloudWatch integration. However, with this solution we would still be missing the related logs that help developers see the real reason behind the problem.
  b) Another alternative is to create a Lambda function that queries the logs of our Lambda functions and creates an alert if it encounters an error pattern. This function can be executed periodically by using a CloudWatch Events schedule trigger. Unfortunately, this solution has performance problems because it queries all logs, and it tends to create redundant alerts or miss the real problems because it only looks for patterns within the logs.

OpsGenie solution to monitor Lambda functions

To address these issues, we implemented two new Lambda functions that run together. Illustrated below are the main building blocks of the solution, along with code samples. (The code is in Java, but it should be straightforward to implement the same approach in other Lambda-supported languages.)

Two Lambda functions to automate monitoring and alerting:

  1. “CreateLambdaCloudWatchAlarm” function: A Lambda function that periodically creates/updates CloudWatch alarms.
  2. “MonitorCloudWatchLambdaErrors” function: A Lambda function that receives messages from an SNS topic when a CloudWatch alarm threshold is reached. It then creates an alert in OpsGenie with the related data.

[Figure: Monitoring AWS Lambda functions with OpsGenie]

Before writing the real functions

Create the SNS topic and the necessary IAM role for each function. (Note that you may need to add additional actions if you use other services, such as S3, to get secret keys.)

  1. Create an SNS topic named “lambdaAlarm”
  2. Create “CreateLambdaCloudWatchAlarm” function’s IAM role
  3. Create “MonitorCloudWatchLambdaErrors” function’s IAM role

CreateLambdaCloudWatchAlarm function

This function creates or updates CloudWatch metric alarms for all Lambda functions every X minutes. The alarms are configured to send an SNS message to the lambdaAlarm topic if Errors > 0 for 60 seconds.

What is the benefit of this function? You no longer have to create your alarms manually. Yeah!

The limitation of this function is that, because it is executed periodically, newly created functions may not get their CloudWatch alarms immediately. To create the alarms sooner, you can decrease the interval or trigger the function manually.

Important:

  • After deploying the code, automatically or manually, you need to create a CloudWatch Events trigger so that your Lambda function can be triggered automatically. For example, you can execute this function every 30 minutes by using the CloudWatch Events Trigger.
  • We use an “snsClient” to get the lambdaAlarm topic’s Amazon Resource Name (ARN). The topic’s ARN is set as the alarm action, meaning that CloudWatch will send a message to this topic by using the ARN.
  • We use lambdaClient to retrieve all Lambda function descriptions so that we can use each function’s name as the value of the FunctionName dimension. The FunctionName dimension scopes a CloudWatch alarm to the given function.
  • We use cloudWatchClient to create the metric alarms. This is the function’s real job.
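
As a rough illustration of the alarm settings described above, here is a minimal sketch in plain Java with no AWS SDK dependency. The class LambdaErrorAlarmSpec, its field names, and the "-Errors" alarm naming convention are hypothetical and are not taken from the original sample code. The "AWS/Lambda" namespace, the "Errors" metric, and the FunctionName dimension are the values CloudWatch uses for Lambda metrics. In a real function, each spec would be translated into a put-metric-alarm call on the CloudWatch client.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value object: the settings of one "Errors > 0 for 60 seconds" alarm. */
final class LambdaErrorAlarmSpec {
    final String alarmName;                                    // assumed naming: <function>-Errors
    final String namespace = "AWS/Lambda";                     // CloudWatch namespace for Lambda metrics
    final String metricName = "Errors";
    final String statistic = "Sum";                            // total errors within one period
    final String comparisonOperator = "GreaterThanThreshold";
    final double threshold = 0.0;                              // alarm when Errors > 0
    final int periodSeconds = 60;
    final int evaluationPeriods = 1;
    final String functionName;                                 // value of the FunctionName dimension
    final String alarmActionArn;                               // ARN of the lambdaAlarm SNS topic

    LambdaErrorAlarmSpec(String functionName, String alarmActionArn) {
        this.functionName = functionName;
        this.alarmActionArn = alarmActionArn;
        this.alarmName = functionName + "-Errors";
    }

    /** Builds one alarm spec per Lambda function name, all pointing at the same SNS topic. */
    static List<LambdaErrorAlarmSpec> forFunctions(List<String> functionNames, String topicArn) {
        List<LambdaErrorAlarmSpec> specs = new ArrayList<>();
        for (String name : functionNames) {
            specs.add(new LambdaErrorAlarmSpec(name, topicArn));
        }
        return specs;
    }
}
```

Because the alarm name is derived from the function name, running the function again would update the existing alarm for each function instead of creating a second one, assuming alarms are created or overwritten by name.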

Sample code CreateLambdaCloudWatchAlarm function


This is what automatically created CloudWatch alarms will look like:

MonitorCloudWatchLambdaErrors function

This function receives events from the lambdaAlarm SNS topic. The SNSEvent object contains all the data necessary to create an explanatory alert in OpsGenie. The function also automatically retrieves data from the logs and adds it to the alert as details. This additional log data helps developers see what occurred before they dig into the log details in CloudWatch.


Automatically retrieved logs are very important as they will provide a quick overview of the error’s reason for the alert.

Important

  • After deploying the code, automatically or manually, you need to add an SNS event trigger to your Lambda function (add a subscription to the lambdaAlarm topic).
  • The received SNSEvent object’s data is stored in its message field as a String, so we parse it into an object called SnsMessage to work with the data more easily.
  • The Lambda function’s name is in the SnsMessage object’s dimensions field. We extract it to build the log group name (“/aws/lambda/” + lambdaFunctionName) for CloudWatch Logs.
  • We use awsLogClient to query the related Lambda function’s logs. To get the related logs, they are filtered based on the error pattern and time: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html.
  • All SnsMessage fields and the retrieved logs are added to the OpsGenie alert details as key-value pairs.
  • The alert’s alias field is set to the Lambda function’s name to avoid duplicate alerts in OpsGenie.
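
The dimension lookup and log group naming described above can be sketched in plain Java as follows. AlarmMessageHelper and its methods are hypothetical names, not part of the original sample. The sketch assumes the dimensions have already been parsed into maps with "name" and "value" keys, and the lookback window for the log query is an illustrative assumption, since the post does not state the time range it uses.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for working with a parsed CloudWatch alarm message. */
final class AlarmMessageHelper {
    static final String LOG_GROUP_PREFIX = "/aws/lambda/";

    /** Returns the value of the FunctionName dimension, or null if it is not present. */
    static String functionName(List<Map<String, String>> dimensions) {
        for (Map<String, String> dimension : dimensions) {
            // Assumption: each parsed dimension exposes "name" and "value" keys.
            if ("FunctionName".equals(dimension.get("name"))) {
                return dimension.get("value");
            }
        }
        return null;
    }

    /** Builds the CloudWatch Logs group name for a Lambda function. */
    static String logGroupName(String functionName) {
        return LOG_GROUP_PREFIX + functionName;
    }

    /**
     * Returns {startMillis, endMillis} for the log query: a window that ends at the
     * alarm time and reaches back lookbackSeconds (an assumed, tunable value).
     */
    static long[] searchWindow(long alarmTimeMillis, int lookbackSeconds) {
        return new long[] { alarmTimeMillis - lookbackSeconds * 1000L, alarmTimeMillis };
    }
}
```

The log group name and the time window would then be passed, together with the error filter pattern, to the log query made through awsLogClient.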

Sample code MonitorCloudWatchLambdaErrors function

Great Improvements!

  • You may also receive SNS notifications when the CloudWatch alarm state changes to INSUFFICIENT_DATA or OK. If an alarm’s state was ALARM and is now OK or INSUFFICIENT_DATA, you can close the alert by using the alert’s alias. This allows you to auto-close alerts in OpsGenie.
  • If ALARM messages keep arriving and you do not close the alert, new alerts will not be created in OpsGenie. Instead, the count field of the existing alert will be increased.
  • If you operate in multiple regions, pass the region name as an argument to the CreateLambdaCloudWatchAlarm function. You can also get the current region’s name from the invoked function’s ARN.
  • Automate the deployment of the Lambda functions and their triggers.
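
Two of the ideas above lend themselves to a short sketch: choosing what to do with an alert based on the alarm’s old and new states, and reading the region out of a Lambda function ARN. AlarmStateHandling, its Action enum, and its method names are hypothetical. The sketch assumes the old and new state values have already been extracted from the alarm message, and that the ARN has the usual colon-separated layout with the region in the fourth position.

```java
/** Hypothetical helper: alert handling based on alarm state transitions. */
final class AlarmStateHandling {
    enum Action { CREATE_OR_UPDATE_ALERT, CLOSE_ALERT, IGNORE }

    /** Chooses an action from the alarm's previous and current state. */
    static Action decide(String oldState, String newState) {
        if ("ALARM".equals(newState)) {
            // With the function name as alias, repeated ALARM messages
            // increase the count of the open alert instead of creating a new one.
            return Action.CREATE_OR_UPDATE_ALERT;
        }
        if ("ALARM".equals(oldState)) {
            // ALARM -> OK or ALARM -> INSUFFICIENT_DATA: close the alert by alias.
            return Action.CLOSE_ALERT;
        }
        return Action.IGNORE;
    }

    /** Extracts the region from an ARN such as arn:aws:lambda:us-west-2:123456789012:function:name. */
    static String regionFromArn(String functionArn) {
        String[] parts = functionArn.split(":");
        if (parts.length < 4 || !"arn".equals(parts[0])) {
            throw new IllegalArgumentException("Not a valid ARN: " + functionArn);
        }
        return parts[3];
    }
}
```

In a Java Lambda handler, the ARN passed to regionFromArn would typically come from the invocation context (for example, the Context object’s getInvokedFunctionArn() method).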

Note: The shared code is only for demonstration purposes and should not be used in production as is.

 

We are happy to answer any questions you may have! Contact us here!