Monitoring is an important part of Dev & Ops. You need to monitor your servers, containers, and now your serverless functions. Function as a Service (FaaS), also called “Serverless” architecture, is a relatively new concept. AWS Lambda is by far the most popular FaaS/Serverless solution on the market. It is an event-driven, serverless computing service.
AWS Lambda eliminates most of the problems you experience when you manage your own servers such as:
In this blog, we will highlight the logging aspect of the AWS Lambda service at a high level. AWS Lambda uses AWS CloudWatch for logging. All log statements your functions write, and all exceptions they throw, end up in AWS CloudWatch. On CloudWatch, you can create a metric alarm that sends an SNS message whenever the metric value reaches the threshold you defined.
For AWS Lambda, there are four types of metrics: Invocations, Errors, Duration, and Throttles. You can find more information about Lambda metric alarms here.
In this blog, we will focus on the “error” metric, which is incremented by events such as:
Now, let’s look at the problems we faced at Opsgenie when monitoring our own Lambda functions. The goal, of course, is to get notified with actionable information whenever an error occurs in our Lambda functions, which we monitor as carefully as our servers. However, we ran into the following issues:
Issue 1: Creation of necessary alarms. When you create a new Lambda function, no alarms are generated by default. We needed a way to create alarms automatically with the right metrics.
Issue 2: No actionable data. When an “error” metric alarm is triggered, a message is sent to an SNS topic. The data CloudWatch puts in that SNS message does not help much, because it does not explain the root cause of the exception. We only get the alarm’s name, description, state history, and metric information. There is no indication of the real problem, although it exists in the logs.
There are common solutions to these problems that work well enough, but we decided not to use them:
For Issue 1:
For Issue 2:
To address these issues, we implemented two new Lambda functions, which Opsgenie runs together. The main building blocks of the solution are described below, along with code samples. (The code is in Java, but it should be straightforward to implement the same approach in other Lambda-supported languages.)
Two Lambda functions to automate monitoring and alerting:
Create the necessary IAM roles for each function. (Note that you may need to add more actions if you use other services, such as S3, to get secret keys.)
Metric Alarms. This function creates or updates CloudWatch alarms for all Lambda functions every X minutes. The alarms are configured to send an SNS message to the lambdaAlarm topic if Errors > 0 for 60 seconds.
What is the benefit of this function? You no longer have to create your alarms manually. Yeah!
The limitation of this function is that, because it runs periodically, newly created functions may not get their CloudWatch alarms immediately. To create the alarms sooner, you can decrease the interval or trigger the function manually.
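As a rough illustration of the alarm settings described above (Errors > 0 over a 60-second period, notifying the lambdaAlarm topic), here is a minimal plain-Java sketch. It is only an illustration and deliberately avoids the AWS SDK: `ErrorAlarmSpec` and `LambdaAlarmPlanner` are hypothetical names, the `<functionName>-Errors` naming scheme, the `Sum` statistic, and the single evaluation period are assumptions, and the topic ARN is whatever you pass in. The fields mirror the parameters a CloudWatch alarm needs; actually sending them to CloudWatch is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Settings for one "Errors > 0" alarm; fields mirror CloudWatch alarm parameters. */
final class ErrorAlarmSpec {
    final String alarmName;          // assumed naming scheme: "<functionName>-Errors"
    final String functionName;       // value of the FunctionName dimension
    final String alarmActionArn;     // SNS topic notified when the alarm fires
    final String namespace = "AWS/Lambda";
    final String metricName = "Errors";
    final String statistic = "Sum";  // assumption: total errors within the period
    final String comparisonOperator = "GreaterThanThreshold";
    final double threshold = 0.0;
    final int periodSeconds = 60;
    final int evaluationPeriods = 1; // assumption: one breaching period is enough

    ErrorAlarmSpec(String functionName, String alarmActionArn) {
        this.functionName = functionName;
        this.alarmActionArn = alarmActionArn;
        this.alarmName = functionName + "-Errors";
    }

    /** True when the summed error count for one period would trip the alarm. */
    boolean isBreachedBy(double errorSum) {
        return errorSum > threshold;
    }
}

final class LambdaAlarmPlanner {
    /** Builds one alarm spec per function; a real function would send each to CloudWatch. */
    static List<ErrorAlarmSpec> planAlarms(List<String> functionNames, String topicArn) {
        List<ErrorAlarmSpec> specs = new ArrayList<>();
        for (String name : functionNames) {
            specs.add(new ErrorAlarmSpec(name, topicArn));
        }
        return specs;
    }
}
```

Because the alarm name is derived from the function name, repeated runs produce the same names, which suits a function that runs every few minutes and creates or updates alarms rather than duplicating them.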
This is what automatically created CloudWatch alarms will look like:
This function receives events from the lambdaAlarm SNS topic. The SNSEvent object contains all the data needed to create an explanatory alert in Opsgenie. The function automatically retrieves data from the logs and adds it to the alert as details. This additional log data helps developers see what happened before they dig into the log details in CloudWatch.
The automatically retrieved logs are very important because they give a quick overview of the error behind the alert.
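The enrichment step described above can be sketched in plain Java as follows. This is a simplified illustration under several assumptions: `AlarmAlertBuilder` and its method names are hypothetical, the alarm name, function name, and state-change time are assumed to have already been extracted from the SNS message, and the log lines are assumed to have already been fetched from CloudWatch Logs. Looking back exactly one alarm period from the state change is a heuristic chosen for the example. The sketch relies on the fact that Lambda writes each function's logs to the `/aws/lambda/<functionName>` log group.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AlarmAlertBuilder {
    /** Lambda writes each function's logs to the log group /aws/lambda/<functionName>. */
    static String logGroupFor(String functionName) {
        return "/aws/lambda/" + functionName;
    }

    /** Start of the log search window: one alarm period before the state change (heuristic). */
    static Instant windowStart(Instant stateChangeTime, int periodSeconds) {
        return stateChangeTime.minus(Duration.ofSeconds(periodSeconds));
    }

    /** Keeps only the last maxLines entries so the alert stays readable. */
    static List<String> lastLines(List<String> logLines, int maxLines) {
        int from = Math.max(0, logLines.size() - Math.max(0, maxLines));
        return logLines.subList(from, logLines.size());
    }

    /** Assembles key/value details that could be attached to an alert. */
    static Map<String, String> buildDetails(String alarmName, String functionName,
            Instant stateChangeTime, int periodSeconds, List<String> logLines, int maxLines) {
        Map<String, String> details = new LinkedHashMap<>();
        details.put("alarmName", alarmName);
        details.put("functionName", functionName);
        details.put("logGroup", logGroupFor(functionName));
        details.put("logsFrom", windowStart(stateChangeTime, periodSeconds).toString());
        details.put("logsTo", stateChangeTime.toString());
        details.put("recentLogs", String.join("\n", lastLines(logLines, maxLines)));
        return details;
    }
}
```

Capping the number of log lines keeps the alert short enough to scan at a glance, while the log group and time window tell the developer where to look in CloudWatch for the full context.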
Note: The shared code is only for demonstration purposes and should not be used in production as is.
We are happy to answer any questions you may have! Contact us here!