Monitoring applications on the cloud - Part Zero

I’ve been thinking about the impact of the “cloudification” of technology infrastructure on IT operations management, and particularly on monitoring. Unfortunately, every time I want to write about something, I feel like I need to write about a lot of other things first, just to provide the context. Monitoring as a discipline covers a surprisingly vast area. What I wanted to write about was the management/monitoring capabilities needed to manage production applications running on (private or public) server instances provided as a service (aka IaaS). I’ll refer to this as “managing applications on the cloud” for brevity, and hope that it does not cause too much confusion.

So first in this post, I’ll attempt to describe the management disciplines that are relevant to managing production applications running on the cloud. Hopefully the posts to follow will make more sense with the provided context.

Log management

Access to the log files is probably the most essential management requirement for operations. To be sure, one does not need any special “tools” to view the log files. Operations can indeed have shell access to the servers, and view the log files using tools like grep and tail. However, it is safe to say that this is not good practice for a number of reasons:

  • Giving shell access to production servers increases operational risk. Sure, it can be managed with access rights, but that also introduces overhead and may not always work as intended.
  • It is typical for cloud applications to have many instances of the same application component running across different server instances (virtual or physical). In this type of environment, looking for errors means accessing all the servers and looking at each of the log files, which can be quite painful.
  • Logs can be quite verbose, and application exceptions often consist of dozens of lines (particularly in Java apps). It is very difficult to process this information in a command line window.
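To illustrate the last point, here is a minimal sketch of how a log tool might fold a multi-line exception into a single event before displaying or indexing it. The timestamp pattern and the sample lines are assumptions made up for the example; real log formats vary, and this is not how any particular product does it.

```python
import re

# Assumption: every new log event starts with a date such as "2013-02-11 ".
# Any line that does not match is treated as a continuation (e.g. a stack frame).
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def group_events(lines):
    """Merge continuation lines into the event that precedes them."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


# Hypothetical sample: one error with a two-line Java stack trace, then one info line.
sample = [
    "2013-02-11 10:15:02 ERROR Request failed",
    "java.lang.NullPointerException",
    "\tat com.example.Orders.load(Orders.java:42)",
    "2013-02-11 10:15:03 INFO Retrying",
]

events = group_events(sample)
print(len(events))  # 2 events instead of 4 raw lines
```

With the exception kept together as one event, searching for "NullPointerException" returns the whole stack trace rather than a single orphaned line.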

Centralizing the logs in a searchable repository is necessary regardless of where applications are hosted, but it is even more essential when applications run on the (public or private) cloud. The solution should provide users not only with real-time access, similar to tailing a file, but also with the capability to browse and query historical logs.

Although applications can send logs directly over the network to a central repository, most people consider this type of coupling an unnecessary risk, and the use of connectionless protocols like UDP introduces the risk of logs getting lost. As such, aggregation of the logs often requires some sort of “agent” on the server instances to ship the logs to the central repository. The agent can be basic, with minimal overhead, and simply ship the log files, or it can do some filtering and parsing as well. If the applications are running on the cloud, a basic agent becomes more appealing, as it has much less chance of impacting the performance of the applications running on the server instance.
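The core of a “basic” shipping agent can be very small. The sketch below shows only the part that picks up newly appended lines from a log file; the function name and the rotation handling are my own assumptions for illustration, and the actual shipping to a repository is left out.

```python
import os


def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset)."""
    # If the file is now smaller than our offset, assume it was rotated
    # or truncated and start again from the beginning.
    if os.path.getsize(path) < offset:
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only return complete lines; a partially written last line is left
    # in place and picked up on the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

An agent would call this in a loop every second or so, hand the returned lines to whatever transport it uses, and persist the offset so that it can resume after a restart. Doing no parsing at all on the server is what keeps the overhead low.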

In the enterprise world, Splunk is by far the most common solution for this purpose. The Splunk Universal Forwarder is a lightweight agent that only ships the logs to the central Splunk server. Splunk can parse most log files out of the box and provides a nice user interface to work with the logs. Splunk Storm, the SaaS based version of Splunk, recently came out of beta, and although it’s missing some key features so far, it seems to be catching up quickly.

The Logstash and ElasticSearch combination is the open source alternative. As is often the case with open source, this option appeals to the do-it-yourself crowd, and requires integration of various components and some development. Papertrail and Loggly are other SaaS solutions for log management.

Server monitoring

Monitoring of server resource utilization (CPU, memory, disk IO, disk space, network IO, etc.) and of the processes running on the server is probably what most people think of when they refer to monitoring. The information gathered is typically used for troubleshooting, fault management, performance management, capacity planning, etc.

Server based agents

Server monitoring has traditionally been done using a server based agent, but nowadays some basic information is available through the hypervisor as well. Traditional server agents typically perform “active checks”: they periodically execute code that checks the availability of application components, collects resource utilization metrics, etc. Server agents provided by the enterprise vendors are known to be quite heavy (high overhead on the server as well as high administrative overhead to deploy and maintain), and hence mostly unusable for monitoring server instances running on the cloud. Unfortunately, the dominant open source option, Nagios, does not fare a lot better in terms of administrative overhead. A number of new generation SaaS based monitoring providers, such as Datadog, New Relic and CopperEgg, provide server based agents.

Another problem with the traditional agents is that resource utilization metrics collected once every couple of minutes often lack the granularity needed to debug problems, while increasing the frequency to, say, sub-minute intervals may increase the load on the server to a level that is not acceptable.
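A small worked example shows why coarse intervals hide problems. The numbers are made up: a server idling at 20% CPU with a single 10-second burst at 100%, sampled once per second.

```python
# Hypothetical per-second CPU utilization (%) over a 2-minute window:
# 20% throughout, except for one 10-second burst at 100%.
samples = [20.0] * 120
samples[60:70] = [100.0] * 10


def averages(values, interval):
    """Average consecutive chunks of `interval` samples."""
    result = []
    for i in range(0, len(values), interval):
        chunk = values[i:i + interval]
        result.append(sum(chunk) / len(chunk))
    return result


two_minute = averages(samples, 120)
ten_second = averages(samples, 10)

print(max(two_minute))  # roughly 26.7: the burst is barely visible
print(max(ten_second))  # 100.0: the burst is obvious
```

At a two-minute interval the burst raises the reported value by less than seven points, which is easy to dismiss as noise; at ten-second granularity it is unmistakable.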

Passive agents

Going forward, use of server based agents to perform solely periodic active checks to gather data and monitor applications will likely continue to diminish. Better options have been emerging in the market. AppFirst is a SaaS based monitoring solution with an agent technology that passively collects detailed data by listening to calls made by the applications to the operating system. It can not only collect resource utilization data for the server but can also track processes individually (CPU usage, number of open files, network connections, threads etc.) with little overhead.

Log collection agents

Another option that has emerged is to use the agent deployed for log monitoring to collect server monitoring data, metrics, faults, etc. as well. Splunk, for instance, provides “apps” to collect and visualize resource utilization metrics, leveraging the log processing infrastructure already in place. Logstash has the capability to forward performance metrics to various products such as Graphite and OpenTSDB, and services such as Circonus and Librato, and it can forward events elsewhere as well. However, Logstash lacks the ability to collect the resource utilization data itself. Additional code, scripts, etc. would need to be deployed on the server and executed periodically to provide the data through Logstash. Some application developers embed the code to do this into their applications, dumping the data to log files periodically for Logstash to process and ship to a time series database or to an event repository.
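As a rough idea of what such “additional code” could look like, the sketch below collects one metric and formats it as a line in Graphite’s plaintext format (metric path, value, Unix timestamp). The metric name `servers.web01.disk.root.used_pct` is a hypothetical naming scheme, and how the log shipper is configured to pick up and forward the line is not shown.

```python
import shutil
import time


def graphite_line(path, value, timestamp):
    """Format one sample as '<metric path> <value> <unix timestamp>'."""
    return f"{path} {value} {int(timestamp)}"


def disk_used_percent(mount="/"):
    """Percentage of the filesystem at `mount` that is in use."""
    usage = shutil.disk_usage(mount)
    return round(100.0 * usage.used / usage.total, 1)


# Hypothetical metric name; run periodically (e.g. from cron) and append
# the output to a file that the log shipper is watching.
print(graphite_line("servers.web01.disk.root.used_pct",
                    disk_used_percent(), time.time()))
```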

Application (availability and response time) monitoring

Monitoring the application components using a server based agent can be misleading, from both availability and performance standpoints, as it does not reflect how users access the application. In addition, installing an agent on every server instance is difficult, if not unviable, for many organizations. As such, many organizations employ methods to monitor the availability of the applications from outside using “synthetic transactions”. This approach is also referred to as “agentless monitoring”.

Synthetic transactions are essentially active checks that simulate users or application components, such as requesting a web page via HTTP, resolving a host name in DNS, etc. Synthetic transactions are executed from one or more external locations. There are numerous products and services with varying strengths and weaknesses in this area. To name just a few, Nagios is probably the most popular open source solution, as it can not only run standard checks but can also be extended with custom checks. The major shortcoming of Nagios seems to be that it’s quite painful to operate at scale. OpenNMS is a highly scalable open source solution, typically favored by folks who need to monitor a large number of servers and the apps running on them with a wide selection of checks. Rackspace Cloud Monitoring, CopperEgg and Circonus are some of the companies offering granular (1 minute or less), API driven checks from multiple locations for the most common web services. However, (AFAIK) these solutions do not offer sophisticated multi-step checks such as simulating a user logging in to a web app, clicking through several pages, filling in a form, etc.
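The simplest kind of synthetic transaction, a single HTTP request that records availability and response time, can be sketched in a few lines. The function name, the returned fields and the rule that any 2xx or 3xx status counts as “up” are assumptions for this example, not the behavior of any of the products above.

```python
import http.client
import time
import urllib.error
import urllib.request

# Connect directly, ignoring any proxy settings found in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_http(url, timeout=5.0):
    """Request `url` once and report availability and response time."""
    start = time.monotonic()
    status = None
    try:
        with _opener.open(url, timeout=timeout) as response:
            response.read()  # include the body download in the timing
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        pass  # DNS failure, refused connection, timeout, broken response
    elapsed = time.monotonic() - start
    return {
        "ok": status is not None and 200 <= status < 400,
        "status": status,
        "seconds": elapsed,
    }
```

A monitoring service would run something like `check_http("https://www.example.com/")` every minute from several locations and alert when `ok` is false or `seconds` crosses a threshold. Multi-step checks need considerably more than this, typically a scripted browser.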

For public facing web applications Compuware Gomez and Keynote provide a somewhat different monitoring service. They execute synthetic transactions from thousands of computers and mobile devices distributed globally running actual browsers, and offer advanced scripting to simulate complex user interactions.

Although it is possible to use synthetic transactions and agent based server monitoring to the same ends, the endless agent based vs agentless discussions mostly miss the mark. These capabilities mostly complement each other, and both are essential ingredients of a robust monitoring solution. External checks can determine problems as perceived by users more accurately, while server monitoring can be instrumental in determining the cause of a problem, and in preventing problems from impacting users in the first place.

Application performance monitoring

Applications have their own metrics indicating the performance of the application, as well as business metrics (number of users, credit card transactions, etc.). Attempts to establish standards for collecting application performance data have failed, so there are a number of different methods to collect application performance metrics.

Extensions to server based agents

Using the server based agents to monitor the performance of application components has been the traditional approach. Most agents can be extended, either by configuration (check these ports, responses, etc.) or scripts, to check the availability and performance of the application components running on the same server instance or on other instances.

For example, there are thousands of Nagios plugins to monitor anything and everything from applications to routing protocols, and it’s straightforward to add your own plugins with custom checks. AppFirst (mentioned above) has a pragmatic approach and leverages this vast set of available Nagios plugins to monitor application availability and performance. Hyperic is another monitoring solution that provides an agent with a large set of plugins as well as support for custom plugins.
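Part of the reason custom checks are straightforward is that the Nagios plugin convention is tiny: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal sketch of a disk usage check following that convention; the thresholds and the message wording are arbitrary choices for the example.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit code, status line)."""
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_pct:.1f}% used"


def main(path="/"):
    """Run the check, print the status line and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        return UNKNOWN
    code, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return code

# A real plugin script would finish with: sys.exit(main())
```

Because the contract is just an exit code and a line of text, the same script can be run by any agent that understands the convention.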

Although this approach is mostly used for availability monitoring, it is also used to collect performance metrics. Its weaknesses include a lack of granularity (hence the potential to miss intermittent problems) and the fact that it simulates only a small subset of actual application transactions.

Application components collecting data themselves

For in-house developed applications, often the best application performance metrics can be collected by the applications themselves. Collected data can be pushed to another system/process, written to files, etc. As it is for the logs, sending performance metrics over the network directly to a repository is a possibility, but not without its problems. Hence, this option is particularly appealing for organizations that have already deployed a log monitoring agent like Splunk or Logstash, and have a time series data repository such as OpenTSDB or Graphite in place, and can collect and store the data easily.

Another option is using an agent specifically for this purpose. StatsD and its variants have emerged as a common solution. There are StatsD client libraries in almost every language, and the use of UDP means that sending a metric has very little impact on application performance, at the cost of occasionally losing a data point. The AppFirst and Datadog agents include embedded StatsD daemons, enabling them to receive metrics from applications.
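To give a feel for how little is involved on the application side, here is a minimal sketch of a StatsD client without any library. Each metric is a single UDP datagram in the form `name:value|type`, where the type is `c` for a counter, `ms` for a timer or `g` for a gauge. The metric names are made up, and 8125 is the port StatsD listens on by default.

```python
import socket


def statsd_packet(name, value, metric_type):
    """Format one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire and forget: send one UDP datagram and do not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))


# Hypothetical usage inside an application:
send_metric("checkout.completed", 1, "c")     # increment a counter
send_metric("checkout.db_query", 12, "ms")    # record a 12 ms timing
```

If no daemon is listening, the datagrams are simply dropped and the application carries on, which is exactly the trade-off that makes this approach attractive.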

Agents running in application servers

Today most web applications use application servers as part of the solution. One highly successful approach to gathering application performance metrics with little effort has been running an agent on the application server to monitor all application activity. Since most application traffic flows through the application servers, peeking into the application server activity can provide powerful insights into application performance and help identify problems. The shortcoming of this approach is that it requires deployment of an agent on the application server. The agent also introduces some performance overhead, which varies depending on the agent.

CA APM (Wily) has been the pioneer of this approach and is still widely used in large enterprises. Stackify offers tools that combine monitoring, errors, logs, and metrics. New Relic provides this technology as a SaaS solution, making it available to the masses. For example, one can install the New Relic Java agent, restart the application server, and within minutes observe the performance of the applications running on that application server, as well as their interactions with back end services, databases, etc. This technology can help identify not only operational issues, but also problems in the code, slow SQL statements, etc. AFAIK, there are no viable open source projects providing these capabilities.

Network based tools

A rather different approach is determining application performance by analyzing the network traffic. The network appliances used for this typically receive traffic from a mirrored port on the switches that the servers are connected to (other capture techniques exist as well). They can analyze the traffic to figure out the performance of real user transactions, as well as of transactions between application components and back end services.

The fundamental appeal of this approach is that it can be deployed without any changes to the application on the server or the client side (no agents, code changes, etc.), though in practice some changes to the configuration or application code (to be able to stitch together transactions spanning multiple servers, etc.) seem to improve the quality of the analysis.

Another advantage is that this approach does not introduce any performance overhead on the servers, as the appliances passively process mirrored traffic. The weakness of this approach is that it requires a hardware device to be deployed on the network, which may not always be feasible. Another problem is that in virtual environments, the traffic between VMs may not go through physical switches at all if the VMs are running on the same host. In this case, the suggested solution seems to be deploying a VM on each of the physical hosts to capture the network traffic on that host, which somewhat departs from the promise of easy deployment and no overhead. ExtraHop provides a product that uses this approach.

Configuration management

A configuration management system is needed to deploy software and to make and track configuration changes in an automated, repeatable, testable manner. Having shell access to production servers and installing applications manually is a high risk endeavor. It is easy to introduce errors that can cause outages, and errors introduced this way are typically very hard to find afterwards. It is also considered a security risk, and hence may not be acceptable in risk averse organizations. Although it is possible to automate the process using scripts and ssh into the servers, the more common approach is to have an agent running on the server.
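One way to make changes repeatable is to describe the desired end state and apply it idempotently, so that running the same step twice leaves the server unchanged the second time. The sketch below illustrates that idea for a single line in a configuration file. It is a toy written for this post, with a made-up function name, and not how any particular configuration management tool is implemented.

```python
import os


def ensure_line(path, line):
    """Make sure `line` is present in the file; return True if it was added."""
    existing = []
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = f.read().splitlines()
    if line in existing:
        return False  # already in the desired state, nothing to do
    existing.append(line)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(existing) + "\n")
    return True
```

The return value doubles as a change record: a run that reports no changes confirms that the server already matches the description, which is what makes this style of automation testable.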

Puppet and Chef are the most popular open source configuration management tools with large communities. Chef is also available as a hosted service. Glu and Ansible are some of the less well known alternatives with smaller communities (also open source). There are also tools more focused on application deployment like Capistrano.