ASAPLabs

Enterprise-level hosting and DevOps outsourcing service

How to Set Up Server Monitoring the Right Way

22 June 2018

Any Web application is constantly changing, both internally (new functions and technologies) and in its external conditions (the size and activity of the audience). The quality of the application depends critically on correct and timely diagnosis. Because Web applications change all the time, diagnostics can no longer be an occasional activity; it has to be continuous. It is important not just to know as much as possible about the system, but also to learn about changes as quickly as possible. This is the task of monitoring.

There are three main components of the monitoring system:

  • Status monitoring. Check the operation status of the components.
  • Monitoring of trends. Collect changes in indicators over time and analyze them afterwards.
  • Business monitoring. Watch for deviations in business indicators.

Status monitoring

The task of status monitoring is to constantly check all components of the system for the correctness of their operation.

For example:

  • Is the MySQL database running?
  • Is there free space on the hard drive?
  • Is Nginx responding correctly to requests?
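Checks like these can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Monit, Prometheus, or Zabbix implement their checks: the function names are hypothetical, and an open TCP port is treated as a rough sign that a service such as MySQL or Nginx is alive.

```python
import shutil
import socket


def disk_free_percent(path="/"):
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_status_checks(checks):
    """Run named zero-argument checks and return {name: True/False}.

    A check that raises is recorded as failed instead of aborting the run.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results
```

A caller could pass a dictionary such as `{"mysql": lambda: port_is_open("127.0.0.1", 3306), "disk": lambda: disk_free_percent("/") > 10}`. The host, ports, and the 10% limit here are assumptions for illustration. A real check for Nginx would also verify the HTTP response, not just the open port.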

The most popular solutions:

  • Monit is the simplest choice. It can run checks and perform certain actions in emergency situations (for example, it may attempt to restart a crashed process). Good for small projects.
  • Prometheus is a popular professional tool with a broad feature set. It has a dimensional data model, a flexible query language, and a modern approach to alerting.

To monitor the status of our clients’ servers, we at ASAPLabs use Zabbix, which describes itself as the ultimate Enterprise-class monitoring platform. It is highly customizable and suits setups where many components of different types of projects run simultaneously.

The main rule when configuring monitoring is to check as many system indicators as possible. The more we know, the better.

Notifications

The main task of status monitoring is to report problems. In practice, this is usually an email or an SMS message. About 90% of the effectiveness of monitoring depends on configuring notifications correctly.

First of all, it is very useful to have a dashboard with the most important metrics. Most often those are the nodes that are directly responsible for generating a response to the user’s request:

  • Web server availability
  • Databases
  • Backends

Setting up notifications usually follows these principles:

  1. Selecting the parameters. Not every parameter needs a notification. Some of them are fundamental (for example, the availability of a Web server). Others are auxiliary (for example, the number of open file descriptors).

  2. Setting the priority. Two groups of parameters should be distinguished:
    • High priority means a critical problem. This should cover about 5% of all indicators: those that would lead to a disaster if neglected. Usually, these are the availability of all nodes (ping), the CPU utilization rate, and free space in RAM and on the hard drive.
    • Low priority means a problem that needs to be noted and addressed. These are the metrics that can cause serious problems if nobody reacts to them in the near future.
  3. Determining the trigger thresholds. For many metrics, two thresholds should be selected: a regular warning (low priority, for example, 10% of disk space left free) and one that requires an instant reaction (high priority, 1% free).
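The two-threshold idea from step 3 can be sketched as a small function. The 10% and 1% defaults come from the example above; the function name and the returned labels are assumptions made for illustration.

```python
def classify_free_space(free_percent, warn_below=10.0, critical_below=1.0):
    """Map a free-disk-space percentage to a priority, or None if healthy.

    Returns "high" below the critical threshold, "low" below the warning
    threshold, and None otherwise.
    """
    if critical_below >= warn_below:
        raise ValueError("critical_below must be smaller than warn_below")
    if free_percent < critical_below:
        return "high"
    if free_percent < warn_below:
        return "low"
    return None
```

Keeping both thresholds in one place makes it harder to configure a warning that fires later than the critical alert.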

Setting up notifications is not a one-time job. It should be done constantly because priorities change and new metrics appear. Follow these rules:

  • Avoid blindness. If you have received a notification repeatedly and did not react to it in any way, the notification should be disabled.
  • Avoid overflow. The monitoring system should interrupt you only for important matters. It should not turn into a digest of dozens of improvement ideas to consider some day.
  • Use different notification mechanisms. For example, reserve SMS for the most important cases, use email for messages of medium importance, and send low-priority notifications to a log file or a secondary mailbox.
  • Enable a notification if you are in doubt. It is better to find out that the notification is useless and disable it later.
  • Consider the variance of the values. Monitoring such indicators as the Load Average can be a problem. This indicator can jump out of limits several times a day simply because of the nature of the loads. Increase the threshold gradually until the notification becomes useful.
  • Continuously check the notification system itself. Sometimes mail delivery breaks, or the SMS account runs out of funds. Be sure to set up delivery of notifications through a backup monitoring system.
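The rule about different notification mechanisms can be sketched as a routing table. The priority names, the channel names, and the `senders` callables are all hypothetical; real delivery code (an SMS gateway, an SMTP client, a log writer) would be supplied by the caller.

```python
# Assumed mapping from priority to delivery channel, following the rule above:
# SMS for the most important cases, email for medium, a log for low priority.
DEFAULT_ROUTES = {
    "high": ["sms"],
    "medium": ["email"],
    "low": ["log"],
}


def route_notification(priority, message, senders, routes=DEFAULT_ROUTES):
    """Send `message` through every channel configured for `priority`.

    `senders` maps a channel name to a callable taking the message text.
    Returns the list of channels the message was handed to.
    """
    if priority not in routes:
        raise ValueError(f"unknown priority: {priority!r}")
    delivered = []
    for channel in routes[priority]:
        senders[channel](message)
        delivered.append(channel)
    return delivered
```

Because the table is plain data, moving a noisy alert from SMS to email is a one-line change rather than a code change.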

For our servers, we went further and added another system that monitors Zabbix itself. It ensures that all notifications work properly. If the primary notification system becomes unstable or fails to send a warning, the secondary system warns us about it immediately.
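One common way to watch a monitoring system is a heartbeat: the primary system reports in regularly, and a secondary process raises an alarm when the reports stop. The sketch below illustrates that idea only. It is an assumption, not a description of the ASAPLabs setup or of any Zabbix feature, and the class name is hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the primary monitoring system last reported in."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self):
        """Record a heartbeat received from the primary system."""
        self._last_beat = self._clock()

    def primary_is_silent(self):
        """Return True if no heartbeat arrived within the allowed window."""
        return self._clock() - self._last_beat > self.max_silence_seconds
```

The secondary system would call `primary_is_silent()` on a timer and send its warning through a channel that does not depend on the primary one.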

Monitoring of trends

Knowing the current status of the system is not enough to make predictions. Clearly, it is better to prevent a problem than to react to it. This requires systems that collect and store historical data on how the indicators change. Such systems work in the same way as status monitoring systems, but they usually collect many more indicators and store the entire history of their changes.

The most popular solutions are:

  • Grafana is a convenient and commonly used open source platform.
  • Graphite is an enterprise-ready monitoring tool that runs equally well on physical hardware or Cloud infrastructure.
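As an illustration of how an indicator reaches such a system, Graphite's Carbon daemon accepts a plaintext protocol with one `path value timestamp` line per data point, by default on TCP port 2003. The sketch below formats and sends such lines. The metric path in the usage note and the `localhost` host are assumptions.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="localhost", port=2003, timeout=5.0):
    """Send already formatted lines to a Carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, `format_graphite_line("servers.web1.load_avg", 0.75, 1529625600)` produces the line `servers.web1.load_avg 0.75 1529625600`, which `send_to_graphite` can deliver to a running Carbon instance.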

Analytics and forecasts

Analyzing historical data makes it possible to predict when scaling will be needed. In addition to the usual metrics, such as CPU utilization and the amount of available memory, the following higher-level indicators should be included:

  • The number of requests per second on Nginx, PHP, MySQL, etc.
  • The number of threads and processes.
  • The size of the queues (for example, on the mail server or the task system).
  • Page generation time.
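A simple forecast can be made by fitting a straight line to the history of an indicator and extrapolating it to a limit. The sketch below uses ordinary least squares and assumes that the indicator grows roughly linearly toward the limit. That is a simplification, real load often follows daily or weekly cycles, and the function names are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        raise ValueError("all x values are identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def time_until_limit(points, limit):
    """Estimate how long after the last sample the trend reaches `limit`.

    Returns None when the fitted trend is flat or decreasing, and 0.0 when
    the fitted line has already reached the limit.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    last_x = max(x for x, _ in points)
    crossing_x = (limit - intercept) / slope
    return max(crossing_x - last_x, 0.0)
```

With daily samples of disk usage in percent, the result is the estimated number of days left before the limit is reached, in the same units as the x values.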

Trend collection systems also allow you to customize thresholds and notifications. Thresholds should be set slightly lower than in status monitoring, so that they trigger earlier. This will allow you to receive advance notice of possible future problems.

We constantly track the performance of our servers, build graphs, and analyze them to reveal trends. Our monitoring algorithm allows us to see growing trends live, foresee where they lead, and react quickly. If we see a clear upcoming threat to server stability, we take immediate action to prevent or withstand it.

Business metrics and real-time tracking

The normal operation of all components of the application does not always mean the proper operation of the application itself. Problems such as a broken registration form or an incorrect link in an email will not be reflected in the monitoring systems mentioned above. Many problems can also be temporary or limited in scope, for example, an unavailable social login service or slow page loads for users from a particular region.

That is why you need to monitor business metrics. Many analytics systems, such as Google Analytics, allow you to conduct a detailed analysis of historical data. However, such tools are inconvenient for detecting deviations in real time.

There are tools that collect simple statistics and display them live, such as ioTrack. Integration is as simple as adding a counter to certain events.
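The idea of adding a counter to certain events can be sketched with an in-memory counter that groups events into one-minute buckets. This is not ioTrack's API, which is not shown here; the class and the event names are hypothetical, and a real tool would also persist and display the counts.

```python
import time
from collections import Counter


class EventCounter:
    """Counts named events in fixed-size time buckets (one minute by default)."""

    def __init__(self, bucket_seconds=60, clock=time.time):
        self.bucket_seconds = bucket_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._counts = Counter()

    def _bucket(self, timestamp):
        return int(timestamp // self.bucket_seconds)

    def track(self, event, amount=1):
        """Record `amount` occurrences of `event` in the current bucket."""
        self._counts[(event, self._bucket(self._clock()))] += amount

    def count(self, event, timestamp=None):
        """Return the count of `event` in the bucket containing `timestamp`."""
        if timestamp is None:
            timestamp = self._clock()
        return self._counts[(event, self._bucket(timestamp))]
```

In application code this amounts to a single call such as `counter.track("registration")` at the point where the event happens.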

Common examples of business-level metrics that can be tracked are:

  • The activity of the audience (number of visits/views).
  • The speed of the application (page load time).
  • Registrations/purchases/comments/other actions.
  • Conversions of various kinds (for example, from advertising campaigns to registrations).
  • High-level metrics related to business logic (for example, the number of processed photos).

Collecting such data makes it possible to detect deviations not only in the operation of the system but also in its environment. Spam attacks, for example, often change several business metrics dramatically even though they may not affect the load on the system.
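One simple way to flag such a deviation is to compare the current value of a metric with its recent history. The sketch below treats a value as a deviation when it lies more than a chosen number of standard deviations from the historical mean. The three-sigma default and the function name are assumptions, and this is only one of many possible detection rules.

```python
from statistics import mean, stdev


def is_deviation(history, current, max_sigmas=3.0):
    """Return True if `current` is far from the mean of `history`.

    "Far" means more than `max_sigmas` sample standard deviations away.
    A perfectly flat history flags any value different from it.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != center
    return abs(current - center) / spread > max_sigmas
```

For example, with hourly comment counts of `[100, 104, 98, 102, 96]`, a new value of 101 is not flagged, while a burst of 400 comments is.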

It is very convenient to display several basic metrics on dashboards shown on separate monitors in the office. This lets you not only stay aware of problems, but also receive “live” information about the application’s performance.

Closing

Remember, the task of monitoring is to provide information about failures in the operation of the server. Setting it up is not a one-time job: monitoring must change together with the application itself.