Monitoring 101

You built your first website. Now consider what could go wrong and keep customers from accessing it. Here are a few examples:

NGINX crashes - Webpages won't be served because nothing is left to handle incoming requests.
EC2 instance is down - The server won't respond to any HTTP requests.
NGINX is up but your certificate has expired - Browsers show a security warning for an expired certificate, which typically reduces customer traffic because visitors can no longer trust that their connection to your site is secure.
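
One way to tell these three failure modes apart automatically is a small TLS probe. The Python sketch below is only an illustration under stated assumptions: the function names days_until_expiry and probe are hypothetical, the site is assumed to serve HTTPS on port 443, and the mapping from network errors to causes is a heuristic (a refused connection usually means the host is up but nothing is listening, while a timeout or other network error usually means the host itself cannot be reached).

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return a one-line status for `host` (hypothetical helper, heuristic only)."""
    context = ssl.create_default_context()  # verifies the chain, hostname and dates
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # Raised for an expired certificate, among other verification failures.
        return f"certificate rejected: {exc.verify_message}"
    except ConnectionRefusedError:
        # The host answered, but no process is listening (NGINX may be down).
        return "connection refused"
    except OSError:
        # Timeouts, DNS failures, unreachable networks (the instance may be down).
        return "host unreachable"
    days = days_until_expiry(not_after, time.time())
    return f"ok, certificate valid for {days:.0f} more days"
```

Calling probe with your own hostname returns one of the status strings above. The specific errors are caught before the generic OSError because both of them are subclasses of it.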

You could periodically access your website to ensure that it's functioning correctly. However, checking manually is cumbersome, and how often you check determines how long an outage can go unnoticed. Instead, you can set up systems that automate this task for you. (If you are already familiar with monitoring and alerting systems like Prometheus, you can skip this section and move on to the next.) Such systems have two components:

Monitoring - This component reports the status of some aspect of your website (e.g. whether NGINX is up, whether the EC2 instance is up, whether the certificate has expired) in the form of a metric.
Alerting - This component sends alerts based on rules that you configure.
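
The split between the two components can be sketched in a few lines of Python. This is a toy model, not how Prometheus is implemented: Sample, AlertRule, and the for_samples parameter are hypothetical names chosen for illustration. The monitoring side only records metric values; the alerting side only decides, from a rule, whether those values warrant a notification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One observation produced by the monitoring component."""
    name: str         # metric name, e.g. "nginx_up"
    value: float      # 1.0 = up, 0.0 = down
    timestamp: float  # seconds since some reference point


@dataclass
class AlertRule:
    """A rule evaluated by the alerting component."""
    metric: str
    is_bad: Callable[[float], bool]
    for_samples: int  # consecutive bad samples required before firing (>= 1)

    def should_fire(self, history: List[Sample]) -> bool:
        relevant = [s for s in history if s.name == self.metric]
        recent = relevant[-self.for_samples:]
        # Fire only when enough samples exist and every one of them is bad.
        return len(recent) == self.for_samples and all(
            self.is_bad(s.value) for s in recent
        )


history = [
    Sample("nginx_up", 1.0, 0),
    Sample("nginx_up", 0.0, 15),
    Sample("nginx_up", 0.0, 30),
]
rule = AlertRule(metric="nginx_up", is_bad=lambda v: v == 0, for_samples=2)
print(rule.should_fire(history))  # True: the last two samples are both 0
```

Requiring several consecutive bad samples is a design choice that avoids alerting on a single failed check.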

NOTE: Here's a book on monitoring if you're curious about the topic and want to explore it in depth.

Let's take a look at monitoring and alerting around a specific example: NGINX.

Monitoring - Whether NGINX is up, expressed as a metric. The following graph shows the nginx_up metric, which is exposed by nginx-prometheus-exporter and collected by Prometheus.
Alerting - A Slack message when NGINX is down. Prometheus Alertmanager sends the alert to Slack using a webhook.
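
To make the data flow concrete, here is a hand-rolled Python sketch of the same idea. In the setup described above, Prometheus and Alertmanager do this work, so the code is an illustration only. The helper names read_gauge, build_slack_payload, and post_to_slack are hypothetical, the parser handles only metric lines without labels, and the webhook URL is something you would supply yourself.

```python
import json
import urllib.request
from typing import Optional


def read_gauge(exposition_text: str, metric: str) -> Optional[float]:
    """Return the value of an unlabelled metric from Prometheus text-format output."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    return None  # metric not present in the output


def build_slack_payload(nginx_up: Optional[float]) -> Optional[dict]:
    """Return a Slack message body when NGINX looks down, else None."""
    if nginx_up == 1.0:
        return None
    reason = "nginx_up is missing" if nginx_up is None else "nginx_up is 0"
    return {"text": f"NGINX is down: {reason}"}


def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


# Illustrative exporter output, not captured from a real server.
sample_output = "# TYPE nginx_up gauge\nnginx_up 0\n"
payload = build_slack_payload(read_gauge(sample_output, "nginx_up"))
print(payload)
```

A missing metric is treated as "down" here because an exporter that cannot be scraped tells you as little about NGINX as one that reports 0.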

The next series of tutorials shows how to set up monitoring and alerting using Prometheus.

Have any feedback?

If you have any feedback regarding this article or need a tutorial on any specific topic, please submit this feedback form.