You all know I started my career with Zabbix as my main monitoring tool. However, the deeper I go into the DevOps world, the more common it is to see architecture that scales and responds to events dynamically. So common that the traditional way of monitoring is beginning to show its age. Unfortunately, it's time to search for an alternative to Zabbix.

The traditional way of monitoring involves pushing data to a monitoring server using an agent installed on the target instance. On top of that, the agent registration process that has to happen before the agent starts its monitoring duties can be cumbersome with ephemeral resources.

What are ephemeral resources? Basically, if you practice blue-green deployments, you are frequently creating a new environment running the latest code base and destroying the former environment. Your instances do not have a lengthy production life; if your development cycle is one week, you're spinning up new instances to house the new code base every week.

If you're using a Zabbix agent on your instances or virtual machines, you'll need to register those resources with the Zabbix server frequently. That isn't handled very elegantly and could put a strain on Zabbix, depending on how often you're deploying new versions of your code into the environment.

This is where Prometheus shines! Prometheus uses a pull method to grab metrics, whereas Zabbix uses a push. A companion component of Prometheus, called node exporter, exposes metrics on a server so that Prometheus can scrape the data.

It may not be as pretty to look at as Zabbix out of the box, but it handles ephemeral instances, as well as persistent ones, much better than Zabbix. And with the Grafana integration, I can create fancy dashboards that display the same metrics as Zabbix, if not better, in an aggregated view.


Overview

I'll be hosting Prometheus on my virtualization server until I'm comfortable enough to host it in the cloud. I'll explore two different setups: the first will be installing Prometheus on a virtual machine, and the second will utilize docker containers. Why am I exploring two different installation methods? I simply want to learn what makes Prometheus tick, which is somewhat obscured when applications are containerized. Installing on a virtual machine takes more time than pulling and running a docker container, but it shouldn't be an issue for an amateur professional linux admin like myself.

diagram

In the diagram above, I will be able to access Prometheus's front end, but it's very limited, which is why I'll be using Grafana to create the dashboards. I'll be installing Grafana and Prometheus on Debian 10 virtual machines.

Prometheus will scrape metrics from the targets (app 1, app 2, and app 3) on port 9100 for Linux hosts or 9182 for Windows machines. To expose those metrics, I'll run node exporter on each target host so Prometheus has something to pull from.

Server IP Addresses

 -----------------------------
| Prometheus | 192.168.12.74  |
 -----------------------------
| Grafana    | 192.168.12.132 |
 -----------------------------
| App 1      | 192.168.12.1   |
 -----------------------------
| App 2      | 192.168.12.33  |
 -----------------------------
| App 3      | 192.168.12.14  |
 -----------------------------

Installing Prometheus

Before I download the binaries I'll prepare the virtual machine.

  1. Create a directory for prometheus

I like to store Prometheus in /appl.

mkdir /appl

2. Create a Prometheus user

useradd --no-create-home --shell /sbin/nologin prometheus

3. Download the binaries to the virtual machine

To install Prometheus, grab the binaries from here. You will want to download the Linux binaries for Prometheus and Alertmanager. Alertmanager is the component of Prometheus that will send alerts to email, instant messaging apps, and paging software.

download

wget -O /appl/prometheus-2.14.0.linux-amd64.tar.gz https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz

wget -O /appl/alertmanager-0.19.0.linux-amd64.tar.gz https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
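
Optionally, verify the downloads against the checksums published with each release (a sha256sums.txt file on the releases page), for example:

sha256sum /appl/prometheus-2.14.0.linux-amd64.tar.gz /appl/alertmanager-0.19.0.linux-amd64.tar.gz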

4. Decompress the tar files

The tarballs were downloaded to /appl, so change into that directory before extracting them.

cd /appl

tar xvf prometheus-2.14.0.linux-amd64.tar.gz

tar xvf alertmanager-0.19.0.linux-amd64.tar.gz

5. Create a symbolic link in PATH

Now we need to create symbolic links (symlinks) in one of the directories specified in PATH so that the Prometheus and Alertmanager executables can be found by the shell.

The three executables I will need are promtool, prometheus, and alertmanager. I will be placing the symlinks in the /usr/sbin directory and pointing them at the respective executables.

ln -s /appl/prometheus-2.14.0.linux-amd64/prometheus /usr/sbin/prometheus

ln -s /appl/prometheus-2.14.0.linux-amd64/promtool /usr/sbin/promtool

ln -s /appl/alertmanager-0.19.0.linux-amd64/alertmanager /usr/sbin/alertmanager

6. Change the ownership of the /appl directory

chown -R prometheus: /appl

The command above changes the ownership of the directory and everything inside it recursively. The new owner will be the prometheus user I created earlier.

7. Creating the prometheus service

In this step I'll create a systemd service so that prometheus can be started by issuing systemctl start prometheus, instead of invoking the prometheus binary and supplying it the necessary parameters.

Since the service file below writes its logs to /var/log/prometheus, create that directory first (mkdir /var/log/prometheus). Then create a new file for the prometheus service: vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/sbin/prometheus \
    --config.file /appl/prometheus-2.14.0.linux-amd64/prometheus.yml \
    --storage.tsdb.path /appl/prometheus-2.14.0.linux-amd64/ \
    --web.console.templates=/appl/prometheus-2.14.0.linux-amd64/consoles \
    --web.console.libraries=/appl/prometheus-2.14.0.linux-amd64/console_libraries
StandardOutput=append:/var/log/prometheus/log_out.log
StandardError=append:/var/log/prometheus/log_err.log

[Install]
WantedBy=multi-user.target
prometheus.service

Next I'll do the same for alert manager. It will run as the same prometheus user, since that user already owns /appl: vim /etc/systemd/system/alertmanager.service

[Unit]
Description=Prometheus Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/sbin/alertmanager \
	--config.file /appl/alertmanager-0.19.0.linux-amd64/alertmanager.yml \
	--storage.path /appl/alertmanager-0.19.0.linux-amd64/data
Restart=always

[Install]
WantedBy=multi-user.target
alertmanager.service

8. Start the services

Once you have created both service files, reload the systemd daemon before starting up the services: systemctl daemon-reload.

Now start Prometheus followed by Alert Manager.

systemctl start prometheus && systemctl start alertmanager

After this step, it's a good idea to enable these services so that if the Prometheus server is rebooted, they will start up as well.
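
For example:

systemctl enable prometheus && systemctl enable alertmanager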

9. Verify if Prometheus's UI is accessible

If you go to Prometheus's IP address on port 9090 (ex. 192.168.12.74:9090), you should see the UI.

prometheus-default

Now check what metrics are exposed by Prometheus for its own consumption. In your browser, open another tab and go to <IP_ADDRESS>:9090/metrics. You will see all the default metrics exposed by Prometheus.

port-9090
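
You can also sanity-check Prometheus from the command line; it exposes a simple health endpoint on the same port:

curl http://192.168.12.74:9090/-/healthy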

Looks good so far. For now, skip verifying whether Alert Manager is accessible from the UI; I'll come back to it later.


Installing Node Exporter

Now it's time to install node exporter. Download node exporter from here. Node exporter will expose a target system's metrics and allow Prometheus to scrape it. SSH to the target host, in this case App 1. My App 1 is my virtualization server, hence the ".1" IP address.

  1. Download node exporter

wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

Decompress the node exporter tar file.

tar xvf node_exporter-0.18.1.linux-amd64.tar.gz

2. Create a node exporter user

useradd --no-create-home --shell /bin/false node-exporter-user

3. Move the decompressed directory and change the ownership

mv node_exporter-0.18.1.linux-amd64 /usr/local/bin
chown -R node-exporter-user: /usr/local/bin/node_exporter-0.18.1.linux-amd64

4. Create a systemd service for node exporter

vim /etc/systemd/system/node-exporter.service

In the service file put in the following:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node-exporter-user
Group=node-exporter-user
Type=simple
ExecStart=/usr/local/bin/node_exporter-0.18.1.linux-amd64/node_exporter

[Install]
WantedBy=multi-user.target
node-exporter.service

5. Reload the systemd daemon and start the node exporter service

systemctl daemon-reload && systemctl start node-exporter
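
Just like Prometheus and Alert Manager, it's worth enabling node exporter so it comes back after a reboot:

systemctl enable node-exporter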

6. Verify the metrics are exposed

When you curl the port node exporter is listening on (9100), you should see a similar output to what you saw when you accessed the metrics for Prometheus.

curl http://localhost:9100/metrics
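
If you'd rather spot-check a single metric than scroll through the whole dump, pipe the output through grep, for example:

curl -s http://localhost:9100/metrics | grep node_load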

7. Configuring Prometheus to scrape the new target instance

Now it's time to tell Prometheus which target to scrape. Hop back into the Prometheus server and open up the Prometheus configuration file (vim /appl/prometheus-2.14.0.linux-amd64/prometheus.yml).

First, edit job_name: 'prometheus'. Under its static_configs, add ['localhost:9090'] to targets; this tells Prometheus where the metrics are located for itself.

Second, create a new job_name: and give it the value 'borg'. For the targets, give it the IP address and port that node exporter is using. In my case ['192.168.12.1:9100'].

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'borg'
    static_configs:
      - targets: ['192.168.12.1:9100']                                   
prometheus.yml

8. Restart prometheus

After making the edits to prometheus.yml go ahead and restart the Prometheus service.

systemctl restart prometheus
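
If Prometheus fails to come back up, promtool (one of the binaries symlinked earlier) can validate the configuration file:

promtool check config /appl/prometheus-2.14.0.linux-amd64/prometheus.yml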

9. Verify Prometheus can see the target host

Refresh your web browser and enter the following PromQL query: node_network_receive_bytes_total{job='borg'}.

The query looks at the bytes received by the target host; in this case my target host is named "borg".

{job='borg'} filters out other datapoints so that only datapoints belonging to borg appear.

graph
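
Since node_network_receive_bytes_total is a counter, a per-second receive rate is often more useful than the raw running total, for example:

rate(node_network_receive_bytes_total{job='borg'}[5m])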

If you need to add other targets, edit prometheus.yml, add a new job, and point Prometheus at the target by supplying its IP address and port, as in the sketch below.
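
As a sketch, assuming App 2 is also running node exporter on port 9100, the additional job would look like this:

  - job_name: 'app2'
    static_configs:
      - targets: ['192.168.12.33:9100']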


Configuring Alerts

Before starting on Grafana, I'll configure Alert Manager!

  1. Enable alert manager in the Prometheus configs

Head back into the Prometheus server and edit the prometheus.yml file. Under the # Alertmanager configuration section, change the targets to the actual IP address of alert manager. Mine is installed on the same server as Prometheus, so I will use localhost instead of the IP.

Further down, in the rules section, uncomment the line for first_rules and put in the path where the rules file will be saved. I will name mine alert.rules and place it in the /appl/rules directory.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/appl/rules/alert.rules"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'borg'
    static_configs:
      - targets: ['192.168.12.1:9100']                                                                                                               
prometheus.yml

2. Create a directory for rules

mkdir /appl/rules

3. Create the rules file

vim /appl/rules/alert.rules

Write the following alerts:

groups:
- name: borg_high_cpu
  rules:
  - alert: cpuUsage
    expr: (node_load1{job='borg'} * 100) > 95
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: High CPU!!!

This alert will fire when borg's 1-minute load average (scaled by 100 in the expression) stays above 95 for one minute, i.e. sustained high CPU load.
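
Another alert worth adding (my own sketch, not part of the original rules file) fires when a target stops responding to scrapes, using Prometheus's built-in up metric. It goes under the same rules: list:

  - alert: instanceDown
    expr: up{job='borg'} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Instance down!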

4. Give Prometheus a restart (systemctl restart prometheus) and then browse to Alert Manager on port 9093 (ex. 192.168.12.74:9093); you should see Alert Manager's UI.

alertmanager
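
One thing I haven't shown is where Alert Manager actually sends its alerts. As a minimal sketch (the SMTP host, credentials, and addresses below are placeholders for illustration), an email receiver in /appl/alertmanager-0.19.0.linux-amd64/alertmanager.yml would look something like this:

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'changeme'

route:
  receiver: 'email-me'

receivers:
- name: 'email-me'
  email_configs:
  - to: 'me@example.com'

Restart Alert Manager (systemctl restart alertmanager) after editing the file.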


Installing Grafana

Now that Prometheus is up and running, it's painful to view all of the metrics with just plain old Prometheus. I wish there was a way to view multiple metrics at once! Well, there is, and it's called Grafana!

I'll be following the official Grafana installation guide for Debian/Ubuntu.

  1. Install prerequisites

apt install adduser libfontconfig1 -y

2. Download the Grafana Debian package and install it

wget https://dl.grafana.com/oss/release/grafana_6.5.1_amd64.deb

dpkg -i grafana_6.5.1_amd64.deb

3. Open port 3000 in the firewall

iptables -I INPUT -p tcp -m tcp --dport 3000 -j ACCEPT

4. Start and enable the Grafana service

systemctl start grafana-server && systemctl enable grafana-server

5. Go to the Grafana UI and walk through the rest of the installation steps.

Open up your web browser and access Grafana by going to the server's IP address on port 3000, ex. http://192.168.12.132:3000

grafanahome

6. Creating a Prometheus data source

Select the gear icon in the left-hand column and select "Data Sources". On the next page, click the green "Add data source" button.

After that page, select "Prometheus"!

datasource

You can give it any name you want; I kept mine as "Prometheus". For the URL, put the URL of your Prometheus server (ex. http://192.168.12.74:9090).

datasource1

Click on the "Save & Test" button to verify the connection.


Creating a Dashboard

Now it's time to visualize all of those metrics from Prometheus. Click on the "+" button on the left hand side bar and select "Dashboard".

grafana

In the new page that opens, click on the "Add Query" box.

grafana

You can click on the Metrics drop-down and select node_load5. This metric will show the 5 minute load average on the CPU.

My graph has two hosts because I also added a node exporter on my Grafana server.

grafana1

If you don't want a line graph, you can change it to a different style by going to the visualization tab. Here you can also change the units for the metric or keep it as a percentage.

grafana

Once you're satisfied with the layout of your widget you can now save the dashboard!

grafana3


Time to Containerize!

Before you continue: if you're perfectly happy with the setup as it is and don't mind downloading new Prometheus and Alert Manager binaries and installing them by hand whenever you upgrade, you can stop here.

If you're a lazy efficiency-oriented individual like me and would rather have Docker do the heavy lifting, then this section is for you!

For this container setup you will need Docker CE. I'm using the latest version of Docker CE, which is version 19.03.5. If you need installation help, go here. You will also need docker-compose (I will walk you through the docker-compose install).

Creating Prometheus and Alert Manager containers:

  1. Install dependencies

Prometheus, or to be more specific Alert Manager, requires some dependencies. For Debian-based distros, install the following:

apt install build-essential libc6-dev -y

2. Install docker-compose

curl -L "https://github.com/docker/compose/releases/download/1.25.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

chmod +x /usr/local/bin/docker-compose
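
A quick version check confirms the binary is installed and executable:

docker-compose --version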

3. Download container images

Prometheus

docker pull prom/prometheus

Docker Hub repo: https://hub.docker.com/r/prom/prometheus

Alert Manager

docker pull prom/alertmanager

Docker Hub repo: https://hub.docker.com/r/prom/alertmanager/

4. Create the docker compose file

I'm going to create a single docker-compose file for Prometheus and Alert Manager. I'm choosing to combine them into one docker-compose file because Alert Manager is a component of Prometheus; if Prometheus isn't up and running, Alert Manager is useless.

version: '3.7'
services:
  prometheus:
    image: prom/prometheus
    container_name: yo_its_my_prometheus
    restart: always
    ports:
      - 9090:9090
    volumes:
      - /appl/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /appl/rules/alert.rules:/appl/rules/alert.rules
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
  alertmanager:
    image: prom/alertmanager
    container_name: yo_its_my_alertmanager
    restart: always
    ports:
      - 9093:9093
    volumes:
      - /appl/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /appl/alertmanager/data:/etc/alertmanager/data
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/etc/alertmanager/data'
    depends_on:
      - prometheus
docker-compose for prometheus & alert manager

After creating the docker compose file, give it a test run to make sure things come up okay. Note that the compose file above expects the configs at /appl/prometheus/prometheus.yml and /appl/alertmanager/alertmanager.yml, so copy your existing prometheus.yml and alertmanager.yml there (or adjust the volume mounts to point at your paths). Also make sure to stop the Prometheus and Alert Manager services before you bring up the containers.

docker-compose up
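
To confirm both containers came up, check them from another terminal in the same directory:

docker-compose ps

You should also be able to reach the Prometheus UI on port 9090 again.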

5. Edit the systemd service file

Time to edit the systemd service file for prometheus.

If you look at the current prometheus.service file, I'm invoking the prometheus binary directly.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/sbin/prometheus \
    --config.file /appl/prometheus-2.14.0.linux-amd64/prometheus.yml \
    --storage.tsdb.path /appl/prometheus-2.14.0.linux-amd64/ \
    --web.console.templates=/appl/prometheus-2.14.0.linux-amd64/consoles \
    --web.console.libraries=/appl/prometheus-2.14.0.linux-amd64/console_libraries
StandardOutput=append:/var/log/prometheus/log_out.log
StandardError=append:/var/log/prometheus/log_err.log

[Install]
WantedBy=multi-user.target
previous prometheus.service file

In the new prometheus.service file, I'm invoking docker-compose instead, and WorkingDirectory specifies the directory where the docker-compose file lives. (Note that the prometheus user will need permission to talk to the Docker daemon, e.g. membership in the docker group.)

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
WorkingDirectory=/appl/prom-cont/
ExecStart=/usr/local/bin/docker-compose up
StandardOutput=append:/var/log/prometheus/log_out.log
StandardError=append:/var/log/prometheus/log_err.log

[Install]
WantedBy=multi-user.target
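
One optional addition of my own (not required): an ExecStop line in the [Service] section so that systemctl stop prometheus also tears the containers down cleanly.

ExecStop=/usr/local/bin/docker-compose down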

6. Verify Prometheus is up again

Reload the systemd daemon and then start up prometheus. If you open your browser and go to Prometheus's URL you should be greeted by the UI once again.

systemctl daemon-reload

systemctl start prometheus

Conclusion

Whew! That was quite lengthy, but it's a great way to learn the inner workings of Prometheus. You may have noticed that Grafana is not running in a container. I'm planning to convert to the Grafana docker image later on, but I'm fine using "vanilla" Grafana for now.

Like I mentioned earlier, the advantage of using containers is the ease of upgrades. If you need to upgrade Prometheus, all you have to do is pull the latest image from Docker Hub and then restart the prometheus service.
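
For example, from the directory containing the compose file, an upgrade boils down to something like:

docker-compose pull && systemctl restart prometheus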