Mastering Prometheus: Recording Rules and Sending Email Alerts Using AlertManager

@Harsh
7 min read · Jul 22, 2024


Prometheus is a powerful tool for monitoring and alerting, but to get the most out of it, you need to understand how to use recording rules and alerting rules effectively. This post explains what these rules are, why they are necessary, and how they benefit your Prometheus environment, then walks through practical examples of setting them up.

Understanding Recording Rules

What are Recording Rules?

Recording rules in Prometheus allow you to precompute frequently needed or computationally expensive queries and store their results as new time series. This means that instead of running the same complex query repeatedly, you can query a precomputed result, which is faster and more efficient.

Evaluation Time

Recording rules are evaluated at a regular interval set in the Prometheus configuration. This evaluation interval determines how often the precomputed results are updated, so a well-chosen value balances data freshness against the load on the server.
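As a rough sketch of where that interval lives, assuming a stock prometheus.yml: the global evaluation_interval sets the default for every rule group (Prometheus falls back to 1m when it is unset), and a single group can override it with its own interval field. The 30s and 2m values, the group name, and the record name below are placeholders chosen for illustration, not recommendations.

```yaml
# --- file 1: prometheus.yml (fragment) ---
global:
  evaluation_interval: 30s        # default cadence for all rule groups (illustrative value)

---
# --- file 2: a rules file (fragment) ---
groups:
  - name: slow-moving-aggregates  # hypothetical group name
    interval: 2m                  # overrides the global evaluation_interval for this group only
    rules:
      - record: instance:node_cpu_seconds:rate5m   # hypothetical record name
        expr: sum without(cpu) (rate(node_cpu_seconds_total[5m]))
```

The two parts are separate files shown together for brevity. A shorter interval gives fresher recorded series at the cost of more frequent evaluation work.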

Why are Recording Rules Necessary?

Recording rules are essential for optimizing the performance of your Prometheus setup. As your environment grows, the number of metrics and the complexity of queries increase. Running complex queries in real time can become slow and resource-intensive. Recording rules solve this problem by:

  • Reducing Query Load: Precomputing results reduces the load on the Prometheus server.
  • Improving Query Speed: Queries that use precomputed results are significantly faster.
  • Enhancing Reliability: Reduces the risk of query timeouts or failures during critical periods.

Benefits of Recording Rules in Prometheus

  • Efficiency: By precomputing and storing results, recording rules ensure that your system remains efficient and responsive.
  • Scalability: As your monitored environment scales, recording rules help manage the increased load by offloading complex computations.
  • Simplicity: Simplifies querying for frequently used metrics by providing a straightforward time series to query against.

Practical: Setting Up Recording Rules

To set up the Prometheus server and a target node, follow this guide:

Prometheus 101: Metrics, Monitoring, Practical Setup and More 🔗

1. Define Recording Rules:

  • Create a separate file named recording_rules.yml and define your complex queries there.
# recording_rules.yml

groups:
  - name: Prod-rule-group-1
    rules:
      # Total memory used, in GiB (MemTotal - MemFree, so buffers/cache count as used)
      - record: prod:node_memory:used_gb
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / 1024 / 1024 / 1024
      # Status of the node (1 = up, 0 = down)
      - record: prod:up_info:QAteam
        expr: up{Group="QATeam"}

  - name: Test-rule-group-1
    rules:
      # Average idle CPU rate per instance over the last 2 minutes
      - record: test:cpu_seconds_total:avg_idle
        expr: avg without(cpu,mode) (rate(node_cpu_seconds_total{mode="idle", Project="website"}[2m]))
  • In Prometheus, groups are used to logically organize and manage a set of recording or alerting rules within a configuration file. Grouping rules can make it easier to maintain, understand, and control the evaluation intervals of related rules.
  • These rules are only examples for testing purposes.

2. Configure Prometheus:

  • Update your prometheus.yml to include the new recording rules file.
  • Now start Prometheus, or reload its configuration if it is already running.
# Start Prometheus in the background
./prometheus &

# Reload the configuration and rule files of a running Prometheus (SIGHUP)
kill -HUP `pgrep prometheus`
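The prometheus.yml change mentioned in this step can be sketched as follows. This is a minimal fragment, assuming the rules file sits in the same directory as prometheus.yml; relative paths in rule_files are resolved against the directory of the configuration file.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "recording_rules.yml"   # assumed to sit next to prometheus.yml
```

The rest of prometheus.yml (global settings, scrape_configs) stays as it is.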

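3. Verify the status of Rules:

  • Open the Status > Rules page in the Prometheus UI (the /rules path) and check that both rule groups are listed and that each rule reports a healthy state.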

4. Query Precomputed Results:

  • You can now query the precomputed series prod:node_memory:used_gb directly in Prometheus. It returns the memory used by the node in GB (strictly GiB, since the expression divides by 1024 three times).

Understanding Alerting Rules

What are Alerting Rules?

Alerting rules in Prometheus define conditions under which an alert should be triggered. These conditions are based on Prometheus queries, and when met, the alert is sent to the Alertmanager, which then manages the notifications.

Why are Alerting Rules Necessary?

Alerting rules are critical for proactive monitoring. They help in identifying issues in real-time and notifying the appropriate personnel to take corrective action. Key reasons for their necessity include:

  • Proactive Monitoring: Helps in detecting issues before they become critical.
  • Reduced Downtime: Early detection and alerting can significantly reduce system downtime.
  • Automated Response: Alerts can trigger automated remediation actions, minimizing manual intervention.

Benefits of Alerting Rules in Prometheus

  • Real-time Alerts: Provides immediate notifications of issues, enabling quick response.
  • Customizable: Alerts can be finely tuned to match specific conditions and thresholds.
  • Integrated: Works seamlessly with Alertmanager to deliver alerts via various channels (email, Slack, etc.).

Practical: Setting Up Alerting Rules

Step 1: Define Alerting Rules

  • Create a new file alerting_rules.yml and define your alerting rules.
  • Alerting rules can also live in recording_rules.yml, but keeping them in a separate file is easier to maintain.
groups:
  - name: alerting-rules
    rules:
      - alert: alert_node_down
        expr: prod:up_info:QAteam == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is Down"
          description: "{{ $labels.instance }} of group {{ $labels.Group }} is down for more than 1 minute. It needs immediate action"
  • This alert checks whether any node is down. If a node stays down for more than 1 minute, Prometheus fires the alert and sends it to Alertmanager, which sends the notification.
  • Alerts should be configured at two severity levels, warning and critical: warning should be used for alerts which will be read by a human, but do not require immediate action. Critical should be used for alerts which will immediately interrupt a human and require immediate action.
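To illustrate the warning level, here is a minimal sketch of a second rule that builds on the prod:node_memory:used_gb series recorded earlier. The group name, alert name, 6 GiB threshold, and 5m duration are all hypothetical values picked for the example; choose thresholds that fit your own nodes.

```yaml
groups:
  - name: alerting-rules-warning            # hypothetical group name
    rules:
      - alert: alert_node_memory_high       # hypothetical alert name
        expr: prod:node_memory:used_gb > 6  # placeholder threshold
        for: 5m                             # must hold for 5 minutes before firing
        labels:
          severity: warning                 # a human should read it, no immediate action needed
        annotations:
          summary: "High memory use on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been using {{ $value }} GiB of memory for more than 5 minutes."
```

The severity label is an ordinary label, so Alertmanager can later route warning and critical alerts differently if you want it to.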

Step 2: Configure Prometheus:

  • Update your prometheus.yml to include the new alerting rules file.
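With both files in place, the rule_files section might look like the fragment below. It assumes both rule files sit in the same directory as prometheus.yml.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"
```

Entries in rule_files are glob patterns, so a single pattern covering a dedicated rules directory is another option if the list grows.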

Step 3: Setup AlertManager

  1. Install Alertmanager: Download and install Alertmanager from the Prometheus website.
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xvzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64

2. Configure Prometheus to Use Alertmanager: Update prometheus.yml to point to the Alertmanager instance, then reload Prometheus so it picks up the change.

kill -HUP `pgrep prometheus`
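The prometheus.yml change for this step can be sketched as follows. The target localhost:9093 is an assumption: it presumes Alertmanager runs on the same host as Prometheus and listens on its default port 9093. Adjust the address to match your setup.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed: Alertmanager on the same host, default port
```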

3. Configure Alertmanager: Create a configuration file named alertmanager.yml. If one already exists, replace its contents with the following.

route:
  receiver: 'Mail-Alert'

receivers:
  - name: 'Mail-Alert'
    email_configs:
      - smarthost: 'smtp.gmail.com:587'
        auth_username: '<Your-Mail-Id>'
        auth_password: '<Gmail-App-Password>'
        from: '<Your-Mail-Id>'
        to: '<To-Whom-You-Want-To-Notify>'
        headers:
          subject: 'ALERT......'
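The route block also controls how Alertmanager batches and repeats notifications. As a sketch, the fragment below writes out the timing options explicitly; the durations shown are the documented defaults, and the group_by labels are an illustrative choice rather than something this setup requires.

```yaml
# alertmanager.yml (fragment: an expanded version of the route block)
route:
  receiver: 'Mail-Alert'
  group_by: ['alertname', 'instance']  # illustrative: one notification per alert name and instance
  group_wait: 30s                      # wait before sending the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts added to a group
  repeat_interval: 4h                  # wait before re-sending a notification that is still firing
```

The receivers section stays unchanged. Shortening repeat_interval makes reminder emails more frequent while an alert keeps firing.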

4. Start Alertmanager: Run Alertmanager with the configuration file.

./alertmanager --config.file=alertmanager.yml

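Step 4: Verification

  • Open the Alertmanager UI (port 9093 by default) and check that its Status page shows the configuration you just wrote. In the Prometheus UI, the Alerts page should list alert_node_down.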

Practical Example: Testing Alerting Rules

  1. Trigger an Alert: Simulate a condition that triggers the alert, such as deliberately stopping the node exporter (Target).
kill -9 `pgrep node_exporter`

2. Verify Alert in Prometheus: Check the alerts page in Prometheus UI to see the active alert.

  • Once the rule detects the node as down, the alert enters the Pending state. If the condition still holds after 1 minute (the for: 1m setting), it moves to the Firing state.

3. Check Alertmanager: Verify that the alert has been received by Alertmanager and the notification has been sent to the configured email address.

Conclusion

Recording rules and alerting rules are fundamental features of Prometheus that enhance its efficiency and effectiveness. Recording rules optimize query performance by precomputing frequently needed results, while alerting rules provide real-time notifications of issues. Together, they enable a robust monitoring and alerting system that can scale with your environment.

By implementing recording and alerting rules, you can ensure your Prometheus setup remains efficient, responsive, and capable of providing timely notifications, helping you maintain the health and performance of your infrastructure.
