For some time, we have been using PostgreSQL’s hot standby replication feature in both our staging and production environments. Currently, the hot standby serves three functions:
Standby server for maximum uptime if the master fails.
Disaster recovery if the master fails completely.
Read-only batch operations like taking nightly backups.
All three of these functions are critical to the safety of our data, so we need to be sure that the master and slave are properly communicating at all times. We use Monit and M/Monit for most of our application and server monitoring. Monit is a daemon that runs on each of our servers and performs checks at regular intervals. M/Monit is a centralized dashboard and alert service to which all of the Monit instances report. To help ensure that we get alerts even if our network is completely offline, our M/Monit host runs on AWS.
Because replication is so important, I have taken a belt-and-suspenders approach to monitoring the replication lag: Monit checks the replication status on both the master and the slave servers. The approach uses Monit’s check program functionality to run a simple Python script. If the script exits with an error (non-zero) status, Monit sends an alert to our M/Monit server, which then sends email and Slack notifications to us.
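The exit-status contract described above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual script: the names evaluate, report, and MAX_LAG_BYTES, the 16 MB threshold, and the specific non-zero status values are all assumptions chosen for the example.

```python
import sys

# Hypothetical threshold: alert once the standby is more than 16 MB behind.
MAX_LAG_BYTES = 16 * 1024 * 1024


def evaluate(in_recovery, lag_bytes, max_lag_bytes=MAX_LAG_BYTES):
    """Return (exit_status, message) for a standby-side check.

    Exit status 0 means healthy. Any non-zero status is what a Monit
    program check can treat as a failure and alert on.
    """
    if not in_recovery:
        return 2, "Server is not in recovery; it is not acting as a standby"
    if lag_bytes > max_lag_bytes:
        return 1, "Standby is behind master by %d bytes" % lag_bytes
    return 0, "Replication OK (%d bytes behind)" % lag_bytes


def report(status, message):
    """Write the message and return the status to hand to sys.exit()."""
    stream = sys.stdout if status == 0 else sys.stderr
    stream.write(message + "\n")
    return status
```

A real check script would end with something like sys.exit(report(*evaluate(in_recovery, lag_bytes))), with both values read from the database. Keeping the decision in a plain function makes it easy to test without a running PostgreSQL instance.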
This script queries the database to confirm that it is in the right state (recovery). It also queries the current xlog position from the master and compares it to the last replay location of the slave.
print("Slave server replication is behind master by %f bytes" % slaveXlogDiffBytes)
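To make the position comparison concrete, here is a hedged sketch of how two xlog locations could be turned into a byte difference on the client side. It assumes the locations arrive as text in the form PostgreSQL prints them (two hexadecimal numbers separated by a slash) and that the server is version 9.3 or later, where the two halves form a single 64-bit position. On older releases, the server-side pg_xlog_location_diff() function is the safer way to do the subtraction. The helper names lsn_to_bytes and lag_bytes are made up for this example.

```python
def lsn_to_bytes(lsn):
    """Convert an xlog location such as '16/B374D848' to a byte offset.

    The part before the slash is the high 32 bits of the position and
    the part after it is the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(master_lsn, standby_lsn):
    """Return how many bytes the standby is behind the master."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)


# Example: the standby has replayed up to a point 2120 bytes short
# of the master's current write position.
print(lag_bytes("16/B374D848", "16/B374D000"))  # 2120
```

The result plays the same role as the slaveXlogDiffBytes value printed by the script.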
You may wonder why I chose Python instead of Bash or my usual favorite, Node.js. Python is installed in our base server image, while Node is not, and I want to keep our database servers as stock as possible. I chose Python over Bash because I find that Bash scripts are brittle and difficult to debug.