This invaluable script does various sorts of analysis of the state of a Slony-I cluster. Slony-I Section 2 recommends running it frequently (hourly seems suitable) to find problems as early as possible.
You specify arguments including database, host, user, cluster, password, and port to connect to any of the nodes on a cluster.
You also specify a mailprog command (which should be a program equivalent to Unix mailx) and a recipient of email.
You may alternatively specify database connection parameters via the environment variables used by libpq, e.g. PGPORT, PGDATABASE, PGUSER, PGSERVICE, and so on.
The script then rummages through sl_path to find all of the nodes in the cluster, along with the DSNs that allow it to connect to each of them in turn.
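The node discovery step can be sketched in shell as follows. This is only an illustration, not the script's own code: it assumes the cluster schema is the cluster name prefixed with an underscore, and the helper names node_dsn_query and list_node_dsns are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of the Slony-I tools.

# Print the SQL that lists each node id together with the DSNs
# recorded for reaching it.
# $1 = cluster name (assumed schema: the cluster name prefixed with "_").
node_dsn_query() {
    printf 'SELECT DISTINCT pa_server, pa_conninfo FROM "_%s".sl_path ORDER BY pa_server;' "$1"
}

# Run the query with psql. Connection parameters are taken from the
# libpq environment variables (PGHOST, PGPORT, PGDATABASE, PGUSER, ...).
# Output: one "node_id|conninfo" line per path entry.
list_node_dsns() {
    psql -qAt -c "$(node_dsn_query "$1")"
}
```

With the libpq environment variables set, calling list_node_dsns mycluster would print one line per known DSN, which a wrapper could then loop over to visit each node.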
For each node, the script examines the state of things, including such things as:
Checking sl_listen for some "analytically determinable" problems. It lists paths that are not covered.
Providing a summary of events by origin node
If a node hasn't submitted any events in a while, that likely suggests a problem.
Summarizing the "aging" of table sl_confirm
If one or another of the nodes in the cluster hasn't reported back recently, that tends to lead to cleanups of tables like sl_log_1, sl_log_2 and sl_seqlog not taking place.
Summarizing which transactions have been running for a long time
This only works properly if the statistics collector is configured to collect command strings, as controlled by the option stats_command_string = true in postgresql.conf.
If you have broken applications that hold connections open, this will find them. Such connections have several unsalutary effects, as described in the FAQ.
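As an illustration of the sl_confirm "aging" check, here is a minimal sketch. It is not taken from the script: the helper name flag_stale_confirms and the threshold value are hypothetical, and the input is assumed to come from a query over the con_origin, con_received, and con_timestamp columns of sl_confirm.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the Slony-I tools.
#
# Expected input on stdin, one line per (origin, receiver) pair:
#   origin|receiver|age_seconds
# which could be produced with something like:
#   psql -qAt -c 'SELECT con_origin, con_received,
#       cast(extract(epoch from now() - max(con_timestamp)) as int8)
#       FROM "_mycluster".sl_confirm GROUP BY con_origin, con_received;'
#
# $1 = threshold in seconds; pairs older than this are reported.
flag_stale_confirms() {
    awk -F'|' -v limit="$1" '
        $3 + 0 > limit + 0 {
            printf "node %s has not confirmed events from node %s for %d seconds\n", $2, $1, $3
        }'
}
```

Piping the query output into flag_stale_confirms 1800 would list only the node pairs whose newest confirmation is more than half an hour old.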
The script does some diagnosis work based on parameters in the script; if you don't like the values, pick your favorites!
The script in the tools directory called psql_replication_check.pl represents some of the best answers arrived at in attempts to build replication tests to plug into the Nagios system monitoring tool.
A former script, test_slony_replication.pl, took a "clever" approach where a "test script" is periodically run, which rummages through the Slony-I configuration to find origin and subscribers, injects a change, and watches for its propagation through the system. It had two problems:
Connectivity problems to the single host where the test ran would make it look as though replication was destroyed. Overall, this monitoring approach has been fragile to numerous error conditions.
Nagios has no ability to benefit from the "cleverness" of automatically exploring the set of nodes. You need to set up a Nagios monitoring rule for each and every node being monitored.
The new script, psql_replication_check.pl,
takes the minimalist approach of assuming that the system is an online
system that sees regular "traffic," so that you can
define a view specifically for the replication test called
replication_status which is expected to see regular
updates. The view simply looks for the youngest
"transaction" on the node, and lists its timestamp, age,
and some bit of application information that might seem useful to see.
In an inventory system, that might be the order number for the most recently processed order.
In a domain registry, that might be the name of the most recently created domain.
An instance of the script will need to be run for each node that is to be monitored; that is the way Nagios works.
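To show how such a per-node check fits Nagios, here is a minimal sketch of the threshold logic. It is not the code of psql_replication_check.pl: the function name check_replication_age and the thresholds are hypothetical. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the Slony-I tools.
#
# Map the age (in seconds) of the newest row seen through the
# replication_status view to a Nagios status line and return code.
# $1 = age in seconds, $2 = warning threshold, $3 = critical threshold.
check_replication_age() {
    age="$1"; warn="$2"; crit="$3"
    case "$age" in
        ''|*[!0-9]*)
            # Empty or non-numeric input, e.g. because the query failed.
            echo "UNKNOWN - could not determine replication age"
            return 3 ;;
    esac
    if [ "$age" -ge "$crit" ]; then
        echo "CRITICAL - newest transaction is ${age}s old"
        return 2
    elif [ "$age" -ge "$warn" ]; then
        echo "WARNING - newest transaction is ${age}s old"
        return 1
    fi
    echo "OK - newest transaction is ${age}s old"
    return 0
}
```

The age itself would be fetched from the node being monitored, for instance with psql -qAt against the replication_status view; the name of the column holding the age depends on how you define the view.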
One user reported on the Slony-I mailing list how to configure mrtg - Multi Router Traffic Grapher to monitor Slony-I replication.
... Since I use mrtg to graph data from multiple servers I use snmp (net-snmp to be exact). On database server, I added the following line to snmpd configuration:
exec replicationLagTime /cvs/scripts/snmpReplicationLagTime.sh 2
where /cvs/scripts/snmpReplicationLagTime.sh looks like this:
#!/bin/bash
# $1 is the node number whose lag time is reported
/home/pgdba/work/bin/psql -U pgdba -h 127.0.0.1 -p 5800 -d _DBNAME_ -qAt -c \
"select cast(extract(epoch from st_lag_time) as int8) FROM _irr.sl_status
WHERE st_received = $1"
Then, in mrtg configuration, add this target:
Target[db_replication_lagtime]:extOutput.3&extOutput.3:public@db::30:::
MaxBytes[db_replication_lagtime]: 400000000
Title[db_replication_lagtime]: db: replication lag time
PageTop[db_replication_lagtime]: <H1>db: replication lag time</H1>
Options[db_replication_lagtime]: gauge,nopercent,growright
Alternatively, Ismail Yenigul points out how he managed to monitor slony using MRTG without installing SNMPD.
Here is the mrtg configuration
Target[db_replication_lagtime]:`/bin/snmpReplicationLagTime.sh 2`
MaxBytes[db_replication_lagtime]: 400000000
Title[db_replication_lagtime]: db: replication lag time
PageTop[db_replication_lagtime]: <H1>db: replication lag time</H1>
Options[db_replication_lagtime]: gauge,nopercent,growright
and here is the modified version of the script
# cat /bin/snmpReplicationLagTime.sh
#!/bin/bash
# $1 is the node number whose lag time is reported
output=`/usr/bin/psql -U slony -h 192.168.1.1 -d endersysecm -qAt -c \
"select cast(extract(epoch from st_lag_time) as int8) FROM _mycluster.sl_status WHERE st_received = $1"`
echo $output
echo $output
echo
echo
# end of script#
Note: MRTG expects four lines from the script, and since there are only two lines provided, the output must be padded to four lines.
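The four-line format can be made explicit with a small helper. This is a sketch rather than part of the reported configuration: the function name mrtg_report is hypothetical. MRTG reads the first two lines as the two values to graph, the third as an uptime string, and the fourth as the target name; the last two may be left empty.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the Slony-I tools.
#
# Print the four lines MRTG reads from an external program:
#   line 1: first value, line 2: second value,
#   line 3: uptime string (left empty here), line 4: target name.
# $1 = lag time in seconds, $2 = optional target name.
mrtg_report() {
    printf '%s\n%s\n%s\n%s\n' "$1" "$1" "" "${2:-}"
}
```

A script like the one above could then end with mrtg_report "$output" db instead of the four echo commands.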
This script is constructed to search for Slony-I log files at a given path (LOGHOME), based on the naming conventions used by both the Section 19.3 and Section 19.1.20 systems for launching slon(1) processes.
Errors, if found, are listed, by log file, and emailed to the
specified user (LOGRECIPIENT); if no email address is
specified, output goes to standard output.
LOGTIMESTAMP allows overriding what hour to
evaluate (rather than the last hour).
An administrator might run this script once an hour to monitor for replication problems.
The script mkmediawiki.pl, in tools, may be used to generate a cluster summary compatible with the popular MediaWiki software. Note that the --categories option permits the user to specify a set of (comma-delimited) categories with which to associate the output. If you have a series of Slony-I clusters, passing in the option --categories=slony1 leads to the MediaWiki instance generating a category page listing all Slony-I clusters so categorized on the wiki.
The gentle user might use the script as follows:
~/logtail.en> mvs login -d mywiki.example.info -u "Chris Browne" -p `cat ~/.wikipass` -w wiki/index.php
Doing login with host: logtail and lang: en
~/logtail.en> perl $SLONYHOME/tools/mkmediawiki.pl --host localhost --database slonyregress1 --cluster slony_regress1 --categories=Slony-I > Slony_replication.wiki
~/logtail.en> mvs commit -m "More sophisticated generated Slony-I cluster docs" Slony_replication.wiki
Doing commit Slony_replication.wiki with host: logtail and lang: en
Note that mvs is a client written in Perl; on Debian GNU/Linux, the relevant package is called libwww-mediawiki-client-perl; other systems may have a packaged version of this under some similar name.