Christopher B. Browne's Home Page
cbbrowne@acm.org

3.3. Doing switchover and failover with Slony-I

3.3.1. Foreword

Slony-I is an asynchronous replication system. Because of that, it is almost certain that at the moment the current origin of a set fails, the final transactions committed at the origin will not yet have propagated to the subscribers. Systems are particularly likely to fail under heavy load; that is one of the corollaries of Murphy's Law. Therefore the principal goal is to prevent the main server from failing. The best way to do that is frequent maintenance.

Opening the case of a running server is not exactly what we should consider a "professional" way to do system maintenance. And interestingly, those users who found it valuable to use replication for backup and failover purposes are the very ones that have the lowest tolerance for terms like "system downtime." To help support these requirements, Slony-I not only offers failover capabilities, but also the notion of controlled origin transfer.

This document assumes that the reader is familiar with the slonik(1) utility and knows at least how to set up a simple two-node replication system with Slony-I.

3.3.2. Controlled Switchover

We assume a current "origin" as node1 with one "subscriber" (that is, a slave) as node2. A web application on a third server is accessing the database on node1. Both databases are up and running and replication is more or less in sync. We perform a controlled switchover using SLONIK MOVE SET(7).
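
A switchover of this kind is usually driven by a short slonik(1) script. The following Python sketch only renders the text of such a script; the function name `switchover_script` and the set ID of 1 are assumptions made for illustration, and the usual slonik preamble (cluster name and admin conninfo lines) is assumed to be supplied separately. Applications should be stopped or pointed away from node1 before the script is run.

```python
def switchover_script(set_id: int, old_origin: int, new_origin: int) -> str:
    """Render slonik commands for a controlled switchover of one set.

    Hypothetical helper: it produces text only and does not run slonik.
    """
    if old_origin == new_origin:
        raise ValueError("old and new origin must differ")
    commands = [
        # Lock the set on the current origin in preparation for the move.
        f"lock set (id = {set_id}, origin = {old_origin});",
        # Hand the origin role over to the subscriber.
        f"move set (id = {set_id}, old origin = {old_origin}, "
        f"new origin = {new_origin});",
        # Wait until the new origin has confirmed the old origin's events.
        f"wait for event (origin = {old_origin}, confirmed = {new_origin}, "
        f"wait on = {old_origin});",
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    # node1 is the current origin, node2 the subscriber taking over.
    print(switchover_script(set_id=1, old_origin=1, new_origin=2), end="")
```

Running the same function with the node numbers swapped gives the script for moving the origin back once maintenance is complete.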

You may now simply shut down the server hosting node1 and do whatever is required to maintain the server. When the slon(1) process for node1 is restarted later, it will start replicating again, and soon catch up. At this point the procedure to switch origins is executed again to restore the original configuration.

This is the preferred way to handle things; it runs quickly, under control of the administrators, and there is no need for there to be any loss of data.

After performing the configuration change, you should run the scripts described in Section 4.3.1 to validate that the cluster state remains in good order.

3.3.3. Failover

If some more serious problem occurs on the "origin" server, it may be necessary to fail over to a backup server using SLONIK FAILOVER(7). This is a highly undesirable circumstance, as transactions "committed" on the origin, but not applied to the subscribers, will be lost. You may have reported these transactions as "successful" to outside users. As a result, failover should be considered a last resort. If the "injured" origin server can be brought up to the point where it can limp along long enough to do a controlled switchover, that is greatly preferable.

Slony-I does not provide any automatic detection for failed systems. Abandoning committed transactions is a business decision that cannot be made by a database system. If someone wants to put the commands below into a script executed automatically from the network monitoring system, well ... it's your data, and it's your failover policy.
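
For the two-node case, the slonik(1) commands involved are few. The Python sketch below renders them as text; the function name `failover_script` and the `drop_failed` flag are assumptions for illustration, and the slonik preamble (cluster name and admin conninfo lines) is assumed to be prepended separately. Dropping the failed node is shown as an optional later step, to be run only once the failover has completed and the application has been reconfigured to use the backup node.

```python
def failover_script(failed_node: int, backup_node: int,
                    drop_failed: bool = False) -> str:
    """Render slonik commands to fail over from a dead origin.

    Hypothetical helper: it produces text only and does not run slonik.
    """
    if failed_node == backup_node:
        raise ValueError("backup node must differ from the failed node")
    commands = [
        # Promote the backup node to origin for the failed node's sets.
        f"failover (id = {failed_node}, backup node = {backup_node});",
    ]
    if drop_failed:
        # Remove the failed node from the cluster configuration.
        commands.append(
            f"drop node (id = {failed_node}, event node = {backup_node});"
        )
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    # node1 has failed; node2 becomes the new origin.
    print(failover_script(failed_node=1, backup_node=2, drop_failed=True), end="")
```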

3.3.4. Failover With Complex Node Set

Failover is relatively "simple" if there are only two nodes; if a Slony-I cluster comprises many nodes, achieving a clean failover requires careful planning and execution.

Consider the following diagram describing a set of six nodes at two sites.

Let us assume that nodes 1, 2, and 3 reside at one data centre, and that we find ourselves needing to perform failover due to failure of that entire site. Causes could range from a persistent loss of communications to the physical destruction of the site; the cause is not actually important, as what we are concerned about is how to get Slony-I to properly fail over to the new site.

We will further assume that node 5 is to be the new origin, after failover.

The sequence of Slony-I reconfiguration required to properly fail over this sort of node configuration is as follows:

3.3.5. Automating FAIL OVER

If you do choose to automate FAIL OVER, it is important to do so carefully. You need good assurance that the failed node is well and truly failed, and you need to be able to ensure that the failed node will not accidentally return to service, which would leave two nodes able to respond in a "master" role.
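
One way to encode those two requirements in an automation script is to demand several consecutive failed health probes before acting, and to refuse to fail over unless the failed node has first been fenced off. The Python sketch below illustrates only that ordering; `confirmed_down`, `automated_failover`, and the `fence` and `fail_over` callbacks are hypothetical names, and nothing here is part of Slony-I itself.

```python
from typing import Callable, Sequence


def confirmed_down(probe_results: Sequence[bool], required_failures: int) -> bool:
    """Return True only if the last `required_failures` probes all failed.

    Each entry of probe_results is True for a healthy probe, False for a
    failed one, oldest first.
    """
    if required_failures <= 0:
        raise ValueError("required_failures must be positive")
    recent = probe_results[-required_failures:]
    return len(recent) == required_failures and not any(recent)


def automated_failover(probe_results: Sequence[bool],
                       required_failures: int,
                       fence: Callable[[], bool],
                       fail_over: Callable[[], None]) -> str:
    """Fail over only after the node is confirmed down and fenced."""
    if not confirmed_down(probe_results, required_failures):
        return "no action"
    # Fence first, so the old origin cannot come back as a second "master".
    if not fence():
        return "fencing failed; not failing over"
    fail_over()
    return "failed over"


if __name__ == "__main__":
    history = [True, False, False, False]  # three consecutive failures
    print(automated_failover(history, 3,
                             fence=lambda: True,
                             fail_over=lambda: None))
```

Requiring the fencing step to succeed before `fail_over` is called is the design choice that matters here: if the node cannot be isolated, a human should decide what happens next.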

Note

The problem here requiring that you "shoot the failed node in the head" is not fundamentally about replication or Slony-I; Slony-I handles this all reasonably gracefully, as once the node is marked as failed, the other nodes will "shun" it, effectively ignoring it. The problem is instead with your application. Supposing the failed node can come back up sufficiently that it can respond to application requests, that is likely to be a problem, and one that hasn't anything to do with Slony-I. The trouble is if there are two databases that can respond as if they are "master" systems.

Therefore, when failover occurs, there needs to be a mechanism to forcibly knock the failed node off the network in order to prevent applications from getting confused. This could take place via an SNMP interface that does some combination of the following:

3.3.6. After Failover, Reconfiguring Former Origin

What happens to the failed node will depend somewhat on the nature of the catastrophe that led to needing to fail over to another node. If the node had to be abandoned because of physical destruction of its disk storage, there will likely not be anything of interest left. On the other hand, a node might be abandoned due to the failure of a network connection, in which case the former "provider" can appear to be functioning perfectly well. Nonetheless, once communications are restored, the fact of the FAIL OVER makes it mandatory that the failed node be abandoned.

After the above failover, the data stored on node 1 will rapidly become increasingly out of sync with the rest of the nodes, and must be treated as corrupt. Therefore, the only way to get node 1 back and transfer the origin role back to it is to rebuild it from scratch as a subscriber, let it catch up, and then follow the switchover procedure.

A good reason not to do this automatically is the fact that important updates (from a business perspective) may have been committed on the failing system. You probably want to analyze the last few transactions that made it into the failed node to see if some of them need to be reapplied on the "live" cluster. For instance, if someone was entering bank deposits affecting customer accounts at the time of failure, you wouldn't want to lose that information.

Warning

It has been observed that there can be some very confusing results if a node is "failed" due to a persistent network outage as opposed to failure of data storage. In such a scenario, the "failed" node has a database in perfectly fine form; it is just that since it was cut off, it "screams in silence."

If the network connection is repaired, that node could reappear, and as far as its configuration is concerned, all is well, and it should communicate with the rest of its Slony-I cluster.

In fact, the only confusion taking place is on that node. The other nodes in the cluster are not confused at all; they know that this node is "dead," and that they should ignore it. But there is no way to know this by looking at the "failed" node.

This points back to the design point that Slony-I is not a network monitoring tool. You need to have clear methods of communicating to applications and users what database hosts are to be used. If those methods are lacking, adding replication to the mix will worsen the potential for confusion, and failover will be a point at which there is enormous potential for confusion.

If the database is very large, it may take many hours to recover node1 as a functioning Slony-I node; that is another reason to consider failover as an undesirable "final resort."

3.3.7. Planning for Failover

Failover policies should be planned for ahead of time.

Most pointedly, any node that is expected to be a failover target must have its subscription(s) set up with the option FORWARD = YES. Otherwise, that node is not a candidate for being promoted to origin node.
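
As an illustration, the Python sketch below renders a slonik(1) SUBSCRIBE SET command with forwarding enabled and picks out which subscribers would qualify as failover targets. The function names and the node and set numbers are assumptions for the example; the slonik preamble (cluster name and admin conninfo lines) is assumed to be supplied separately.

```python
from typing import Dict, List


def subscribe_script(set_id: int, provider: int, receiver: int,
                     forward: bool = True) -> str:
    """Render a slonik SUBSCRIBE SET command (text only)."""
    flag = "yes" if forward else "no"
    return (f"subscribe set (id = {set_id}, provider = {provider}, "
            f"receiver = {receiver}, forward = {flag});\n")


def failover_candidates(forwarding: Dict[int, bool]) -> List[int]:
    """Return the node ids whose subscriptions were set up to forward.

    `forwarding` maps subscriber node id to its FORWARD setting.
    """
    return sorted(node for node, fwd in forwarding.items() if fwd)


if __name__ == "__main__":
    # node2 subscribes to set 1 from node1 and keeps forwarding data.
    print(subscribe_script(set_id=1, provider=1, receiver=2), end="")
    # Only node 2 forwards, so only node 2 could be promoted.
    print(failover_candidates({2: True, 3: False}))
```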

Planning may simply mean deciding on priority lists of which node should fail over to which, as opposed to trying to automate the process. But knowing what to do ahead of time cuts down on the number of mistakes made.

At Afilias, a variety of internal [The 3AM Unhappy DBA's Guide to...] guides have been created to provide checklists of what to do when certain "unhappy" events take place. This sort of material is highly specific to the environment and the set of applications running there, so you would need to generate your own such documents. This is one of the vital components of any disaster recovery preparations.
