CLUSTER FAILOVER

redis and cluster | Mar 13, 2016 • Ding Jiao

This command, that can only be send to a Redis Cluster slave node, forces the slave to start a manual failover of its master instance.

A manual failover is a special kind of failover that is usually executed when there are no actual failures, but we wish to swap the current master with one of its slaves (which is the node we send the command to), in a safe way, without any window for data loss. It works in the following way:

  1. The slave tells the master to stop processing queries from clients.
  2. The master replies to the slave with the current replication offset.
  3. The slave waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues.
  4. The slave starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcast the new configuration.
  5. The old master receives the configuration update: unblocks its clients and start replying with redirection messages so that they’ll continue the chat with the new master.

This way clients are moved away from the old master to the new master atomically and only when the slave that is turning in the new master processed all the replication stream from the old master.

FORCE option: manual failover when the master is down

The command behavior can be modified by two options: FORCE and TAKEOVER.

If the FORCE option is given, the slave does not perform any handshake with the master, that may be not reachable, but instead just starts a failover ASAP starting from point 4. This is useful when we want to start a manual failover while the master is no longer reachable.

However using FORCE we still need the majority of masters to be available in order to authorize the failover and generate a new configuration epoch for the slave that is going to become master.

TAKEOVER option: manual failover without cluster consensus

There are situations where this is not enough, and we want a slave to failover without any agreement with the rest of the cluster. A real world use case for this is to mass promote slaves in a different data center to masters in order to perform a data center switch, while all the masters are down or partitioned away.

The TAKEOVER option implies everything FORCE implies, but also does not uses any cluster authorization in order to failover. A slave receiving CLUSTER FAILOVER TAKEOVER will instead:

  1. Generate a new configEpoch unilaterally, just taking the current greatest epoch available and incrementing it if its local configuration epoch is not already the greatest.
  2. Assign itself all the hash slots of its master, and propagate the new configuration to every node which is reachable ASAP, and eventually to every other node.

Note that TAKEOVER violates the last-failover-wins principle of Redis Cluster, since the configuration epoch generated by the slave violates the normal generation of configuration epochs in several ways:

  1. There is no guarantee that it is actually the higher configuration epoch, since, for example, we can use the TAKEOVER option within a minority, nor any message exchange is performed to generate the new configuration epoch.
  2. If we generate a configuration epoch which happens to collide with another instance, eventually our configuration epoch, or the one of another instance with our same epoch, will be moved away using the configuration epoch collision resolution algorithm.

Because of this the TAKEOVER option should be used with care.

Implementation details and notes

CLUSTER FAILOVER, unless the TAKEOVER option is specified, does not execute a failover synchronously, it only schedules a manual failover, bypassing the failure detection stage, so to check if the failover actually happened, CLUSTER NODES or other means should be used in order to verify that the state of the cluster changes after some time the command was sent.

@return

@simple-string-reply: OK if the command was accepted and a manual failover is going to be attempted. An error if the operation cannot be executed, for example if we are talking with a node which is already a master.