Comparing VDFS with master-master replication solutions

VisualSVN Server provides multisite replication for Subversion repositories based on the VisualSVN Distributed File System (VDFS) technology. VDFS follows the classic master/slave replication architecture, which is often contrasted with the master-master replication architecture. This article briefly compares the two architectures in the context of Subversion repository replication (where the exact order of transactions matters) and describes why the VDFS technology is inherently more reliable than the competing master-master replication solutions.

The main marketing claim of the products based on the master-master replication architecture is that there is ‘no single point of failure’. This statement is probably true, but it does not mean that there are no points of failure at all. On the contrary, there can be ‘multiple points of failure’, and this aspect should be carefully considered. If one of the peer-to-peer replication partners is down or unavailable, the Subversion repository replication may halt in a state that requires an ‘emergency reconfiguration’ to recover. So the entire master-master replication system is only as strong as its weakest link.

On the other hand, VDFS provides simplicity, robustness and almost the same level of WAN-based performance as the competing master-master solutions. The main principle of VDFS is that the master repository does not depend on its slaves, and the replication continues to work even if one or more of your slave servers are unreachable. A simple way to neutralize the ‘single point of failure’ is to put your master repositories into a fault-tolerant environment.

How the replication can be halted in a master-master replication solution

The master-master replication architecture may work well for relational databases, but it has significant disadvantages when it comes to replicating Subversion repositories, where the exact order of transactions matters. As mentioned above, the whole master-master replication system can be halted in some realistic scenarios. An example of such a scenario is briefly described below.

Let’s assume that you have three ‘master’ servers located around the world: in New York (named ‘NY-SVN’), in Berlin (named ‘Berlin-SVN’) and in Rome (named ‘Rome-SVN’). Mario, a software developer located in Rome, commits some changes to the Rome-SVN server. Here is what happens during the commit in the master-master replication architecture:

1. The Rome-SVN server obtains the ‘sequence number’ for the new transaction. The sequence number is usually received from a ‘distinguished node’ (which effectively plays the role of the super-master server), but can also be generated using a ‘quorum’ of replication partners (the latter is a slower option because it requires a handshake between all servers).
2. The Rome-SVN server checks if there are transactions with preceding ‘sequence numbers’ and possibly waits for these transactions to be replicated from other servers. Mario is forced to wait until preceding transactions are replicated to the Rome-SVN server over the WAN (in other words, LAN performance for writes is not always guaranteed for busy master-master replication systems).
3. When the Rome-SVN server is sure that all the preceding transactions have already been replicated, it locally commits the changes made by Mario.
4. Other replication partners are notified and begin synchronizing the new transaction from the Rome-SVN server.

Although the described behavior looks like a high-performance solution, it makes the whole replication system fragile. Consider what happens in the above scenario if the Rome-SVN server becomes unreachable right after the third step is completed. In that case, the other ‘master’ servers cannot replicate the changes made by Mario, and the replication is halted until the Rome-SVN server becomes reachable again. Moreover, manual emergency reconfiguration is required if the Rome-SVN server goes down permanently.
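The halt described above can be reproduced with a small toy model. The sketch below is illustrative only and does not reflect how any particular master-master product is implemented: the `Node`, `Sequencer`, `replicate` and `commit` names are hypothetical, and the model assumes that transactions must be applied strictly in sequence order. As a simplification, a node catches up on preceding transactions before it asks for a sequence number.

```python
from dataclasses import dataclass, field


class ReplicationHalted(RuntimeError):
    """Raised when a needed transaction exists only on unreachable nodes."""


@dataclass
class Node:
    """One 'master' server in the toy model (all names here are hypothetical)."""
    name: str
    reachable: bool = True
    applied: dict = field(default_factory=dict)  # sequence number -> change

    def next_needed(self) -> int:
        # Transactions must be applied strictly in sequence order, without gaps.
        return len(self.applied) + 1


class Sequencer:
    """Stands in for the 'distinguished node' that hands out sequence numbers."""

    def __init__(self) -> None:
        self.last = 0

    def allocate(self) -> int:
        self.last += 1
        return self.last


def replicate(target: Node, peers: list) -> None:
    """Pull missing transactions into `target`, in order, from reachable peers."""
    while True:
        seq = target.next_needed()
        holders = [p for p in peers if seq in p.applied]
        if not holders:
            return  # no peer has anything newer: target is up to date
        sources = [p for p in holders if p.reachable]
        if not sources:
            raise ReplicationHalted(
                f"{target.name} is stuck waiting for transaction {seq}"
            )
        target.applied[seq] = sources[0].applied[seq]


def commit(origin: Node, peers: list, sequencer: Sequencer, change: str) -> int:
    """Commit `change` on `origin` after all preceding transactions are present."""
    # Simplification: catch up before asking for a number, so that a blocked
    # attempt does not leave a gap in the sequence.
    replicate(origin, peers)
    seq = sequencer.allocate()
    origin.applied[seq] = change
    return seq


rome, berlin, ny = Node("Rome-SVN"), Node("Berlin-SVN"), Node("NY-SVN")
sequencer = Sequencer()

commit(rome, [berlin, ny], sequencer, "Mario's change")
rome.reachable = False  # Rome-SVN drops off right after its local commit

try:
    commit(berlin, [rome, ny], sequencer, "Change made in Berlin")
except ReplicationHalted as halted:
    print(halted)  # Berlin-SVN is stuck waiting for transaction 1
```

In this model, neither Berlin-SVN nor NY-SVN can accept a new commit until Rome-SVN is reachable again, because transaction 1 exists only on Rome-SVN and nothing may be ordered ahead of it.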

How VDFS ensures that a master repository is the only point of failure

The VDFS technology takes a different approach. Assume that you have the same three servers: in New York (named ‘NY-SVN’), in Berlin (named ‘Berlin-SVN’) and in Rome (named ‘Rome-SVN’). Berlin-SVN is the master server, while NY-SVN and Rome-SVN are the slave (or subordinate) servers. Mario, a software developer located in Rome, commits some changes to the Rome-SVN server. Here is what happens during the commit if the replication is performed using the VDFS technology:

The new transaction is simultaneously committed to the Berlin-SVN (master) and to the Rome-SVN (slave) servers. This operation adheres to the following invariants:
1. the master server always remains in a consistent state even if the involved slave server fails at any moment,
2. the new transaction is immediately available on the involved slave server after the commit operation is finished successfully.
The NY-SVN server replicates the transaction from the Berlin-SVN server.

As you can see, the master server never depends on any of the slave servers, so you get a high level of robustness if your master server runs in a fault-tolerant environment. In other words, the entire VDFS replication system is as strong as your master server, while master-master systems are only as strong as their weakest node.
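The commit flow above can be sketched with a similar toy model. This is not the actual VDFS implementation: the `Master` and `Slave` classes are hypothetical, and the ‘simultaneous’ commit is approximated by updating the master's history first and the involved slave's copy immediately afterwards. The point of the sketch is only that a commit involves the master and one slave, never the other slaves.

```python
class Master:
    """The single master repository; it alone defines the order of revisions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.reachable = True
        self.revisions: list = []


class Slave:
    """A slave repository that forwards every write to its master."""

    def __init__(self, name: str, master: Master) -> None:
        self.name = name
        self.master = master
        self.reachable = True
        self.revisions: list = []

    def pull(self) -> bool:
        """Copy revisions this slave is missing; return False if the link is down."""
        if not (self.reachable and self.master.reachable):
            return False
        self.revisions.extend(self.master.revisions[len(self.revisions):])
        return True

    def commit(self, change: str) -> int:
        """Commit through the master; other slaves are never consulted."""
        if not self.pull():
            raise ConnectionError(f"{self.name} cannot reach {self.master.name}")
        self.master.revisions.append(change)  # the master's history is updated first
        self.revisions.append(change)         # then the change is visible locally
        return len(self.master.revisions)


berlin = Master("Berlin-SVN")
rome, ny = Slave("Rome-SVN", berlin), Slave("NY-SVN", berlin)

ny.reachable = False                       # NY-SVN is down during the commit
revision = rome.commit("Mario's change")   # succeeds anyway
ny.reachable = True
ny.pull()                                  # NY-SVN catches up from the master
print(revision, ny.revisions == berlin.revisions)  # 1 True
```

Compare this with the master-master case: an unreachable NY-SVN has no effect on the commit, and it catches up from Berlin-SVN once it is back.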

Moreover, the VDFS technology provides easy disaster recovery for master repositories. If a disaster renders the master server permanently inoperable, it is possible to restore it and recover the replication in a short time. For more information, read the KB93: Performing disaster recovery for distributed VDFS repositories article.

Conclusions

The above analysis shows that the VDFS technology has significant advantages over the competing solutions based on the master-master replication architecture. With VDFS you can be sure that the following statements are always true:

a master repository is always readable and writable, even if some of the corresponding slave repositories are unreachable,
a slave repository is always writable if there is connectivity to the corresponding master repository,
replication is resilient to temporary connectivity issues,
read access to a slave repository does not require connectivity to the corresponding master repository,
it is possible to recover if a master repository is lost due to a disaster.
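The read and write guarantees in this list can be summarized as a small decision function. The sketch below is only a restatement of the list under stated assumptions: the `availability` function and its parameters are hypothetical, and a repository is assumed to be usable only while its own server is up.

```python
from typing import NamedTuple


class Availability(NamedTuple):
    master_read: bool
    master_write: bool
    slave_read: bool
    slave_write: bool


def availability(master_up: bool, slave_up: bool, link_up: bool) -> Availability:
    """Which operations are available for one master/slave pair."""
    return Availability(
        master_read=master_up,    # never depends on any slave
        master_write=master_up,   # never depends on any slave
        slave_read=slave_up,      # served locally, no master connectivity needed
        slave_write=slave_up and master_up and link_up,  # needs the master
    )


# The slave is cut off from the master: reads still work on both sides,
# and only writes through the slave are unavailable.
print(availability(master_up=True, slave_up=True, link_up=False))
```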
