Databases Demystified Chapter 8 – Distributed Databases Part 3 | Blog

The following blog post is the eighth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Seven if you missed the second lesson about distributed databases, or watch the complete series.

Consensus means broad agreement on the truth. In the context of distributed databases, achieving consensus requires agreement between every node, even when we have faulty communication channels. We must ensure that nodes agree on the truth so they don’t provide inconsistent results to queries.

This is tricky because there are lots of ways that distributed databases can go wrong. From slow network connections to individual nodes going out of service, our distributed databases need a way to get the whole system to agree on the correct values.

The Two Generals Problem

Before we discuss consensus in more detail, consider one of my favorite thought experiments in computer science, the Two Generals Problem. It illustrates the difficulty of coordinating an agreement between two parties when there’s a faulty communication channel.

Suppose we have two generals on one team who are trying to defeat an opposing general. Scipio and Marcus of Rome are trying to defeat Hannibal of Carthage. Hannibal’s army is big enough that it can defeat either Scipio or Marcus separately, but not both of them together.

In order to win, Marcus and Scipio must attack at the same time. If either attacks alone, they will lose. The way the battle map is laid out, Scipio and Marcus are on opposite sides of the valley where Hannibal has his camp.

The wrinkle is that the only way for Scipio and Marcus to communicate is by sending messengers through the valley where Hannibal is encamped. We know that some messengers won’t make it. Hannibal is on the lookout for messengers and will likely catch a few.

Marcus and Scipio must reach a consensus about when to attack. If they successfully attack at the same time, they’ll win and return to Rome victorious. If they fail to coordinate and attack at different times, they’ll be trampled by Hannibal’s elephants.

Can we devise an algorithm such that Marcus and Scipio are guaranteed to attack at the same time? An instinctive solution many people come up with is to have the generals send a “confirmation message” back when they’ve received a message:

Marcus sends a proposed time to attack
Scipio replies with, “Message received, let’s attack tonight”

However, this process fails if the confirmation message gets intercepted. Scipio will attack but Marcus won’t. Scipio agreed on the attack timing, but since Marcus never received the confirmation, he will not attack. The Romans lose.

What if we just have another confirmation layer?

Marcus sends a proposed time to attack
Scipio replies with, “Message received, let’s attack tonight”
Marcus replies with, “Received your confirmation”

We can keep adding layers ad infinitum, but we still have the same problem. The last general to send a message never knows if his confirmation message actually made it or not.

There is, in fact, no solution to this problem. There is no way for Marcus and Scipio to arrange a time to attack where they’re both certain that they’ve agreed on the same time. The Two Generals Problem teaches us that there are situations where it’s impossible to reach perfect consensus if the communication channel is faulty.

Consensus in Distributed Databases

If we revisit our distributed databases context, we can liken the two generals to two nodes in our database trying to reach consensus. The messengers that are passing through Hannibal’s camp are messages between nodes that have to travel along an unreliable network. They can be dropped due to network outages or delayed indefinitely due to congested network traffic. The two nodes can’t be sure that they’ve reached an agreement because they can’t be positive that the other node actually received the last message that was sent.

Consensus is generally a problem for us when we have:

Multiple nodes operating simultaneously
Unreliable network connections between nodes

In general, we discuss consensus in the context of high-availability frameworks where we have multiple copies of the same data on different nodes and we need all of the nodes to agree what the true value should be following a series of operations.

Why Not Choose the Most Recent Value?

Why not just use timestamps to identify the most recent value across the nodes, and use the most recent value?

The biggest reason this won’t work is that we can’t actually trust the timestamps on the different nodes. The clocks won’t be perfectly in sync on the different nodes, so we can’t trust that the timestamps across the different nodes match. Even if the clocks are only off by a few microseconds, that can wreck our strategy of just using timestamps.

Moreover, network speeds between nodes may vary, so we can’t necessarily be sure that events sent in one sequence, A then B, will arrive at all the nodes in the same order. Some nodes might actually receive B then A.

Instead of clock time (e.g., 5 p.m. 36 minutes and 29.74 seconds), we strive for agreement on the ordering of the operations that were applied to the database. If our nodes can agree on the correct order of the operations applied, the nodes should be synced up with all of the same values.

A key strategy for making this happen is to have the individual nodes “elect” a leader node in charge of coordinating the log of operations and communicating “truth” to all of the rest of the nodes. So the leader node will determine the log and then tell all of the different follower nodes what the correct ordering should be.

But then the next issue is: How do the nodes reach consensus on which one is the leader? How do the nodes figure out which among them is going to be the leader and in charge of the one true log? Remember that nodes, even the leader node, can fail or fall out of contact at any time, so this is a problem that could happen at any time!

Raft and Paxos

In practice, there are two main consensus algorithms that are used widely today: Raft and Paxos. These algorithms are famously complex and challenging to implement correctly — we’re not going to talk about the details of these algorithms (though if you’re interested I’d definitely encourage you to look up how they work online). What’s most important to know is that they both involve multiple rounds of votes in order for the nodes to align on an agreed-upon truth.

You might have one node proposing a value, other nodes agreeing or disagreeing, and through multiple rounds of voting the cluster as a whole can reach agreement about what the truth should be. This voting procedure has really nice properties that we want in our distributed database.

Consensus in a Nutshell

The important things to take away from this lesson are:

Consensus is the process by which nodes in a distributed database agree on the truth.
The Two Generals Problem illustrates the impossibility of perfect coordination with a faulty communication system.
Raft and Paxos are two really important algorithms for reaching consensus in a distributed database.

Distributed databases are powerful but pose serious coordination-related challenges. As cloud-based services and computing continue to grow in prominence, you will likely only hear more and more about distributed databases. Hopefully this primer has given you a decent working knowledge of the topic!

The series continues with Chapter 9 here.

Databases Demystified Chapter 8 – Distributed Databases Part 3

The Two Generals Problem

Consensus in Distributed Databases

Why Not Choose the Most Recent Value?

Raft and Paxos

Consensus in a Nutshell

Start for free