What is the difference between distributed and single-node databases?
The following blog post is the sixth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Five here if you missed the second lesson about transactions. You can also watch the series here.
Let’s start our discussion of distributed databases by comparing distributed and single-node databases, and discussing where you might have encountered these technologies before.
Distributed databases are made up of, and store data on, multiple computers. By contrast, single-node databases run on a single computer (we often refer to these computers as “servers,” but it’s good to remember that “server” is just another name for a computer).
Examples of distributed databases include Google Spanner and Azure Cosmos, as well as all of the “big data” warehouses like Redshift, Snowflake and BigQuery. Single-node databases include the “classic” databases like PostgreSQL, MySQL, SQLite and many others.
Distributed databases were developed for a few key reasons:
For some applications, we might need to store more data than can fit on any one computer.
Additionally, analyzing huge volumes of data takes a very long time. We can speed up our queries by using the computational power of multiple computers at the same time.
Finally, resiliency is extremely important for many applications. We can’t allow our whole system to go down if our database hardware fails or if there is a network error. If we have all of our data stored on one computer and that computer crashes, then our system will completely stop working.
The first question you might ask is “why can’t we just use bigger and better computers?” Bigger and better computers do work up to a certain point. A lot of people working with PostgreSQL or MySQL databases, for example, will upgrade the underlying hardware for that database for a while before needing to switch to a distributed database.
Big computers, however, have the following disadvantages:
Big computers are really expensive. The bigger the computer, the more expensive, and it scales non-linearly. A computer that has 128 CPUs is going to be more expensive than 16 computers that have 8 CPUs each. Computers get disproportionately more expensive as you make them bigger and bigger.
There’s a limit to how big anyone can make a single computer. At some point you’ll reach a level where you just simply can’t buy a computer bigger than the one you currently have.
The more specialized your hardware is, the more likely it is to break. Specialized hardware is more fragile, has more moving parts, and is more difficult to replace, so increasing the size of our computers exposes us to more risk of a breakdown.
An important concept in distributed computing is fault tolerance. If our database is all on one computer, and that computer goes down, then our app goes down for all of our users.
If we have an application or a system that’s mission-critical or business-critical, we don’t want the whole system to be taken down if and when there are small breakages, the power goes out in our data center, or the disk fails on our computer. These are all regular, otherwise routine stoppages.
With a distributed database, our database is made up of multiple relatively smaller computers that act together as one. This adds redundancy to the system. If one computer in our database goes down, our database can continue functioning. We can deal with more data this way and also improve our fault tolerance.
You will see the following terminology a lot if you read marketing materials or documentation about distributed databases.
Cluster – In general when we’re talking about a distributed database that’s made up of multiple different computers, we will refer to the computers collectively as a “cluster.” So if you see the words “database cluster” it normally means that there are multiple computers underlying the database.
Node – Instead of saying the word “computer,” in general we use the term “node” to describe the individual machines that make up the distributed database cluster.
So if we have a “cluster” that’s made up of four “nodes,” what that really means is that there’s one database made up of four individual computers that are networked together.
There are two big paradigms of distributed databases. The first is “big compute” and the second is “high availability.”
In big compute databases we “split” or “shard” the data across different nodes, and each node executes the query against a subset of the data. Then, all of the results are combined. This allows us to process our data faster by having multiple computers each do a piece of the work, rather than having one computer do all of the work.
In high-availability databases, we make redundant copies of the data on different nodes, making the system extremely fault-tolerant. If one node breaks or falls out of contact with the rest of the nodes, queries can still complete and users can still use the system. This reduces dependence on any one piece of hardware.
Here is a simplified example of splitting or sharding data evenly across three nodes in a cluster. Consider the following table:
Each node in our cluster has ⅓ of the data in our table:
You’ll notice that the data doesn’t overlap at all, and the data on each node is distinct.
In the ideal case, the data will be divided evenly across our three nodes, though that isn’t strictly required in practice.
Here’s how this query will be processed by our distributed database cluster. Here we have a query to get the count of users in each state in our table:
The distributed database will send the query to each of the different nodes and get the count of the users in each state for the data that is within that node. We end up with three tables of intermediate results:
SELECT state, COUNT(1)
GROUP BY state;
We can then sum them up into the final output:
By splitting up this query across these three nodes, we can actually get results back three times as fast as we would if there was only one computer that was trying to process the query for all of the data at once.
High-availability databases, by contrast, don’t process large amounts of data quickly but rather are extremely fault tolerant. There are many ways that high-availability databases work, but the simplest approach is to put a full copy of all of the data on each node in the cluster. If we have three nodes, then we have triplicated our data. Even if two of our three nodes are down, the system will still be able to run queries.
With the big-compute paradigm, we’re going to crunch lots of data and handle heavy analytical workloads. If you’ve worked with Redshift, BigQuery, Snowflake or any of the map-reduce family of tools, then you’ve worked with big-compute distributed databases.
Under the high-availability paradigm, we’re going to be working with mission-critical databases and applications. Some existing tools you might have heard of include Google Spanner, Azure Cosmos and many others.
In short, distributed databases allow us to store more data and scale horizontally instead of vertically by adding more small computers instead of continually increasing the size of one main computer. Distributed databases can allow us to perform queries in parallel over large amounts of data, leading to massive efficiency gains, and can also make our systems more resilient by increasing fault tolerance in the case of hardware or network failures.
In our next lesson we’ll cover some of the unique problems that go along with distributed databases.
The series continues with Chapter 7 here.