Learn about the relationship between isolation and concurrency.
The following blog post is the fifth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Four here if you missed the first lesson about transactions. You can also watch this series here.
We will continue our discussion of transactions by covering two very important topics for understanding databases performance: locks and isolation levels.
Modern databases have to be able to work in parallel. In order to be fast, databases need to be able to process multiple transactions at the same time. This is especially true for transactional databases that need to support scalable applications with lots of simultaneous users.
If there are 20 people on your website checking out at the same time, you can’t force the 20th person to wait for all of the other 19 operations to complete first. You want your database to be able to process those transactions at the same time.
This can get complicated, especially if you imagine that multiple customers are purchasing the same item and the in-stock amount of an item must be accurately accounted for.
However, any time you’re doing processes in parallel, if they’re operating on the same underlying data, you face the danger of what computer scientists call “race conditions.”
Race conditions happen when two processes are trying to change the same data, and the result will change depending on which one gets their first — the two processes are “racing” to do something, and only one of them can win
In this example, let’s take the following table:
And apply the following three processes, in which we try to update the balances to different values all at the same time:
WHERE user_id = 1 TO 400;
WHERE user_id = 1 TO 100;
WHERE user_id = 1 TO 700;
Note that these three processes don't have to have been submitted in the order listed above. Perhaps they were kicked off by three different computers at three different branch locations simultaneously, and now we have a race condition.
We have no idea what the final balance value for user ID 1 will be. It all depends on the order that these operations were processed.
If the top one is processed last, the balance will be 400. If the middle one is processed last, the balance will be 100, and if the bottom one is processed last then the balance will be 700.
This can create a real problem for us. We want to make sure that our databases are robust to race conditions so that we can be sure which values are saved in the database. Otherwise, we run a serious risk of introducing bugs into the application.
There are three very common race conditions in the context of databases. It is important to know what race conditions are in general in order to identify the different types of issues a database might encounter.
Dirty reads: a transaction is rolled back while you are reading the data
Non-repeatable reads: essential data is sometimes changed in the middle of a transaction
Phantom reads: additional rows get added or removed during a transaction.
There are many similar phenomena like “write skew,” “dirty writes,” and many others that people talk about or that don’t have well-recognized and agreed-upon names.
All of these are all flavors of the same thing, in which different operations are performed concurrently, and you end up having bugs or unexpected outcomes in your application because things you assume are going to stay constant get changed in the middle of your process.
These can all be very problematic and cause really nasty bugs that can be particularly painful to fix, because they both happen infrequently and violate people’s mental models of what their applications are doing.
Databases have isolation levels, which reflect the “I” in ACID. Isolation helps databases prevent race conditions. The more isolation that we have, the fewer race conditions that we’ll have and the fewer of these types of bugs we’ll encounter. You should know the basics of isolation levels and that different isolation levels exist so that when you’re reading the documentation or marketing materials for a database, you understand what types of bugs you might encounter as the database is processing transactions.
Like all good things in this world, isolation comes with a cost; increased isolation makes databases slower. In many databases, especially transactional databases, we can choose how much isolation we want between our concurrent transactions.
Roughly speaking, there are four different isolation levels that are widely recognized. They cover a spectrum from more isolation to less isolation. In descending order of isolation we have Serializable, Repeatable Reads, Read Committed, and Read Uncommitted.
Serializable means that transactions are processed and yield the same result as if they were processed in series, i.e. running one at a time. That is, the transactions may actually be processed in parallel only as long as we get the same result as if they were processed in series.
At the opposite end of the spectrum, read uncommitted, is the least isolated and therefore has the highest likelihood of generating those bug-causing race conditions.
This Wikipedia article on isolation further explains the nuances of the different levels, if you are interested.
The key concepts to know are that:
Different isolation levels exist
They rank from the most strict to the least strict
Many databases offer the option to choose an isolation level at the expense of performance.
Be careful and do your due diligence! Anything less than serializable can cause bugs, and not all databases that claim to feature the serializable isolation level actually do.
To understand isolation, you really need to understand locks. Locks are, extremely generally speaking, how isolation is implemented in different databases.
People who design databases have to figure out how to make their transactions serializable and they do that by locking the database. Specifically, databases lock individual rows, columns or tables so that other transactions that need to access that data must wait before proceeding.
If two transactions want to operate on the same data at the same time, the database will “place a lock” on the data when the first transaction starts so that any other transactions can’t operate on that data until after the lock is released. Other transactions have to “wait their turn” in order to access that data.
Different isolation levels will have different locking strategies.
Let’s take a look at a stylized example to get a feel for how this might work in practice.
In this example we’ve got two transactions -- transaction 1 wants to update the balance of row 1 and transaction 2 is going to select that balance for some other operation that it’s performing:
UPDATE balance WHERE
user_id = 1 to 400;
SELECT balance WHERE
user_id = 1;
While transaction 1 is running, the database is going to place a lock on that row so that the select in transaction 2 will not be able to run until that lock is released.
By waiting for transaction 1 to complete, transaction 2 will return the value 400. In short, the database dictates that no other transactions can operate on this particular piece of data until the lock is released by the initiating transaction.
Depending on the complexity or type or operation you are performing, there are different levels of granularity for your locks.
You can lock a row:
or an entire table:
For the purposes of speed, you want to lock as little data as possible so that you’re blocking as few other transactions as possible. Some transactions, however, require you to lock large amounts of data.
One interesting, and extremely painful problem related to locking is known as the “deadlock.” Deadlocks can happen when two transactions are trying to operate on the same data at the same time and both of them are waiting on each other and then everything freezes because neither can proceed until the other releases its lock.
In the example that we’ve given here, transaction 1 is trying to move money from account one to account two, and transaction 2 is trying to move money from account two to account one:
UPDATE balance WHERE
user_id = 1 TO 400;
UPDATE balance WHERE
user_id = 2 TO 200;
UPDATE balance WHERE
user_id = 2 TO 50;
UPDATE balance WHERE
user_id = 1 TO 550;
If transaction two is started in the middle of transaction one, depending on how the locks are handled under the hood, there’s a chance that we end up in this deadlock situation where neither transaction can complete because they each have a lock on a row that the other one needs.
The solution to deadlocks lies in database design principles such as reducing lock time, tweaking isolation levels, and other practices.
There are real tradeoffs between isolation level and speed. More isolation means more locks. At a higher level of isolation, your database is going to have to put more locks on different pieces of data that it’s working with. More locks also means slower transactions. The more locks you have, the more likely it is that you’re going to block another transaction which slows down your application.
We want our apps to run fast, because if our users have to spend ages staring at a little spinning beach ball on our website, then they are going to be more likely to bounce or churn. We also want our apps not to produce erroneous data that might cause problems for our users.
Fundamentally, you must determine the appropriate balance between the frequency of bugs caused by race conditions and the performance of the application.
Another key piece of vocabulary related to these concepts is the idea of contention. The more transactions there are that need to access the same rows, the more contention there is.
Under a “production” load, where there are lots of users that are manipulating the data in the database, there will be more contention for those rows and that data than in a testing database that doesn’t have any users.
Performance and database speed under contention can be much worse than when the operations are performed without any contention, without even taking into consideration the additional load on compute resources or memory.
This can be a problem when you’re preparing to perform a complicated database procedure using a testing environment. The test operation completes really quickly but in production, on a database that is actually being used by lots of different users, the contention produces serious problems with locked rows and other blocked processes.
While the database is in use, and the database is trying to put locks on lots of different pieces of data, other processes are slowed. It’s important to know that the way that a database performs under contention can be very different from how it performs when there are no other processes locking up parts of the database.
We can sum up isolation in four points:
Isolation level determines how likely race conditions are in our database. If we have more isolation, we’ll have fewer race conditions, and if we have less isolation we’ll have more race conditions. These race conditions can lead to bugs that can be very difficult to deal with.
Isolation levels vary across and within databases. Every database implements different isolation levels a little bit differently. Read the documentation closely, and even test to make sure that the database will work as expected with parallel processes.
More isolation means more locks and slower transactions, in turn slowing down the application that is using the database.
Database locking can cause unanticipated problems under production workloads. Make sure your test environment can simulate high contention.
In the next chapter, we will tackle the important and fascinating topic of distributed databases and distributed computing in general.
The series continues with Chapter 6 here.
Update your browser to view this website correctly. Update my browser now