Take a deep dive into transactional databases.
The following blog post is the fourth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Three here if you aren’t familiar with the difference between row and column stores. You can also watch this series here.
When we consider how things can go wrong with databases, we need to think about how computers work at a very basic level. In this lesson we will walk through the steps of certain programmatic operations that a database performs and how the database might respond to different things that can go wrong.
In a database, a transaction is a collection of commands that must all be executed together. It tells us which of the operations completed successfully and what has or hasn’t been saved permanently. In spite of its name, transactions are useful in all databases, not just ones optimized specifically for transactional workloads. However, databases that are optimized for transactional workloads will have more fine-tuned control for managing transactions and processing lots of them very quickly.
Let’s imagine that we’re building a software system at a bank and we are programming the ability to make transfers.
There are three steps to this process:
Subtract $100 from a checking account
Add a row to a “transfers” table
Add $100 to a savings account
This is a fairly simple operation: take $100 from the checking account and move it to the savings account.
Now, what happens if something goes wrong in the middle of this operation? What happens if the power goes out between steps 2 and 3?
If we make our transfer process a transaction, we will know exactly what happens when the power goes out – either the whole transaction will complete and be “committed” all together, or the whole transaction will be “rolled back” (defined below) – and we’ll go back to the database state before the transfer started.
This way we can be sure that we didn’t accidentally misplace some of our client’s money.
Once the system is back up and running, the system or the user can retry the transfer without any issues. By grouping these important steps together, we make sure we don’t lose important data. All of the steps either happen together or don’t happen at all.
Before we further discuss transactions, we must understand database constraints. Database constraints enforce consistency by allowing programmers to tell databases what types of data – and by the same token, attempted transactions – are valid or not.
You might have a column that may only have unique values – such as the user email field in your web application – so each user must use a different email address. You might specify that a column can’t have any null values in it. Or you might specify that a column is an integer and can’t have text values in it. There are several different types of constraints that one might want to enforce.
Any operation that violates a database constraint will result in an error. The database will simply refuse to insert a null value into a non-null column, or add a duplicate value into a unique column.
When we’re talking about transactions we use three very important commands or words: begin, commit and rollback.
“Begin” will start a transaction
“Commit” will complete a transaction and save changes
“Rollback” will abort a transaction that has begun and roll back to the previous state.
As we go through each of these examples, we’ll be using this terminology. You’ll often see these words referenced if you do any reading about transactions on the internet.
Database practitioners very frequently refer to transactions as being “ACID,” an acronym for Atomic, Consistent, Isolated and Durable.
Each of these terms is very important for understanding what a transaction is and how it works.
When we say that a transaction has atomicity, what we mean is that the transactions happen completely or not at all. Transactions never complete partway. Either a transaction completes entirely or fails.
Every command between begin and commit must complete successfully, otherwise we’re going to revert all of the changes that were attempted.
When we repeat this same process but we encounter an error (for some unspecified reason), we can see that after we roll back, the values in our accounts on the right-hand side revert to what they were before we started the transaction.
The next concept, the C in ACID, is consistency. This means that databases must maintain a valid state. By the end of the transaction, the database must obey all of the constraints on the database. The database may never go from a valid state to an invalid state via a transaction.
What this means is that all of the constraints are checked by the end of the transaction to make sure that we did not violate one of these constraints. Otherwise, our transaction will be rolled back.
Consistency allows us to more easily reason about our database and be confident that we don’t end up in an invalid state.
The idea behind isolation is that transactions shouldn’t interfere with each other. We’ll give an example of how this goes wrong if we don’t have good isolation in our database.
Imagine that we have two transactions that the database is running simultaneously, and we want to make sure they don’t interact in a way that causes problems in our database.
Transaction 1 is a long-running and complicated transaction that starts with subtracting 100 from the checking account balance.
Then, while transaction 1 is still running, we start transaction 2 and subtract another 100 from the checking account balance. So we subtracted 100, then another one hundred, and so now our balance is 300.
But what happens if transaction 2 commits, but then our transaction 1 fails and rolls back?
|Transaction 2: Subtract 100||400|
|Transaction 1: Subtract 100||300|
|Transaction 1 Failure; Roll Back||500|
We’re going to end up in a state where the application thinks the second transaction was performed, but the database ends up in a state as if it wasn’t. This is very bad!
We want our transactions to be isolated so that they run independently. In Transactions Part Two, we’ll dive deeply into the different ways that isolation can be a problem and how different databases deal with that.
The last important concept, the D in ACID, is durability. Once a transaction completes, it should be saved to disk and be considered more-or-less permanent.
Before the transaction completes via a “commit,” the changes are not saved to disk. As soon as the commit is performed, the data is written to disk and the changes are stored and considered permanent until the user specifies subsequent changes.
That’s the last step in a transaction -- saving it and writing it down, so that the application or user can reliably access those values in the future.
Let’s apply ACID to our bank account transfer example from before.
Atomicity: by wrapping the three steps of our transfer process in a transaction (subtract $100 from a checking account, add a row to a “transfers” table, add $100 to a savings account)
Consistency: the transfer changes one numerical value to another numerical value
Isolation: trivial, because only one operation is being performed
Durability: the updated record is written to disk once the atomic transaction is completed.
In the next part of the series, we will take an even closer look at isolation levels and how databases achieve isolation using “locks.”
The series continues with Chapter 5 here.