Indexes aren’t just for catalogs or textbooks. Learn how indexes make queries faster.
The following blog post is the ninth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Eight if you missed the third lesson about distributed databases, or watch the complete series.
Indexes are used to make queries run faster and they can be found and used in lots of different contexts. Like everything else related to databases, there are trade-offs with using indexes so it’s important to understand how they work.
The benefit to indexes is that they make queries go faster, allowing us to speed up our applications or improve query times for our analytics. The drawback to indexes is that they take up extra space in the database and can make other operations slower.
Database indexes work much like real-life indexes such as card catalogs in libraries and indexes at the backs of textbooks.
Physical card catalogs at a library aren’t very common anymore, but were once used to look up books in the library by topic. The other common place for an index is in the back of a nonfiction book or textbook. In a long textbook that covers a lot of topics, an index in the back can point you toward a certain topic that you want to find again.
Let’s talk through the steps we go through when using a book index. The purpose of this review is to show you how some things that humans do naturally and easily are referred to as “algorithms” when a computer does them. When you use an index, you’re using an algorithm called “binary search” that is very important in computing and makes indexes very powerful.
Let’s imagine we have a dense textbook with 1,000 pages. Our goal is to find a certain topic that the textbook discusses, but we don’t want to flip through every page to see if the topic is there. In the worst-case scenario we’d have to read the whole textbook just to find what we’re looking for! However, if the textbook has an index, the index will point directly to the pages containing the topic we want.
This index is organized by topic, and the topics are sorted alphabetically. The fact that the index is ordered by a known system (i.e. alphabetical) allows us to use binary search to find the topic that we’re looking for.
Binary search is the process of flipping back and forth between entries to “zero in” on the thing that you’re looking for. You might start too low, then flip a bit too high, then flip a bit lower, then a bit higher until you land on what you’re looking for.
Imagine we’re looking for the topic “frogs” in our index.
If we start out in the index and flip to a random page, maybe we see the entry for “trees.” We know that “F,” for frogs, comes before “T,” for trees, so we know the entry for frogs must be earlier.
So we flip to an earlier page, and we see the entry for “dogs”. We know that “F” comes after “D,” so we know frogs must be on a later page, but before the page where we found “trees.”
So we flip to a later page, and we see “lava”. We’re getting closer but we’re not there yet, now we know we need to go back, but not as far back as “dogs.”
We repeat this process until we zero in on the entry for frogs, and then we can look up the correct page in the textbook to learn about the amphibians in question.
This zeroing-in process is what’s known as binary search. It works on the principle that we know how the Latin alphabet is ordered. By contrast, you would not be able to apply the same order to a Russian book in Cyrillic (because you don’t know how the Cyrillic alphabet is sorted!).
There are two major costs to textbook indexes, though:
The index takes up space. There are a lot of extra pages you have to print and carry around with you!
The index makes the book more expensive because you have to pay someone to actually create an accurate index.
Databases use indexes very similarly to how we use indexes in books, so we can use a lot of the same intuition to understand how these work.
Imagine that we have a users table with millions of users in it.
There is an ID column that is incremented as users are created in our application, and the rows are sorted by this column so that the users created earliest are at the top and users created most recently are at the bottom. We also have a first name and last name column which are not sorted in any meaningful way since users are added to this table when they sign up for our application.
If we want to look up a row from this table by last name, it is going to be very slow for the database:
WHERE last_name = ‘kaminsky’;
Consider the name “Michael Kaminsky” – since the data isn’t ordered by last name, the database is going to have to start at the first row, check if the last-name equals “Kaminsky”, then move to the second row, check if last name equals “Kaminsky”, then move to the third row, and so on for every row in the table.
This scenario presents a problem because if we filter or query very frequently, our database is going to be very slow! What we want to do is create an index on the last-name column that will allow us to find rows by last name much faster.
Let’s consider what an index on last-name would look like for that table. What we’re creating is another table that’s going to be sorted such that we can look up the location of a row in the original table by last name. Because this new look-up table is sorted we can take advantage of binary search to make the lookup process much faster!
Consider the following database index:
With this index, when we filter on last name, the database knows that it can use the index that is sorted to be able to identify the Kaminsky row much faster using binary search.
If we start at row 3 and see “Melchor” we know “Kaminsky” has to be before that, so maybe we bounce up to the first row and see “Caro.” But “K” is after “C,” so with the next step we are able to zero right in on the “Kaminsky” row.
With the “orig_row” for “Kaminsky” in hand, we can look up the rest of the information on the original table about this user.
So why not put indexes on everything? All of our database queries will be really fast, right?
As with books, there are two costs associated with database indexes. First, additional database indexes take up extra space in our database. For every index that we’re adding, we’re taking up additional hard drive space in the database because we’re creating additional tables. Depending on the type of database and the size of the data, this can become a real problem.
More importantly, additional indexes can slow down other types of database operations. For every index that we add to a table, updating or adding data will become slower because the database has to not only update the original table but all of the indexes as well.
The more indexes we have, the slower updates and writes are, but the faster reads will be. That’s the fundamental trade-off.
Different types of indexes are specialized for different types of data. These indexes are slightly more efficient for certain types of columns or certain ranges of values. Different databases will have special documentation about using the different indexes appropriately. If you’re fine-tuning those indexes you should go take a look at the documentation to see what your specific database recommends doing.
In short, indexes are a very important and very powerful tool for increasing the efficiency of certain queries in a database. However, increasing the efficiency of reads comes at the expense of slowing down database writes and updates, because the index must be updated. Moreover, indexes take up storage space in a database. This means you have to carefully consider the probable use cases for a particular table before you decide to build an index or not.
The series continues with Chapter 10 here.