Learn about the history of databases, important terminology, and different types of databases you might encounter.
Databases are super important, and they’re fundamental to how we use computers. They allow us to durably save information over time even if we turn the computer off. They organize data in ways that are convenient for both humans and software to access and analyze.
When we talk about “databases,” we really mean database management systems or DBMS. Colloquially, we will use the term “database” to refer to such systems throughout this series.
There are databases under the hood of lots of different software applications that you use. Any website that has a login probably has a database somewhere in the background. Facebook, Google, any e-commerce store -- all rely on databases.
When computers were first developed, databases as we know them didn’t exist. For a while, there wasn’t even any durable storage – if you turned the computer off, you would lose any data that you had input or created. Imagine computers that operated on reams of magnetic tape where the data was organized in a custom format for a single-use application.
In the 1960s people realized that data stored in a general structured way would be really convenient, and they started developing proto-databases to achieve this. In the 1970s things really took off with the invention of relational database management systems and SQL. This was where we started storing data in tables, unlocking the ability to do way more with computers than what people had imagined before.
It’s difficult to exaggerate how important this development was in the history of computing, as relational databases still underpin a huge amount of the internet and computing infrastructure that we use every day.
The growth of the internet in the early 2000s caused the volume of data that we wanted to store to increase rapidly. This led to the development of NoSQL (not only SQL) and big-data databases to enable building web-based software quickly on a massive scale. The following decade, the 2010s, saw a lot of development made on MapReduce and massively parallel processing. After we stored so much data in the 2000s, we needed infrastructure for analyzing all of that data, so the industry pursued distributed computing to be able to process all of that efficiently.
Today, several interesting technologies are being actively developed. I’m particularly interested in some of the new streaming databases that are optimized for analyzing data as it is coming into the database. Other application-specific databases are being developed for very specific use cases, like time series. As the needs of organizations evolve, we will continue to see innovative database development work into the future.
When I think about databases, it’s helpful for me to think about some of the following broad categories:
Analytical vs. Transactional
Relational vs. Non-Relational
Distributed vs. Single-Node
In-Memory vs. On-Disk
Any time I encounter a new database technology, I think about where it fits in these paradigms in order to understand what types of use cases it’s optimized for and where it might run into difficulties.
These topics are all very important, and they provide a roadmap for what we’re going to cover in this series so that you understand what these terms mean and what different types of trade-offs are made for different paradigms.
We’re going to dive deep on each of these different topics (and why they’re important) in future lessons, but this post will just provide a thirty-thousand-foot overview.
A key distinction for databases is whether it’s analytical vs transactional. In an analytical world, you might be analyzing large amounts of data. Maybe you’re an analyst or a data scientist who is summarizing data to answer a business question.
In a transactional world, you’re managing state for other software applications. You’re keeping track of which users are logged in, which users are logged out, which orders have been shipped or which ones haven’t.
Those are two fairly different use-cases, and we see lots of databases that are optimized for one or the other. We see data warehouses like Redshift, Snowflake, BigQuery or a MapReduce datastore like HDFS on the analytical side, and technologies like PostgreSQL, MySQL, Oracle DB and Microsoft SQL server on the transactional side.
Relational and non-relational make up another very important distinction in databases. In a relational database, data are stored in tables that can be joined together -- one table relates to another and a user or application can follow those relations to get the data they need out of the database.
In non-relational databases, the idea of having different tables is much less important. Data are generally stored in one big blob in things called “documents” which don’t necessarily have a fixed schema or shape. In one document you might have one set of attributes, but in another document, you might have a different set of attributes
The distributed database is another really important concept, and we will eventually dive deep into the subject. With today’s technology, it’s really important to understand distributed computing in general, but at a high level, a distributed database distributes data across multiple different computers or servers. You might have part of your database in a data center in New York, and another part in California, and another part on the other side of the world.
This is in contrast to a single-node database where all of the data are stored together on one computer or server. These are much easier to reason about, but less fault-tolerant and less powerful computationally.
Distributed databases have seen a lot of development in recent years. On the distributed database side, we’ve got tools like Google Cloud Spanner and Azure Cosmos, plus all of the big data warehousing technologies; whereas on the single node side you’ve got traditional databases like PostgreSQL / MySQL and many in-memory databases.
“In-memory” and “on-disk” describe how the data are stored and accessed by the computer. In memory, of course, refers to RAM while on-disk refers to permanent hard-drive storage.
If the entire database is loaded into RAM, the database can be very, very fast because reading data from disk is the slowest way of accessing data. However, if the database is saved on the hard disk, it’s going to have more properties like the ability to store more data since computers tend to have more hard-drive space than RAM, and also experience a lower risk of losing data if the program crashes.
Some popular in-memory databases include Redis and MemSQL. On-disk databases include most traditional databases and data warehouses.
We can’t discuss databases without talking a little bit about SQL, which stands for “Structured Query Language.” SQL is a language for interacting with databases. It’s definitely different than a lot of other programming languages that people have experience with.
What makes SQL unusual is that it’s a declarative programming language, which means that you have to describe to the database the type of output that you want, and then the database is in charge of figuring out the best way to get you that output.
People tend to refer to SQL as one thing, and there are in fact ANSI / ISO standards for what SQL should look like; but in fact, the way it’s implemented for every database is different. If you’ve used SQL in one type of database, probably 90% of things will be the same in another, but there will be minor differences in the dialect. Every database has its own unique flavor of SQL that it uses for different types of operations.
It’s not too difficult to overcome this once you’re familiar with SQL, but it’s important to know that you can’t just take a query for PostgreSQL and expect it to work right away in a MySQL database. There will have to be some translation between the two technologies.
It’s also important to know that some databases don’t use SQL at all. Redis, for example, does not have a SQL interface.
We’ll have a whole lesson on SQL in the future, but this is generally what it looks like:
CREATE TABLE users AS
WHERE attribute IS TRUE;
Here we see that within SQL there are different types of operations. DDL, or “data definition language,” tells the database about the types of actions we want to perform – creating tables, specifying column types, etc.
We also have the data query language which is how we describe to the database what we want from the database, and both of these types of operations (plus many others!) are all valid SQL.
In the next few pieces, we’ll use these underlying concepts to get our hands dirty and give you a robust understanding of how and why these databases work the way that they do.
The series continues with Chapter 2 here.