Learn about the differences between databases designed to perform day-to-day operations and those designed for analyzing data.
It’s important to understand the two major paradigms for working with databases – analytical and transactional – in order to understand the trade-offs one might make when choosing different database technologies. In this post, we’ll cover the use cases and frequently used technologies for both of these paradigms.
When we think about the difference between analytical and transactional databases, we should consider the two different types of people who use these databases. Data scientists and analysts spend most of their time working with data, but in a very different way than software engineers.
Data scientists and analysts work analytically, executing queries against a database and using the data to build reports, dashboards and predictive models. They care a lot about how the data is stored and how to access it efficiently.
That’s very different from what software engineers experience. Software engineers think about app stability, app maintainability, questions about throughput, and how fast the app responds to certain queries.
It’s the difference between these two types of users that will inform how we think about these different paradigms.
The first thing to keep in mind about analytical workloads is that analysts and data scientists are interested in processing large amounts of information in order to calculate summary statistics. These are called aggregate functions and involve grouping together and performing computations on values from many rows. An analyst might, for instance, be interested in calculating the average value of all of the orders placed last quarter in the state of California.
In order to produce more sophisticated insights, analytical workloads may require complex queries involving multiple steps of joining, filtering and other data processing steps. Sometimes, these analyses are performed on an ad hoc basis and never repeated.
Since analysts and data scientists only analyze data from business operations and do not build systems to perform the operations themselves, analytical workloads only require read-only queries and batch-write data loads.
By contrast, transactional workloads generally do not involve aggregates over multiple objects. Instead, software engineers are usually most concerned with managing the state of one object at a time. Objects include any entity represented by a row in a table, such as “a user” or “an order.” In a transactional workload, we are managing one user, order, patient or similar object at a time. These are often referred to as CRUD (Create, Read, Update, Delete) operations.
A system that performs transactional workloads must be able to accommodate huge volumes of CRUD operations, i.e. transactions, per second. Moreover, the system must be able to ensure the integrity of every transaction.
We can summarize the differences between analytical and transactional workloads in the following way.
Analytical workloads involve:
Calculating complex aggregate functions
Read-only queries and batch-write loads
Ad-hoc, non-routine analyses
Transactional workloads involve:
Operating on one “object” at a time
Precisely managing the state of a database
Supporting many operations per second with high throughput
Some very common transactional databases include open-source systems such as PostgreSQL and MySQL, and proprietary systems such as Microsoft SQL Server and Oracle Database. This is by no means an exhaustive list of databases within these paradigms, but they’re major players in the industry that you have likely encountered before and hopefully now understand better.
In the next part of the series, we’ll dive deep into how these different types of databases function, how they’re architected, why they work the way that they do, and what that implies about the tradeoffs between these different types of databases.
The series continues with Chapter 3 here.