Consider the following when choosing database processing options.
If you spend any amount of time in the business intelligence (BI) and data warehousing world, the perennial debate of OLTP vs OLAP will inevitably appear, much like its sibling dyads, “ETL vs ELT” and “Data Warehouses vs Data Lakes.” Of all these questions, the OLTP vs OLAP one is particularly bothersome because it predates columnar databases and is now mostly redundant. Let’s dissect the two acronyms and evaluate the pros and cons of each.
Before proceeding further, let us get the obvious out of the way:
OLTP: Online Transactional Processing
OLAP: Online Analytical Processing
The two acronyms differ by only one word: “transactional” versus “analytical.” The “online” in both names does not mean cloud-native — the terms predate the cloud. It means the system processes requests interactively, as they arrive, rather than in offline batches.
In short, the choice between OLTP and OLAP is merely a choice between transactional and analytical use cases, which at this point is a false dichotomy. Transactional and analytical services are different use cases with different requirements that ultimately require different designs. With 99% certainty I can say no one is using a data warehouse (DWH) as a transactional system.
In practice, “OLTP vs OLAP” is cited to dissuade the opposite behavior, i.e. using a transactional database for analytics. Engineers who are experienced with and understand transactional systems often instinctively clone transactional databases for analytics, despite the mismatch between the tool and the use case.
If you are building an analytics system, start simple: build a monolithic DWH, and don’t fool yourself that a DWH stand-in will work or be cheaper. Choose an industry-standard columnar datastore as the center of your analytics stack.
An Online Transaction Processing system, in simple terms, handles daily business: making sales, registering users, updating shipping addresses & removing items from a cart. An archetypal example is a shop, such as an ecommerce site, or even the behind-the-scenes software at a brick-and-mortar store. The business of selling items is literally transactional. Transactional behaviors, in the BI/Data space, are those that:
Involve small, but frequent, units of work
Revolve primarily around individual and unique elements of data — e.g., one user is added, one address is updated, one item is removed from the cart
Give reading and modifying data equal or similar importance
Require low latency and immediate syncing: If an item is removed from the cart, it must be immediate. Imagine negative feedback from customers in a world where deleted items don’t clear from the cart when the user reaches the checkout.
Are concerned primarily with recording the current state of the subject
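The properties above can be sketched in a few lines. This is a minimal illustration using Python’s built-in sqlite3 as a stand-in for a transactional database such as Postgres or MySQL; the schema and table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a transactional database
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, address TEXT)")
conn.execute("CREATE TABLE cart_items (user_id INTEGER, item TEXT)")

# Each unit of work is small, touches individual rows, and commits
# immediately, so the change is visible by the time the user checks out.
with conn:  # one transaction: register a user
    conn.execute("INSERT INTO users (id, address) VALUES (?, ?)", (1, "12 Main St"))

with conn:  # another transaction: update a shipping address
    conn.execute("UPDATE users SET address = ? WHERE id = ?", ("34 Elm St", 1))

with conn:  # another: add then remove one item from the cart
    conn.execute("INSERT INTO cart_items VALUES (?, ?)", (1, "mug"))
    conn.execute("DELETE FROM cart_items WHERE user_id = ? AND item = ?", (1, "mug"))

print(conn.execute("SELECT address FROM users WHERE id = 1").fetchone()[0])
```

Note that every statement targets a single row by key — the workload is many small writes and reads, not one big scan.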
One might observe that the above sounds like almost any modern web application. Indeed, anything that has a login or a ‘create X’ function is more than likely an OLTP system. This blog you are reading is a good example — under the hood users may create, remove and edit articles, and so on. Fivetran’s customer-facing app is another, with transactions such as ‘creating one connection’. Looker is yet another (updating one Look).
A nuance to understand is that OLTP refers to entire systems that perform a certain kind of work, not a database. In the case of OLTP, most of the “work” of applying business rules is handled and curated by the application. This application almost certainly uses a database designed for transactional performance (Postgres, MySQL, MongoDB, DynamoDB, Redis) but most often it is the application that does the “processing” and the database merely provides a storage solution in the system. In short, OLTP is application-centric.
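To make “application-centric” concrete, here is a hedged sketch in which a business rule lives in application code and the database merely stores the result. The function name, the quantity rule, and the schema are all hypothetical.

```python
import sqlite3

def add_to_cart(conn, user_id, item, qty):
    # The business rule is enforced by the application, not the database:
    if qty < 1 or qty > 10:
        raise ValueError("quantity must be between 1 and 10")
    # The database's role is reduced to durable storage of the outcome.
    with conn:
        conn.execute("INSERT INTO cart_items VALUES (?, ?, ?)", (user_id, item, qty))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart_items (user_id INTEGER, item TEXT, qty INTEGER)")
add_to_cart(conn, 1, "mug", 2)
```

The same rule could be pushed into the database as a CHECK constraint, but in typical OLTP systems most such logic stays in the application layer, as described above.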
Online Analytical Processing systems are focused on aggregating data to provide analysis. They are characterized by scalability and utilization of distributed computing to handle increasingly large datasets. Examples include columnar data stores like Redshift, Snowflake and BigQuery, as well as distributed systems like Hadoop and Spark. Some defining properties include:
Data is stored and accessed online - this keeps analytics insights synced across all users.
Use cases, which are primarily reports, return summaries across large volumes of data points by means of aggregation.
Workloads are less frequent but larger in magnitude. The average OLAP system serving as the data warehouse for an online shop may periodically process queries spanning thousands or even millions of rows.
Writing data is less of a priority than reading data. Any active OLAP system will be continuously reporting on its dataset, whereas, depending on the data pipeline, it may only load new data daily.
Being online, a true modern OLAP system should make use of distributed computing, allowing complex analytical processing to scale across increasingly large datasets.
In the case of an OLTP, most of the processing is done within the application code and the database takes a secondary role. By contrast, in the world of OLAP, our database system – a data warehouse – stands at the center. It is the environment in, or at least on, which the work is done.
The scale and usability of our OLAP system depends on how a Spark job can shape our data or how quickly Snowflake returns our SQL statement. A team of analysts can build data models for dashboards and reports by executing code against a DWH alone. Some means of data visualization sits atop the DWH and leverages it to do the processing. This is exactly what common BI platforms such as Mode, Qlik, Looker, and more do — they outsource the processing to a client’s data warehouse and directly return the processed results.
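That “outsource the processing” pattern can be sketched as a client layer that merely translates a report definition into SQL and lets the warehouse do the aggregation, returning only summarized rows. All names here (the function, table, and columns) are hypothetical, and sqlite3 stands in for the warehouse connection.

```python
import sqlite3

def build_report_sql(table, dimension, measure):
    # A BI layer's job, in miniature: turn a report spec into SQL.
    return (
        f"SELECT {dimension}, SUM({measure}) AS total "
        f"FROM {table} GROUP BY {dimension} ORDER BY {dimension}"
    )

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse connection
conn.execute("CREATE TABLE sales (country TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("DE", 100.0), ("DE", 50.0), ("US", 75.0)])

sql = build_report_sql("sales", "country", "revenue")
results = conn.execute(sql).fetchall()  # warehouse computes; client gets summaries
print(results)
```

The design choice matters: the client never pulls raw rows across the wire, only the already-aggregated result, which is why these tools scale with the warehouse rather than with the analyst’s laptop.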
An Offline Analytical Processing system is one whose primary concern is analysis, but need not be available in a distributed way. It is still extremely common, as exemplified by the popularity of Microsoft Excel. Classic PowerBI & Tableau are two other notable examples.
Offline analytical processing fundamentally fulfills the same need as OLAP systems, albeit on a local machine. If we consider a report built in Excel we can see some distinctive properties of offline analytical processing:
The data is stored and computed locally, i.e. offline. Making changes doesn’t affect copies of the same report others might have.
There is limited ability to accommodate low latency and real-time syncing. Offline analytical processing is used episodically to produce reports. The historical context and movement through time is more important than the current state.
Inputs are changed infrequently and in batch: the input CSV is downloaded from its source and replaced, perhaps on a daily, weekly or even monthly interval.
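In code, the offline pattern looks like this: a periodically re-downloaded extract is the only data store, and all aggregation happens locally in the tool. This sketch uses an inline CSV string standing in for the downloaded file; the column names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# A periodically re-downloaded extract, standing in for the input CSV file.
extract = "month,sales\n2024-01,120\n2024-01,80\n2024-02,50\n"

# All "processing" happens locally, in the tool, on this machine.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(extract)):
    totals[row["month"]] += float(row["sales"])

print(dict(totals))
```

Replacing `extract` with a fresh download is the whole refresh cycle — which is exactly where the syncing and scaling problems described next come from.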
Offline analytical processing need not involve a database, other than as a store or source of data. The application of business logic and analytical aggregation is performed by and in our tool. The design of these tools hits two problems quite quickly:
Offline report storing leads to divergence and syncing issues. Two reports, once identical, will drift apart on separate computers.
Our processing power is limited by the computer we are using. No meaningful design solution exists to scale our data volume from hundreds of thousands of data points to hundreds of millions.
There is a case for using offline analytical processing for small-scale, ad hoc reports, but in general, it’s largely obsolete.
By now, you should have the answer to the question of OLAP vs OLTP. They are systems which are each tuned for different purposes, namely analytics and transactions respectively. In short:
If you are reporting or summarizing large datasets, use an OLAP
If you are logging users in or running row level updates or inserts, use an OLTP
If, as is most likely in practice, you are doing a mix of the above, then you’ll need both database types in different parts of your stack