In this excerpt from The Essential Guide to Data Integration, we describe data integration, the rise of the cloud, and the difference between ETL and ELT.
The following blog post is an excerpt from the book, The Essential Guide to Data Integration: How to Thrive in an Age of Infinite Data. The rest of the book is available to you for free here.
Continued from An Overview of Data Integration and Analytics.
Data integration consists of the following steps:
Data is gathered from sensor feeds, manual data entry or software, and stored in files or databases.
Data is extracted from files, databases and API endpoints and centralized in a data warehouse.
Data is cleansed and modeled to meet the analytics needs of various business units.
Data is used to power products or generate business intelligence.
A scalable, sustainable approach to analytics requires a systematic, replicable approach to data integration — a data stack.
A data stack consists of tools and technologies that collectively integrate and analyze data from a variety of sources. The components of a data stack include:
Data pipeline and data connectors. Software used to extract data from a data source and load it into a data warehouse.
Data warehouse and/or data lake. A data repository of record designed to permanently accommodate large amounts of data.
Data modeling and/or transformations. Oftentimes, it may be necessary to prepare your data by applying custom business logic, such as changing column names or conducting aggregations.
Business intelligence tool. Software meant for summarizing, visualizing and modeling data in order to guide business decisions.
The most basic unit in a data pipeline is a piece of software called a data connector. A data pipeline may contain one or several connectors, each of which extracts data from a source and routes it to a data warehouse. Transformations can either be performed before the data arrives in a data warehouse, or within the data warehouse after the data arrives. Finally, the data is analyzed with the help of a business intelligence tool. The individual components of a data stack can be hosted on-premise or in the cloud.
The traditional approach to data integration, known as extract-transform-load (ETL), has been predominant since the 1970s. The acronym ETL is often used colloquially to describe data integration activities in general. ETL evolved at a time when computing power, storage and bandwidth were scarce and expensive.
An ETL system performs the following steps:
Extract – data is extracted from connectors
Transform – through a series of transformations, the data is rearranged into models as needed by analysts and end-users
Load – data is loaded into a data warehouse
Visualize – the data is summarized and visualized through a business intelligence tool
Transformations must be specifically tailored to the unique configurations of both the original and the destination data. This means that upstream changes to data schemas, as well as downstream changes to business requirements and data models, can break the software that performs the transformations.
Since ETL does not directly replicate data from each source to the data warehouse, there is no comprehensive repository of record for analytics. Failures at any stage of the process will render the data inaccessible to analysts and require engineering effort to repair.
The traditional ETL process has three serious and related downsides:
Complexity. Data pipelines run on custom code dictated by the specific needs of specific transformations.
Brittleness. Parts of the code base can become nonfunctional with little warning, and new business requirements and use cases require extensive revisions of the code.
Inaccessibility. More importantly, ETL is all but inaccessible to smaller organizations without dedicated data engineers.
Even a casual observer of technological trends knows that computation, storage and bandwidth have become cheap and ubiquitous.
The convergence of these three cost-reduction trends has created the cloud — namely, the use of remote, decentralized, web-enabled computational resources. Cloud technology, in turn, has given rise to a huge range of cloud-native applications and services unshackled from physical infrastructure.
An extract-load-transform (ELT) stack replaces on-premise technologies with cloud-native SaaS technologies. Properly implemented, the modern data stack delivers continuous data integration and organization-wide accessibility, with a minimum of manual intervention and bespoke code.
Switching the order of the loading and transformation stages addresses each of the three major shortcomings of ETL:
Complexity. The pipeline is simplified — warehousing standard schemas shifts a great deal of pipeline-related work downstream to analysts instead of data engineers.
Brittleness. The pipeline is more resilient and less risky — because transformations are applied after the data is warehoused, breakages caused by changes in source systems mainly affect the analytics layer.
Accessibility. The pipeline is more accessible because it's less labor-intensive to maintain.
In-warehouse transformations enable the creation of derivative tables, called “views,” without altering the source data. This allows organizations to create a repository of record that is immune to changing business needs or upstream schema changes.
Once the data is warehoused, analysts can use SQL to perform transformations at their discretion. Stoppages and failures will no longer cripple the entire data pipeline or consume significant engineering resources.
The simplified, cloud-based nature of an ELT data stack lends itself easily to automation and outsourcing.
The specific activities involved in ELT include detecting and replicating data changes, lightly cleaning and normalizing data, and updating and creating tables. These activities require a deep knowledge of data sources, extensive data modeling and analytics expertise, and the engineering know-how to build robust software systems. Without an automated data integration tool, your team must perform these activities and develop the requisite capabilities.
The main benefits of automated ELT, as with most forms of automation, are savings of time, effort and money. Your data or business intelligence team should focus on providing actionable insights, not on routine, upstream work focused on problems that have already been identified and solved.
Data engineers can leverage the time savings of automated ELT to shift their efforts toward problems impacting external customers, or to pursue higher-value data activities such as machine learning and artificial intelligence. Automated ELT is best thought of as a force multiplier rather than as a replacement for human talent.
Click here to read the next installment in this series!
The excerpt above is from The Essential Guide to Data Integration: How to Thrive in an Age of Infinite Data. The book covers topics such as how data integration fuels analytics, the evolution from ETL to ELT to automated data integration, the benefits of automated data integration, and tips on how to evaluate data integration providers. Get your free copy of the guide today: