Automated data integration depends on several key concepts.
This is the introduction to the Fivetran Architecture Academy series, in which we discuss the technological principles underlying how Fivetran works.
Fivetran is at the forefront of automated data integration. Specifically, we believe that extracting, loading and transforming data should be effortless and involve a minimum of human intervention. This is reflected in the design philosophy of our automated data pipeline.
Automated data integration is inextricably tied with the growth of cloud computing. At the level of hardware, continued improvements in disk space, processing and internet bandwidth have led to plummeting costs since the turn of the millennium. Storing and processing vast amounts of data over a network has never been cheaper.
Cheap and efficient access to computation over the internet has created a profusion of cloud-based services, most notably software-as-a-service (SaaS). In turn, the variety, volume and complexity of data have exploded. This new data is often made available through the API endpoints of SaaS providers. Because all users of a particular SaaS product have access to the same API, there are opportunities for third-party vendors to leverage economies of scale and offer automated data integration solutions.
Companies of all sizes now use dozens of applications that produce data offering valuable insights into their operations.
The growth of the cloud and cloud-based applications has produced a challenge in the variety, volume, and complexity of data, but also an opportunity in the form of cloud-based data integration tools. At Fivetran, we favor a cloud-based modern data stack and have made a number of design choices in pursuit of automated data integration.
The suite of tools and technologies that enable automated data integration is called the modern data stack. The components of the modern data stack include the following:
Data sources – organizations today get data from SaaS applications, operational systems, transactional databases, event trackers, and files in numerous formats.
Automated data pipeline – automated data pipelines are used to move data from sources to data warehouses and data lakes. This task is difficult to implement correctly, and the technical details of extracting and loading data are the focus of this series.
Data warehouse or data lake – in order to draw connections between data from disparate sources, organizations need a platform that offers secure, persistent storage, organized in a manner that is easily accessible to analysts or data scientists. This platform can be relational, designed for structured data (data warehouse) or non-relational and able to accommodate both structured and unstructured data (data lake).
Transformation layer – data at the source is often not organized in a way that is useful for reporting and dashboards. A transformation layer allows analysts to model data as they see fit.
Business intelligence or data science platform – ultimately, the purpose of collecting data is to extract insights to power an organization’s decisions. More sophisticated organizations can use data to power artificial intelligence for automated decision making in operational systems.
An automated data pipeline extracts data from a particular source and then loads it into a data warehouse or data lake. This deceptively simple task is complicated by a number of considerations, which we discuss below.
The sequence of steps performed within this stack as data moves from source to data warehouse to business intelligence platform is extract-load-transform (ELT). We can contrast ELT with the more traditional extract-transform-load (ETL). The older ETL approach, devised to conserve then-scarce machine resources, is brittle because it is prone to failure the moment upstream schemas or downstream data models change. It is also largely obsolete now that computation, storage, and bandwidth are affordable.
By extracting data more or less directly from the source to the destination, ELT enables a fully automated, zero-configuration approach to data pipelines.
Although computation, storage, and internet bandwidth are more affordable than ever, it is still more efficient to extract and load only data records that have been changed since the last update instead of full data sets.
The incremental updates approach contrasts with the obsolescent practice of making periodic, whole dataset snapshots. Using incremental updates, a system can make more frequent and granular updates at a much lower cost. Fivetran uses several strategies to accomplish this, such as reading change logs and using change data capture.
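As an illustration of the incremental approach, the sketch below (hypothetical names; it assumes each source row carries an `updated_at` cursor column and an `id` primary key) extracts and loads only the rows changed since the previous sync:

```python
def incremental_sync(source_rows, destination, state):
    """Load only rows changed since the last sync, tracked by a cursor.

    `state` persists the high-water mark between runs; `destination`
    is modeled as a dict keyed by primary key.
    """
    cursor = state.get("cursor", 0)
    changed = [row for row in source_rows if row["updated_at"] > cursor]
    for row in changed:
        destination[row["id"]] = row  # upsert by primary key
    if changed:
        state["cursor"] = max(row["updated_at"] for row in changed)
    return len(changed)

# First sync loads everything; later syncs pick up only new changes.
dest, state = {}, {}
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
first = incremental_sync(rows, dest, state)   # initial sync: 2 rows
rows.append({"id": 3, "updated_at": 30})
second = incremental_sync(rows, dest, state)  # only the new row
```

In a real pipeline the cursor might instead come from a database change log, but the principle is the same: the cost of each sync is proportional to the volume of changes, not the size of the dataset.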
Data syncs sometimes fail and must be retried. A data pipeline that is idempotent is effectively self-correcting. Idempotence can be represented by the following function:
f(f(x)) = f(x)
That is, applying the operation many times produces the same result as applying it once. In the context of data integration, idempotence allows failed syncs to be retried without producing duplicate records.
At Fivetran, we are fervent evangelists for idempotence in data integration.
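A minimal sketch of the property, assuming loads are modeled as upserts into a dict keyed by primary key:

```python
def load(destination, batch):
    """Idempotent load: upsert by primary key, so replaying the same
    batch (e.g. retrying a failed sync) leaves the destination unchanged."""
    for row in batch:
        destination[row["id"]] = row
    return destination

batch = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
once = dict(load({}, batch))
twice = load(dict(once), batch)  # simulated retry after a failure
assert once == twice             # f(f(x)) = f(x)
```

A naive append-only load would fail this test: retrying it would duplicate every record in the batch.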
Inferring a normalized schema from a data set requires a keen understanding of the underlying data models and behaviors of an application. This schema, represented by an entity-relationship diagram (ERD), communicates a clear understanding of the data to its users.
Normalizing a data set as it is loaded, in turn, prevents duplicate, redundant, and derived values from cluttering the destination. Standardization on the basis of normalized schemas also allows anyone who uses the schema to use templated analytics products.
Schemas can change through the addition, deletion and renaming of columns and tables at the source. Schemas can also change when the data type of a column changes. The ability to automatically detect and accommodate schema changes keeps a data integration system running and prevents downtime.
Traditional data pipelines contain hard-coded column and table names and column data types. These hard-coded references can become invalid the moment names or data types change at the source, leading to failed syncs and broken data pipelines. With an ETL approach, this problem is exacerbated by the even greater number of hard-coded references. A system that can automatically accommodate schema changes, by contrast, ensures that possible breakages are contained to the transformation layer. Such a system is less brittle inasmuch as data is still extracted and loaded.
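To make the contrast concrete, here is a sketch (hypothetical names; schemas modeled as dicts mapping column name to type) of a migration step that accommodates columns added at the source instead of failing on them:

```python
def migrate_schema(dest_schema, source_schema):
    """Add columns that newly appeared at the source to the destination.

    In a real warehouse this would issue ALTER TABLE ... ADD COLUMN;
    columns deleted at the source are deliberately left in place.
    """
    added = []
    for column, dtype in source_schema.items():
        if column not in dest_schema:
            dest_schema[column] = dtype
            added.append(column)
    return added

dest = {"id": "INTEGER", "email": "VARCHAR"}
source = {"id": "INTEGER", "email": "VARCHAR", "plan": "VARCHAR"}
added = migrate_schema(dest, source)  # detects the new "plan" column
```

Because the loader adapts to the source rather than asserting a fixed schema, a renamed or added column never halts extraction and loading.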
Past data can be highly valuable. In order to have a true data repository of record, your data pipeline should feature a net-additive approach to data integration, specifically the capability to revisit past versions of rows, columns, and tables.
The ability to revisit past values in rows is handled by a concept called “history tables.” With respect to columns and tables, net-additive data integration overlaps significantly with automatic schema migration. In particular, the concept of lossless replication means that columns and tables deleted at the source remain in the destination. Net-additive data integration is essential to prevent the loss of valuable information.
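One common way to implement history tables is slowly-changing-dimension-style versioning. The sketch below (a hypothetical structure, not Fivetran's actual implementation) closes out the current version of a row and appends the new one, so no past value is ever overwritten:

```python
from datetime import date

def record_version(history, key, attrs, effective):
    """Append-only history: mark the current version of `key` as expired
    and add the new version, preserving every past value."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = effective  # close out the old version
    history.append({"key": key, **attrs,
                    "valid_from": effective, "valid_to": None})

history = []
record_version(history, 1, {"plan": "free"}, date(2020, 1, 1))
record_version(history, 1, {"plan": "pro"}, date(2020, 6, 1))
# Both versions remain queryable; the old row simply gains a valid_to.
```

Querying for the row where `valid_to` is null yields the current state, while the full table preserves the record of every past value.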
It is not always obvious what data type should be applied to a field from a raw data feed, and the matter is further complicated by the differing data types supported by different databases. It is essential for a data pipeline to feature a hierarchy of data types so that it can automatically infer appropriate types and coerce values between them.
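As a sketch of what such a hierarchy might look like (an illustrative four-type lattice, not Fivetran's actual implementation), each raw value is assigned the narrowest type that can hold it, and a mixed column is promoted to the widest type among its values:

```python
# An illustrative widening hierarchy, narrowest to widest.
HIERARCHY = ["BOOLEAN", "INTEGER", "FLOAT", "STRING"]

def infer_type(raw):
    """Narrowest type in the hierarchy that can represent a raw string."""
    if raw.lower() in ("true", "false"):
        return "BOOLEAN"
    for cast, name in ((int, "INTEGER"), (float, "FLOAT")):
        try:
            cast(raw)
            return name
        except ValueError:
            pass
    return "STRING"

def promote(a, b):
    """Widest of two types; used to coerce a mixed column."""
    return max(a, b, key=HIERARCHY.index)

# A column mixing integers and a decimal is promoted to FLOAT.
col_type = "BOOLEAN"
for raw in ["1", "2", "3.5"]:
    col_type = promote(col_type, infer_type(raw))
```

Because every pair of types has a well-defined widest common type, the pipeline can always pick a destination type without human intervention.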
Aside from architectural considerations specific to data pipelines, there are also more general considerations around reliability, scalability and security. Fivetran meets these needs through a number of approaches, including best-in-class cloud platforms and technologies, process and fault isolation, and excluding or encrypting sensitive data fields.
Building a data integration solution is complex, requiring forethought and a principled approach. In the Fivetran Architecture Academy series, we will discuss the considerations listed above, and more, in order to demonstrate the first principles, care and foresight that we applied to building a leading data integration solution.