A data pipeline is a series of actions that combine data from multiple sources for analysis or visualization.
In today’s business landscape, making smarter decisions faster is a critical competitive advantage. Companies want their employees to make data-driven decisions, but harnessing timely insights from your company’s data can seem like a headache-inducing challenge.
The volume of data, and of data sources, is growing every day across on-premises systems, SaaS applications, databases and other external sources. How do you bring the data from all of these disparate sources together? Data pipelines.
Simply put, a data pipeline is a set of actions and technologies that route raw data from a source to a destination. Data pipelines are sometimes called data connectors.
Data pipelines consist of three components: a source, a data transformation step and a destination.
A data source might include an internal database such as a production transactional database powered by MongoDB or PostgreSQL; a cloud platform such as Salesforce, Shopify or MailChimp; or an external data source such as Nielsen or Qualtrics.
Data transformation can be performed using tools such as dbt or Trifacta, or can be built manually using a mix of technologies such as Python, Apache Airflow and similar tools. These tools are mostly used to make data from external sources relevant to each unique business use case.
Destinations are the repositories in which data is stored once extracted, such as data warehouses or data lakes.
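These three components can be illustrated with a short, self-contained sketch. The record fields, function names and in-memory "warehouse" below are hypothetical stand-ins for a real database source, transformation step and destination:

```python
def extract():
    """Source: pretend these rows came from a production database."""
    return [
        {"id": 1, "email": "ada@example.com"},
        {"id": 2, "email": "GRACE@EXAMPLE.COM"},
    ]

def transform(rows):
    """Transformation: normalize fields so they are analysis-ready."""
    return [{**row, "email": row["email"].lower()} for row in rows]

def load(rows, warehouse):
    """Destination: append the rows to a table in the 'warehouse'."""
    warehouse.setdefault("customers", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["customers"][1]["email"])  # grace@example.com
```

In production, extract() would query a database or call a SaaS API, and load() would write to a warehouse such as Snowflake, but the shape of the flow stays the same.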
Let’s look at an applied example. Data pipelines make it possible for data to flow out of separate applications and platforms such as Facebook Ads, Google Analytics and Shopify into a common destination. If a customer experience analyst is trying to make sense of these data points to understand the effectiveness of an ad, they need a data pipeline to manage the transfer and normalization of data from these disparate sources into a data warehouse such as Snowflake.
Simply put, any time data is processed between two points, a data pipeline is involved.
Data pipelines can also feed data from a data warehouse or data lake back into operational systems, such as a customer experience platform like Qualtrics (although that is beyond the scope of this discussion).
To summarize, data pipelines enable you to centralize data from disparate sources into one place for analysis. You can get a more robust view of your customers, create consolidated financial dashboards and more. Data pipelines also bring the advantage of ensuring consistent data quality, which is critical for reliable business intelligence. Data engineers can build pipelines themselves by writing code and manually interfacing with source applications and databases. Alternatively, they can stop reinventing the wheel and use an automated data pipeline tool instead.
Many companies are modernizing their data infrastructure by adopting a suite of cloud-native tools called the modern data stack. Automated data pipelines are a key component of the modern data stack and enable businesses to embrace new data sources and improve business intelligence.
The modern data stack consists of:
An automated data pipeline tool such as Fivetran
A cloud data warehouse or data lake as a destination, such as Snowflake
A post-load transformation tool such as dbt (also known as data build tool, by dbt Labs, formerly Fishtown Analytics)
Data pipelines enable the transfer of data from a source platform to a destination, where the data can be consumed by analysts and data scientists and turned into valuable insights for the business.
To make sense of how a data pipeline works, let’s look at the anatomy of data extraction and placement from source to destination.
Consider the case of running shoe manufacturer ASICS. The company needed to integrate data from NetSuite and Salesforce Marketing Cloud into Snowflake to gain a 360° view of its customers.
To do so, the ASICS data team looked at its core application data, in this case from the ever-popular Runkeeper app, and combined loyalty program signup data with data from other attribution channels. With a data pipeline, ASICS was able to scale its data integration easily.
There are many variations on this workflow, depending on the business use case and the destination of choice.
The basic steps of data transfer include:
Sources can include production databases such as MySQL, MongoDB and PostgreSQL, and SaaS applications such as Salesforce and MailChimp. A data pipeline reads from each source’s API endpoints at scheduled intervals.
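This scheduled, incremental read pattern can be sketched as follows. The stubbed fetch_page() stands in for a real HTTP request; the endpoint, field names and cursor logic are illustrative assumptions, not any particular vendor’s API:

```python
def fetch_page(updated_since):
    """Stub for a call like GET /records?updated_since=... on a source API."""
    records = [
        {"id": 1, "updated_at": 100},
        {"id": 2, "updated_at": 250},
        {"id": 3, "updated_at": 300},
    ]
    return [r for r in records if r["updated_at"] > updated_since]

def sync(cursor):
    """One scheduled run: pull only records changed since the cursor."""
    new_rows = fetch_page(cursor)
    if new_rows:
        cursor = max(r["updated_at"] for r in new_rows)
    return new_rows, cursor

rows, cursor = sync(cursor=0)  # first run: full sync, cursor advances to 300
rows, cursor = sync(cursor)    # next scheduled run: nothing new to pull
print(len(rows), cursor)  # 0 300
```

Tracking a cursor this way is what lets a pipeline re-read a source on a schedule without transferring the same records twice.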
Data professionals need structured, accessible data that their business partners can interpret. Data transformation enables practitioners to alter data and its format to make it relevant and meaningful to their specific business use case.
Data transformation can take many shapes, as in:
Constructive: adding, copying or replicating data
Destructive: deleting fields, records or columns
Aesthetic: standardizing salutations, street names, etc. (aka, data cleansing)
Transformations make data well-formed and well-organized — easy for humans and applications to interpret. A data analyst may use a tool such as dbt to standardize, sort, validate and verify the data brought in from the pipeline.
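As an illustration, here is how those three shapes might look applied to a single raw record in plain Python (the field names are hypothetical):

```python
raw = {"first_name": "ada", "ssn": "000-00-0000", "street": "123 Main st."}

row = dict(raw)
row["full_name"] = row["first_name"].title()         # constructive: add a field
del row["ssn"]                                       # destructive: drop a field
row["street"] = row["street"].replace("st.", "St.")  # aesthetic: standardize
print(row)  # {'first_name': 'ada', 'street': '123 Main St.', 'full_name': 'Ada'}
```

Tools like dbt express the same kinds of operations as SQL models that run inside the warehouse rather than as application code.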
As with anything in the technology world, things break, data flows included. When your data analytics and business intelligence operations rely on data extracted from various sources, you want your data pipelines to be fast and, more importantly, highly reliable. But data pipelines have a parent-child relationship with processes up the chain. When you’re ingesting external sources such as Stripe, Salesforce or Shopify, API changes may result in deleted fields and therefore broken data flows.
Traditionally, the data pipeline process consisted of extracting and transforming data before loading it into a destination — also known as ETL. This sequence made sense in the past, when companies had to work within the constraints of on-premises hardware. The goal was to keep data small and summarized through transformations before loading to conserve precious data warehouse resources. But the extract-transform-load sequence introduces a series of disadvantages and constraints that you will want to avoid today.
For one, the process is inherently inflexible — transforming data before loading it into the output destination forces analysts to predict every use of the data before a report is ever created. Every change is costly and potentially affects data and code downstream of the initial extraction.
Furthermore, every transformation performed on the data obscures some of the underlying information. This is risky, as anyone familiar with the concept of binning or Simpson’s Paradox knows — it’s dangerous to draw conclusions from data that hasn’t been properly sliced.
Moreover, building a data pipeline is often beyond the technical capabilities (or desires) of analysts. It typically necessitates the close involvement of IT and engineering talent, along with bespoke code to extract and transform each source of data. The ETL approach was once necessary because of the high costs of on-premises computation and storage. With the rapid growth of cloud-based options and the plummeting cost of cloud-based computation and storage, there is little reason to continue this practice. Today, it’s possible to maintain massive amounts of data in the cloud at a low cost and use a SaaS data pipeline tool to improve and simplify your data analytics.
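A compact way to see the ELT ordering is to load raw data first and transform it inside the destination afterward. In this sketch, an in-memory SQLite database stands in for a cloud warehouse such as Snowflake, and the table and column names are illustrative:

```python
import sqlite3

# Load first: raw rows land in the destination untransformed, so no
# underlying information is lost to pre-load summarization.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 1999), (2, 4500)])

# Transform after loading, inside the warehouse, as a view over raw data.
# Analysts can add or change views later without re-extracting anything.
conn.execute("""
    CREATE VIEW orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd FROM raw_orders
""")
total = conn.execute(
    "SELECT ROUND(SUM(amount_usd), 2) FROM orders"
).fetchone()[0]
print(total)  # 64.99
```

Because raw_orders is kept intact, a new question about the data only requires a new query, not a new extraction.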
We’ve established how crucial data pipelines are to your company’s analytics efforts. How do you decide whether it makes more sense to build one in-house or buy a third-party product?
Data engineers are scarce and expensive — incredibly valuable to their organizations. In many companies, engineers are charged with building and maintaining data pipelines to extract, transform and load data. Even dedicated ETL tools require teams of high-value staff to maintain and configure.
By definition, a tool should "aid in accomplishing a task." However, more often than not, classic data pipeline (ETL) tools drain the time and enthusiasm of data engineers. Rather than enabling data engineers to thrive, they demand maintenance and attention in a manner akin to leaking pipes — pipes down which companies pour money — with little to show in return.
Consider the cost of building your own data pipeline connectors. Cost varies across region and salary scale, but you can make some quick calculations of the engineering time spent building and maintaining connectors and the total cost of ownership for your organization. The monetary cost will be significant, to say nothing of the costs imposed by downtime.
Data engineers would rather focus on higher-level projects than moving data from point A to point B. Let’s be honest, data engineers offer a bridge between data and the business that no other group can offer. With data transfer (or data pipelining) handled, data engineers are free to catalog data for internal stakeholders and be the bridge between analysis and data science.
As Jeff Magnusson, VP of Data Platforms at Stitch Fix, puts it:
“A common fear of engineers in the data space is that, regardless of the job description or recruiting hype you produce, you are secretly searching for an ETL engineer. In case you didn't realize it, nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato. It really shouldn’t come as a surprise, then, that ETL engineering roles are the archetypal breeding ground of mediocrity.”
A data pipeline tool monitors data sources for changes of any kind and can automatically adjust the data integration process without involving developers. Data pipeline automation is therefore the most effective way to reduce the burden on engineers and free data analysts and data scientists to work with data, and any cloud modernization strategy should include a plan for it.
When considering data pipeline products, look for a tool that combines these best practices for pipeline orchestration and automation:
It assesses data source structures to discover the source schemas.
It provides a healthy inventory of preconfigured connectors that can ingest data from a wide variety of files, databases, event streams, cloud-native services and applications.
It maintains an inventory of discovered schemas and can automatically detect schema changes in the data source.
It fully embraces the extract/load/transform (ELT) approach of data integration and transformation.
It fully manages data normalization, automating the production of data pipelines and producing analysis-ready data assets.
It can monitor for changes to data sources to identify updates and schema modifications and automatically adjust to those changes.
At the same time, it can leverage elastic cloud computing resources to maintain integration at scale.
It is robust and fault-tolerant in providing automated recovery from failure.
It satisfies data protection requirements and provides end-to-end encryption to prevent unauthorized use.
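Schema-change detection, the practice behind several of the points above, boils down to comparing the columns a source reports now against the schema recorded on the last sync. A simplified sketch, with hypothetical column names:

```python
def diff_schema(known, discovered):
    """Return columns the source added and columns it dropped."""
    added = [c for c in discovered if c not in known]
    removed = [c for c in known if c not in discovered]
    return added, removed

known = ["id", "email", "created_at"]
discovered = ["id", "email", "created_at", "plan_tier"]  # vendor added a column

added, removed = diff_schema(known, discovered)
print(added, removed)  # ['plan_tier'] []
```

A pipeline tool that runs a check like this on every sync can add the new column at the destination automatically, instead of silently dropping data or failing until a developer intervenes.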
Fivetran automated data connectors are prebuilt and preconfigured, and support 150+ data sources, including databases, cloud services and applications. Fivetran connectors automatically adapt as vendors make changes to schemas by adding or removing columns, changing a data element’s type, or adding new tables. Lastly, our pipelines manage normalization, create ready-to-query data assets for your enterprise, and automatically recover from failure.