Learn how to get your organization started with automated data integration.
We left off at the end of Part 1 with the topic of outsourcing and automating ELT. Here, in Part 2, we will discuss building vs. buying an ELT pipeline and conclude with a practical guide to assembling the suite of tools you will need for your own data stack.
Based on our experiences at Fivetran, we’ve put together some optimistic back-of-the-envelope calculations.
The first figure you’ll need is the average yearly salary for whoever performs data integration at your organization. This could be a data scientist, data analyst, or data engineer.
Let’s assume an annual salary of $120,000.
Apply a multiplier of 1.4 to the salary in order to arrive at the cost of labor. You must account for benefits and other overhead costs in addition to salary. If you lowball the cost of a data engineer at $120,000, then the total cost of labor is $120,000 * 1.4 = $168,000.
The next figure you’ll need is the number of connectors you will use. For the sake of argument, let’s imagine that you’re using a fairly typical complement of five connectors.
With these figures, you can estimate the time and money spent on engineering.
Assume, optimistically, that it takes about 7 weeks to build a connector and about 2 weeks per year to update and maintain it. So, each connector takes about 9 weeks of work per year.
Let’s say you have five connectors. 5 * (7 + 2) = 45 weeks of work per year.
Use the weeks of work per year to calculate the fraction of a working year this will account for. Then, multiply it by the cost of labor to arrive at the total monetary cost. Assume that the work year lasts 48 weeks once you account for vacations, paid leave, and other kinds of downtime.
If the cost of labor is $168,000, 5 connectors take 45 weeks of work, and there are 48 working weeks in a year, then ($168,000) * (45 / 48) = $157,500.
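The arithmetic above can be sketched as a quick script. All of the figures are the illustrative assumptions from the text, not industry constants:

```python
# Back-of-the-envelope cost of building and maintaining connectors in-house.
# Every figure below is an illustrative assumption from the text above.

SALARY = 120_000           # average annual salary for data integration work
OVERHEAD_MULTIPLIER = 1.4  # benefits and other overhead on top of salary
BUILD_WEEKS = 7            # weeks to build one connector
MAINTAIN_WEEKS = 2         # weeks per year to update and maintain one connector
CONNECTORS = 5             # a fairly typical complement of connectors
WORK_WEEKS_PER_YEAR = 48   # working year after vacations and other downtime

labor_cost = SALARY * OVERHEAD_MULTIPLIER                    # $168,000
weeks_of_work = CONNECTORS * (BUILD_WEEKS + MAINTAIN_WEEKS)  # 45 weeks
engineering_cost = labor_cost * weeks_of_work / WORK_WEEKS_PER_YEAR

print(f"${engineering_cost:,.0f}")  # $157,500
```

Plugging in your own salary, overhead, and connector figures will quickly show how sensitive the total is to each assumption.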
Of course, many analysts, engineers, and data scientists cost far more than $120,000 a year, and you are likely to end up using far more than five data sources. Many data sources also feature extremely complicated data sets that will take more than nine weeks per year to build and maintain.
By contrast, an automated solution requires no more than a few minutes of setup, plus the time it takes for the syncs to run.
The question of building vs. buying is fundamentally a matter of opportunity cost.
Building data pipelines consumes time, money and talent. It’s work that analysts, data scientists, and engineers try to avoid. Data and software professionals aspire to higher-value activities, such as creating reports and dashboards, building predictive models, and creating the infrastructure to productionize reports, dashboards and predictive models.
It can take a very long time to learn the idiosyncrasies of each data source and its underlying data model, and to identify the various corner cases that can cause a pipeline to fail. Moreover, your data integration needs will only grow in complexity over time as you continue to add new data sources. Recall the Blissfully report – your organization is likely to eventually use dozens or hundreds of data sources. You don’t want data from those sources to become bottlenecked by scarce engineering time.
Now that we’ve made the case that buying is better than building, we should review the desired outcomes from choosing an automated data integration tool.
An automated data integration solution can serve a number of goals. Carefully consider which of these are most important to your organization.
Time, money and labor savings – A modern data stack should dramatically reduce your data engineering costs, mainly by eliminating the need to build and maintain data connectors and normalize data. One of our customers, Ignition Group, a South African technology consulting firm, saved $425,000 in data engineering costs. Crossmedia, a New York media agency, gained back 160 hours a week of engineering time.
Expanded capabilities – By increasing data sources without consuming engineering or analyst resources, a modern data stack (MDS) should expand the capabilities of your data team. Docusign tripled the number of data sources it used by automating its data stack.
Successful execution of new data projects, such as customer attribution models – More time and data mean your team will be able to focus on new analytics projects. Pleo used an MDS to create a multi-touch attribution dashboard and better understand revenue sources. Chubbies implemented a modern data stack to combine seven sources of customer data and visualize it, gaining insight into ad spend by channel.
Reduced turnaround time for reports – A modern data stack should dramatically shorten report generation time, ensuring up-to-date reports. ALM generated reports 7x faster by automating the ELT process. Similarly, Bringg reduced the turnaround time for reports by 90%.
Reduced data infrastructure downtime – A modern data stack should dramatically improve reliability and eliminate your maintenance burden. After implementing a modern data stack, Ritual reduced data pipeline maintenance issues by 95%.
Greater business intelligence usage – By combining automated data integration with a modern, intuitive BI tool, a modern data stack should promote data access and literacy across your organization. Before Falcon.io transitioned to an MDS, only its small sales team routinely consulted analytics. After adopting an MDS, it saw active users of BI dashboards increase 10x.
New available and actionable metrics – With additional data sources and an easy-to-use BI tool, a modern data stack should significantly boost the number of metrics used in decision-making. With an MDS, OutSystems increased actionable KPIs from 15 to 60 and improved data access company-wide; the entire organization now participates in defining key metrics. After Zoopla adopted an MDS, it created a KPI overview for its leadership team featuring 40 actionable KPIs.
Your data warehouse will be the repository of record for your organization’s structured data. Different data warehouses offer different features and tradeoffs. Here are the nine criteria you should focus on:
Centralized vs. decentralized data storage – Does the data warehouse store all of its data on one machine, or distribute it across multiple machines for redundancy and performance?
Elasticity – Can the data warehouse scale compute and storage resources up and down quickly? Are compute and storage independent from each other or coupled together?
Concurrency – How well does the data warehouse accommodate multiple simultaneous queries?
Load and query performance – How quickly can you complete typical loads and queries?
Data governance and metadata management – How does the data warehouse handle permissions and regulatory compliance?
SQL dialect – Which dialect of SQL does the data warehouse use? Does it support the kinds of queries you want to make? Will your analysts have to adjust the syntax they currently use?
Backup and recovery support – If your data warehouse somehow gets corrupted or breaks, can you easily revert to a previous state?
Resilience and availability – What about preventing database failures in the first place?
Security – Does the data warehouse follow current security best practices?
Business intelligence tools enable you to easily build reports and dashboards, but different tools have different strengths and weaknesses. Here are the key factors to consider:
Seamless integration with cloud data warehouses – Is it easy to connect this BI tool to your cloud data warehouse of choice?
Ease of use and drag-and-drop interfaces – Ease of use is especially important to popularizing data-driven decisions across your organization.
Automated reporting and notifications – Writing reports by hand can get tedious. Does the BI tool allow you to schedule reports to publish automatically? What about alerting users when the data changes?
Ability to conduct ad hoc calculations and reports by ingesting and exporting data files – Your analysts and data scientists might sometimes want to explore data without the overhead of having to go through a data warehouse first.
Speed, performance and responsiveness – Basic quality-of-life considerations are important, like dashboards and visualizations loading in a timely manner.
Modeling layer with version control and development mode – Does the BI tool allow your analysts to work collaboratively by sharing data models and code?
Extensive library of visualizations – Pie charts, column charts, trendlines and other basic visualizations can only take you so far. Does the BI tool feature more specialized visualizations like heat maps or radar charts? Does it allow you to build your own custom visualizations?
Make sure any data warehouses and BI tools you evaluate are compatible with each other. It also pays to carefully review a range of perspectives on different tools. Research firms like Gartner often aggregate such information. Read before you leap!
There are many data integration tools in the market, and their technical approaches and feature sets vary significantly. Here are the foremost factors to consider when choosing a data integration tool:
Data connector quality. Take these factors into account when evaluating connector quality:
Open-source vs. proprietary. Open-source connectors exist for a wider range of data sources, but proprietary connectors tend to be of higher quality and integrate more seamlessly with the other elements of a data stack.
Standardized schemas and normalization. Data from API feeds is not usually provided in normalized form, but normalization fosters data integrity and lends itself to schema standardization.
Incremental vs. full updates. Incremental updates using logs or other forms of change detection allow for more frequent updates that do not interfere with business operations.
Support for sources and destinations. Does your prospective tool support your sources and destinations? If not, does the provider offer a way for customers to suggest new sources and destinations? Do they routinely add new ones?
Configuration vs. zero-touch. Zero-touch, fully managed tools are extremely accessible, with connectors that are standardized, stress-tested and maintenance-free. Configurable tools require engineers who have experience with scripting languages, orchestration and software development.
Automation. Integration tools should remove as much manual intervention and effort as possible. Consider whether a tool offers features like automated schema migration, automatic adjustment to API changes, and continuous sync scheduling.
Transforming within vs. before the data warehouse. Avoid tools that use a traditional ETL (extract-transform-load) approach. Under a more modern approach known as ELT (extract-load-transform), transformations are performed in an elastic, cloud-based warehouse, allowing limitless resource scaling.
Recovery from failure. Bugs and errors will inevitably crop up over the course of data integration; you don’t want to permanently lose data as a result. Find out whether your prospective tools are idempotent and perform net-additive integration.
Security and compliance. These are key areas, both in terms of data protection and public perception. Specifically, learn whether prospective tools offer:
Limited data retention
Column blocking and hashing
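Two of the properties named above, idempotency and net-additive integration, are easiest to see in a sketch. The following is a minimal, hypothetical illustration of a cursor-based incremental sync; the field names and the in-memory "destination table" are stand-ins, not any vendor's actual API:

```python
# Minimal sketch of an idempotent, net-additive incremental sync.
# The cursor field, row shape, and destination table are hypothetical.

destination = {}    # destination table, keyed by primary key
last_cursor = None  # high-water mark persisted between sync runs

def sync(fetch_changes):
    """Apply source changes at or after the saved cursor.

    Upserting by primary key makes the sync idempotent: replaying the
    same batch after a failure leaves the destination in the same state.
    Rows are only added or updated, never destructively rewritten, so
    the integration is net-additive.
    """
    global last_cursor
    for row in fetch_changes(since=last_cursor):
        destination[row["id"]] = row     # upsert by primary key
        last_cursor = row["updated_at"]  # advance the high-water mark

# Simulate a retry after a mid-sync failure: the same batch arrives twice.
batch = [
    {"id": 1, "name": "Ada",   "updated_at": "2024-01-01"},
    {"id": 2, "name": "Grace", "updated_at": "2024-01-02"},
]
sync(lambda since: batch)
sync(lambda since: batch)  # replay is harmless: same final state
print(len(destination))    # 2
```

A pipeline built this way can safely re-run any failed sync from its last cursor without duplicating or losing data, which is the recovery behavior to look for when evaluating tools.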
Many publications offer aggregate reviews and ratings of data integration tools, as they do for data warehouses and business intelligence tools. Be sure to comparison-shop. Make sure the different elements of your data stack – data integration, data warehouse and BI tool – are compatible with each other.
Hopefully, this presentation has given you a strong sense of what modern, automated data integration is all about. To summarize what we’ve talked about in a few sentences:
Analytics is the end goal of data integration. Analytics is fundamentally about drawing valuable insights out of data. This can take the form of reports, dashboards, recommendations and predictive models. The most sophisticated uses of data include systems and machines that can automatically and intelligently make data-driven decisions, namely artificial intelligence and machine learning. Data integration consists of all of the activities that enable analytics. Fundamentally, this includes extracting data from sources, loading it into a repository of record, and transforming it into data models that are usable by analysts.
Data integration can be performed through ETL or ELT. ETL is the traditional, labor-intensive approach, in which transformation is performed before the data is loaded into its destination. The whole purpose of this setup is to minimize the use of computation and storage, and it dates from a time when those things were very expensive. ELT is the more modern approach that leverages the affordability of computation, storage and internet bandwidth. By staging the transformation after the data has arrived in the destination, you remove a lot of brittleness from the system. It allows your transformations, in essence, to be much more modular.
Build vs. Buy – In the overwhelming majority of cases, it makes more sense in terms of time, money, morale and overall efficacy to buy a data integration solution rather than build one. There are some corner cases where regulatory or security compliance or highly specific performance requirements might make building necessary.
As you start assembling your modern data stack, evaluate your desired outcomes and carefully choose each element of the stack. Expect to encounter a number of tradeoffs as you evaluate those tools.