In this excerpt from The Essential Guide to Data Integration, we discuss the technical considerations for choosing a data integration tool.
The following blog post is an excerpt from the book, The Essential Guide to Data Integration: How to Thrive in an Age of Infinite Data. The rest of the book is available to you for free here.
Continued from Top Business Considerations for Choosing a Data Integration Tool.
The basic component of every ELT data pipeline is the data connector. A data connector ingests data from an API or database log and then loads it into a data warehouse. When evaluating data connector quality, consider the following:
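To make this concrete, here is a minimal sketch of what a connector does: extract records from a source and load them into a warehouse. The payload and table are invented for illustration, and SQLite stands in for a real cloud data warehouse; a production connector would page through an HTTP API or tail a database log.

```python
import json
import sqlite3

def extract(api_response: str) -> list[dict]:
    """Parse a raw API payload into rows (here, a JSON array of objects)."""
    return json.loads(api_response)

def load(rows: list[dict], conn: sqlite3.Connection, table: str) -> None:
    """Load rows into a warehouse table (SQLite stands in for the warehouse)."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, email TEXT)")
    # Upsert so that re-running a sync does not duplicate rows.
    conn.executemany(
        f"INSERT OR REPLACE INTO {table} (id, email) VALUES (:id, :email)", rows
    )
    conn.commit()

# Simulated API payload; a real connector would fetch this over the network.
payload = '[{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]'
conn = sqlite3.connect(":memory:")
load(extract(payload), conn, "users")
```

Everything beyond this skeleton — authentication, pagination, retries, schema handling — is where connector quality is actually decided.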
Open-source vs. proprietary. There is a trade-off between voluntary, crowd-sourced talent and dedicated professional attention. Overall, there are more open-source connectors for a wider range of data sources, but proprietary connectors tend to be of higher quality and integrate more seamlessly with other elements of a data stack.
Standardized schemas and normalization. Data from API feeds is not usually provided in normalized form. Normalization fosters data integrity by eliminating redundancy and establishing clear, consistent relations between tables. Since there are only a few ways to normalize a data set, normalization also lends itself to schema standardization, which introduces economies of scale that benefit all users.
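As a simplified illustration of what normalization does, consider an API that returns orders with the customer embedded in every record. A normalizing connector splits these into separate customer and order tables, removing the repetition; the field names below are hypothetical.

```python
def normalize(raw_orders: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split denormalized API records into separate customer and order
    tables, eliminating the customer data repeated on every order."""
    customers, orders = {}, []
    for rec in raw_orders:
        cust = rec["customer"]
        # Keyed by id, so each customer appears exactly once.
        customers[cust["id"]] = {"id": cust["id"], "name": cust["name"]}
        orders.append({"id": rec["id"], "customer_id": cust["id"], "total": rec["total"]})
    return list(customers.values()), orders

raw = [
    {"id": 100, "total": 25.0, "customer": {"id": 1, "name": "Ada"}},
    {"id": 101, "total": 40.0, "customer": {"id": 1, "name": "Ada"}},  # same customer repeated
]
customers, orders = normalize(raw)
```

Because there are only so many sensible ways to perform this split for a given API, every customer of the same connector can share one standardized schema.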
Incremental vs. full updates. An initial sync will require querying either the full data set or a large subsection of it, but subsequent updates should not. Does the connector update incrementally using logs or other forms of change detection, or does it query full data sets every time it syncs? Frequent full replications of operational databases introduce the additional danger of interfering with critical business operations.
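The usual mechanism for incremental updates is a cursor: the connector remembers a high-water mark from the last sync and fetches only rows changed since then. The sketch below assumes a numeric `updated_at` field as the cursor; real connectors more often read database logs or API change feeds.

```python
def incremental_sync(source_rows: list[dict], state: dict) -> tuple[list[dict], dict]:
    """Fetch only rows changed since the last sync.
    state["cursor"] holds the high-water mark from the previous run."""
    cursor = state.get("cursor", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    if new_rows:
        state["cursor"] = max(r["updated_at"] for r in new_rows)
    return new_rows, state

source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
state = {}
first, state = incremental_sync(source, state)   # initial sync: everything
second, state = incremental_sync(source, state)  # nothing changed: nothing fetched
```

The second call touches no data at all, which is exactly why incremental syncs avoid burdening operational databases.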
Different data pipeline tools will support different data sources and data warehouses. Make sure the tool you are evaluating supports those you need. If not, does the provider offer a way for customers to suggest new sources and destinations? Do they routinely add new ones?
You may also need to integrate data from obscure data sources that are unsupported by a standard off-the-shelf connector. Does the tool support cloud-based functions that allow you to combine custom connectors written by your engineers with the rest of your infrastructure? Does the tool support ad-hoc loading and warehousing of data from files?
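For a sense of what such a custom connector might look like, here is a sketch of an ad-hoc file handler: it turns a CSV file into warehouse-ready rows in a standard format. In a cloud-function deployment, a handler like this (the name and interface are invented) would receive the file from object storage and hand its rows to the rest of the pipeline.

```python
import csv
import io

def file_connector(file_contents: str) -> list[dict]:
    """Ad-hoc connector: turn CSV file contents into warehouse-ready rows.
    A cloud-function wrapper would fetch the file and call this handler."""
    reader = csv.DictReader(io.StringIO(file_contents))
    return [dict(row) for row in reader]

rows = file_connector("id,email\n1,a@example.com\n2,b@example.com\n")
```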
Highly customizable and configurable tools require engineers who are proficient in scripting languages, highly experienced with orchestration, and good at building robust software. Engineers must also deeply understand each individual data source, or collaborate closely with analysts in order to explore, understand and model the data. Designing a good schema is difficult and can be as much art as science.
In addition, users must correctly configure and maintain the data integration software. This includes reconfiguring pipelines whenever downstream business needs and upstream data sources change. This approach is best suited to organizations with deep technical talent pools that actively want to pursue these challenges, and are supremely confident in their ability to deliver good results.
By contrast, zero-touch, fully managed tools are extremely accessible due to their set-and-forget nature. The connectors are standardized, stress-tested and maintenance-free for the user. Maintenance and future iterations of the connectors become the service obligation of experts who fully understand every idiosyncrasy of the underlying data.
The zero-touch approach also allows transformations to be scheduled and performed by analysts using SQL. This makes it far more appropriate for companies that don’t have access to a deep pool of top-shelf engineering talent for pipelines, and want to use their engineering talent for other high-value projects.
The purpose of modern data integration tools is to remove as much manual intervention and effort from the process as possible. To that end, consider the following labor-saving tools and features of automation:
API. It can be extremely helpful to programmatically control the tool, so that administrative functions and other chores can be performed automatically rather than by hand.
Handling data type changes. Upstream schema changes can alter the type of a particular value, e.g., from integer to float. An automated tool must be able to reconcile old and new data types without human intervention.
Continuous sync scheduling. Data from these connectors should either stream into your data warehouse or sync at short, regular intervals. Determine how often your organization needs data to be updated and set it and forget it.
Automatic schema migrations. Does the connector automatically accommodate schema changes with a minimum of disruption to anything downstream, i.e., without deleting any tables or fields? Does the connector avoid full re-syncs whenever possible?
General performance. Finally, consider a number of characteristics that will determine possible downtime for your system, including:
How long does an initial sync take?
Is the data updated incrementally, or is a full sync required every time?
Which conditions trigger a full sync?
How often is the data updated, and how does it match your needs?
Does it stream live? Every few minutes? Once a day?
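Two of the automation features above — reconciling type changes and migrating schemas non-destructively — can be sketched together. The type hierarchy and column bookkeeping below are simplified assumptions; the point is that the tool widens types and adds columns rather than deleting anything or forcing a full re-sync.

```python
def reconcile_type(old_type: str, new_type: str) -> str:
    """Widen a column type so old and new values both fit,
    rather than failing the sync. Simplified type hierarchy."""
    hierarchy = ["integer", "float", "text"]  # text can represent anything
    return max(old_type, new_type, key=hierarchy.index)

def migrate_schema(table_cols: dict, incoming_row: dict) -> dict:
    """Accommodate upstream changes additively: widen changed types and
    add new columns, never dropping existing tables or fields."""
    for col, val in incoming_row.items():
        new_type = {int: "integer", float: "float"}.get(type(val), "text")
        if col in table_cols:
            table_cols[col] = reconcile_type(table_cols[col], new_type)
        else:
            table_cols[col] = new_type  # new upstream field: add, don't rebuild
    return table_cols

cols = {"id": "integer", "amount": "integer"}
# Upstream now sends a float amount and a brand-new "source" field.
cols = migrate_schema(cols, {"id": 1, "amount": 9.5, "source": "api"})
```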
Downtime and infrastructure costs might not be included in a pipeline tool’s formal pricing structure, but they can become a significant burden on your organization if syncs are not performed efficiently.
Under ELT, transformations are performed in an elastic, cloud-based data warehouse. Elasticity allows compute and storage resources to be scaled up and down as needed. This eliminates the need to forecast hardware requirements and buy excess capacity.
By contrast, ETL requires a data architecture with an additional stage in the data stack to handle transformations before the data is loaded. The data warehouse itself can constrain the volume of data that is loaded, making transformations necessary in order to limit the volume and flow of data.
Transformations performed within the data warehouse are non-destructive — that is, they allow the underlying data to remain completely undisturbed while additional tables featuring the desired models are created. Defunct values can be flagged without being deleted (net-additive integration). This means that failed transformations have no permanent consequences and can be repeatedly attempted (idempotence).
Transformations within the data warehouse can be written in SQL, making them accessible to analysts. Analysts can also adjust models to changing business requirements without losing any data.
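A minimal sketch of such a non-destructive, idempotent transformation follows. SQLite again stands in for the warehouse and the table names are invented: the raw table is never modified, defunct rows are merely flagged, and rebuilding the model table can be repeated any number of times with the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, total REAL, deleted INTEGER)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 25.0, 0), (2, 40.0, 0), (3, 10.0, 1)])  # row 3 flagged, not deleted

def build_model(conn: sqlite3.Connection) -> None:
    """Idempotent transformation written in SQL: derive a model table
    from the untouched raw data, skipping flagged rows."""
    conn.executescript("""
        DROP TABLE IF EXISTS orders_model;
        CREATE TABLE orders_model AS
        SELECT id, total FROM raw_orders WHERE deleted = 0;
    """)

build_model(conn)
build_model(conn)  # safe to re-run: a failed attempt can simply be retried
```

Because the transformation is plain SQL over untouched raw tables, an analyst can change the model and rebuild it without any loss of data.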
Cybersecurity and privacy are highly fraught subjects, both legally and in the public imagination. The following is a list of considerations:
Regulatory compliance. Your data integration provider should at a minimum be aware of such standards as GDPR, SOC 2, HIPAA and other relevant regulations.
Owning your data. Your data integration provider should not access your data, or retain it for longer than is necessary to replicate it.
Roles with varying levels of access. The tool should feature a range of roles from administrator to read-only.
Column blocking and hashing. As a matter of security and regulatory compliance, you should be able to encrypt or omit personally identifiable information (PII) from every table that you sync.
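Column hashing can be sketched simply: sensitive columns are replaced with a one-way hash before loading, so the warehouse never stores the raw values yet identical values still match for joins. The column names are hypothetical, and a production system would add a secret salt to defeat dictionary attacks.

```python
import hashlib

def hash_pii(row: dict, pii_columns: set) -> dict:
    """Replace PII columns with a one-way SHA-256 hash before loading.
    Unsalted here for brevity; real deployments should salt."""
    out = dict(row)
    for col in pii_columns & out.keys():
        out[col] = hashlib.sha256(str(out[col]).encode()).hexdigest()
    return out

row = hash_pii({"id": 7, "email": "user@example.com"}, {"email", "ssn"})
```

Column blocking is the blunter sibling of this technique: the flagged column is simply omitted from the sync altogether.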
Read the final installment of this series here.
The excerpt above is from The Essential Guide to Data Integration: How to Thrive in an Age of Infinite Data. The book covers topics such as how data integration fuels analytics, the evolution from ETL to ELT to automated data integration, the benefits of automated data integration, and tips on how to evaluate data integration providers. Get your free copy of the guide today: