How to Evaluate Data Pipeline Solutions
Centralizing your data brings many business benefits, including heightened data integrity, more robust BI, faster reporting and time savings. Whether you’re just starting to piece together your modern data stack or looking for alternative solutions, here are some key considerations when evaluating and trialing data pipelines.
Chances are you’re using a number of sources from which you want to centralize your data. As a first step, determine whether or not the solution natively supports your sources, whether they are applications, databases, files or events. If not, does it offer an alternative means of centralizing the data from that source? In addition to your sources, you want your data ingestion tool to support your data warehouse of choice.
Can your pipeline service handle the quantity of your data while maintaining data quality? Here are some questions to consider:
- Will it centralize all of my data? Start with the most important fields and tables from each data source, which you can confirm in the tool documentation, and don’t forget about any custom objects that are integral to your business analysis.
- Can I easily query the data? Different tools present semi-structured data, such as JSON responses from API calls or JSON files, differently in the destination.
- Does the tool handle data additions, deletions and schema changes well? As a test, add and delete columns and rows, then check whether the destination reflects these changes as expected, either by applying them directly or by marking them.
- Can the tool propagate data type changes seamlessly, or will it force me to make changes manually? Not every data warehouse is compatible with every data type. How a tool handles typecasting and changes to data type impacts the types of queries you can run on the destination.
- If my data is modified, is the existing row updated in place, or is an entirely new row appended in the destination? If rows are not updated in place, you have to deal with appended data and extra “WHERE” clauses to deduplicate your data.
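The last point above is worth making concrete. When a tool appends modified rows instead of updating them in place, you end up deduplicating at query time, typically by keeping only the latest version of each record. Here is a minimal sketch using Python’s built-in sqlite3 as a stand-in for a warehouse; the `orders` table and `updated_at` column are illustrative assumptions, not taken from any particular tool:

```python
import sqlite3

# In-memory database standing in for a warehouse destination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")

# Append-only loading: the same order id appears twice after a modification.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "pending", "2024-01-01"),
        (2, "pending", "2024-01-01"),
        (1, "shipped", "2024-01-02"),  # later version of order 1, appended
    ],
)

# Deduplicate at query time: keep only the latest row per id.
latest = conn.execute(
    """
    SELECT id, status, updated_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY id ORDER BY updated_at DESC
        ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(latest)  # [(1, 'shipped', '2024-01-02'), (2, 'pending', '2024-01-01')]
```

Every query against an append-only table needs this kind of filter, which is exactly the overhead an update-in-place pipeline spares you.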
You want your pipeline to be cost-effective and easy to set up. During a trial, evaluate how easy it is to set up the solution, which includes configuring database settings, getting access permissions from web applications, and column blocking, among other things. Some solutions require specific skill sets to be able to use them, such as an understanding of Python or SQL.
The desire to gain value and insight from your data is likely the reason you began this journey in the first place. Your analysis will depend heavily on specific business needs across the organization. For example, many marketing teams want to track campaign metrics to identify the biggest growth opportunities and measure success. Similarly, HR departments can run queries to gauge employee satisfaction or to help fine-tune the hiring process.
How you choose to do your analysis is also up to your organization. Whether you choose to run SQL queries or use a business intelligence tool such as Looker or Tableau, think about what your normal workflow consists of and run analyses that would simulate a typical day.
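One way to simulate a typical day is to run a representative business question against trial data and check that the answer comes back as expected. A hedged sketch, again using Python’s sqlite3; the `campaigns` table and its columns are hypothetical, chosen to mirror the marketing example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaigns (name TEXT, channel TEXT, spend REAL, conversions INTEGER)"
)
conn.executemany(
    "INSERT INTO campaigns VALUES (?, ?, ?, ?)",
    [
        ("spring_promo", "email", 500.0, 50),
        ("spring_promo", "social", 800.0, 40),
        ("launch", "email", 300.0, 60),
    ],
)

# A typical marketing question: cost per conversion by channel.
rows = conn.execute(
    """
    SELECT channel,
           SUM(spend) / SUM(conversions) AS cost_per_conversion
    FROM campaigns
    GROUP BY channel
    ORDER BY cost_per_conversion
    """
).fetchall()

for channel, cpc in rows:
    print(f"{channel}: {cpc:.2f}")
```

The same aggregation could just as easily come from a BI tool; the point is to exercise the queries your team actually runs before committing to a pipeline.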
Amidst GDPR and increasingly strict privacy laws, it’s more important than ever to ensure that your data is secure. The tools that handle your data must be just as secure as your organization.
Check that your data pipeline complies with relevant privacy standards. Beyond compliance, an independent audit against a commonly accepted standard, such as SOC 2, can provide further assurance.
An additional security measure is locking down your database(s) with a specified SSL certificate for any tools that interact with the server. Similarly, check whether there is an option to connect through a specified SSH server. These options give you greater control over the encryption that protects your data in transit.
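On the client side, you can see what enforcing verified, encrypted connections looks like with Python’s standard `ssl` module. This is a generic sketch of the connection settings a pipeline tool might apply; how a given database driver accepts such a context (or equivalent flags) varies by driver and is an assumption left out here:

```python
import ssl

# Build a client-side TLS context that refuses unverified connections.
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse outdated protocol versions
ctx.verify_mode = ssl.CERT_REQUIRED           # reject servers without a valid cert
ctx.check_hostname = True                     # cert must match the server hostname

# To pin the specific certificate mentioned above, you would load it
# explicitly; the file path here is purely illustrative:
# ctx.load_verify_locations(cafile="/path/to/server-cert.pem")
```

A tool that exposes this level of control, rather than silently accepting any certificate, is giving you the guarantees this section describes.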
A company may have impressive technology, but does it also offer the best possible service? If you ran into any issues or had any questions while trialing a pipeline solution, were you able to get hold of someone and get appropriate answers quickly? Does the tool have a good reputation in the industry, either through direct customers or partnerships? Is there clear, informative documentation available to help you troubleshoot issues?
Ultimately, what matters most is that your ETL solution aligns with your individual and company goals. At Fivetran, we want to make your transition to a modern data stack as seamless as possible. If you haven’t yet, give us a try through a free trial or sign up for a demo of our product. If you’re interested in learning how we have helped other organizations centralize their data, check out our case studies.
About Fivetran: Shaped by the real-world needs of data analysts, Fivetran technology is the smartest, fastest way to replicate your applications, databases, events and files into a high-performance cloud warehouse. Fivetran connectors deploy in minutes, require zero maintenance, and automatically adjust to source changes — so your data team can stop worrying about engineering and focus on driving insights.