Database replication for database analytics involves many serious difficulties.
Database replication is a special instance of data integration, and consists of extracting and loading data from your databases to your data warehouse. Once the data is in a central repository, you can model and analyze the data to produce actionable insights into your operations.
However, building a data pipeline to move data from operational databases to a data warehouse is a highly involved process that includes a wide range of non-obvious technical challenges to solve and features to build, including:
Data type mapping
Replication and sync monitoring
Support and assistance
If your data pipeline does not adequately address the challenges or features listed above, then your data operations will suffer poor performance, serious data integrity problems and demoralized contributors. Here are the reasons why.
Your data pipeline shouldn’t make a full copy of each of your data sources every time you update your destination. While a full synchronization is necessary the first time you ever sync your data, doing it routinely will:
Increase the duration of each sync, making it impossible to provide quick or real-time updates
Bog down both the source database and destination data warehouse, using resources otherwise needed for operations and analytics, respectively
Consume excessive network bandwidth
These problems grow in scale as you add more data sources and the overall volume of your data increases.
The solution is to only sync data that has been updated, rather than entire data sets, i.e., to perform incremental updates. In the case of database analytics, this usually means querying changelogs that contain a full history of updates, new records and deleted records.
Idempotence means that if you execute an operation multiple times, the result will not change after the initial execution. A good example is an elevator console — pushing the button for a particular floor multiple times will achieve the same result as pushing it once.
In terms of data pipelines, idempotence prevents the creation of duplicate records when data syncs fail and must be repeated.
Without idempotence, data syncs that have failed and must be repeated will produce duplicate or conflicting data. This will produce the following problems:
Misleading or erroneous insights – counts, rank orders, sums, averages and other metrics will be thrown off
Broken operations – especially if they rely on one-to-one correspondence between records and identifiers
Wasted storage space and computational bandwidth – you will end up storing a larger volume of data
Data syncs can fail as a result of outages or bugs at the source, i.e. operational databases, within the data pipeline, and at the destination.
Your sources will evolve over time, adding, removing, or modifying fields or tables. Your data pipeline must reconcile changes between the source and destination without destroying old data.
There are several solutions to the different instances of this problem, such as soft deletes, in which deprecated fields or tables are flagged rather than removed. But what about database fields whose values simply change over time?
The ability to track changes to metrics over time is essential to learning from your organization’s past, specifically identifying trends such as changes in usage and the status and growth of accounts over time. If a particular field changes at the source, the old value is lost forever if you don’t keep track of the change. For tables where retrospective analysis might be essential, you will need to find some way to track every version of every record and revisit past versions.
Not all of the data types between sources and destinations will be compatible. You must create a hierarchy of data types and assign the minimum appropriate size in the destination for each field in the source. Over one field and a few hundred or thousand rows, the difference between 16 bytes or 8 bytes per record won’t mean much in terms of storage or memory use. However, these problems add up over multiple fields and millions of rows.
Your data pipeline should allow you to determine the freshness of your data by observing when it was last synced and the status of that sync. It should also allow you to set an interval for syncing. Without some kind of graphical user interface, you will be wholly dependent on your engineers to keep track of this information.
There are many ways for your data pipeline to be bottlenecked. If your data pipeline runs on cloud infrastructure (and it should), you will need to optimize the performance of your pipeline, queries, data warehouses and databases as your data traverses various networks. Ultimately, you will have to determine what cost and performance profile fits your needs and tune your system accordingly.
On the topic of security, there are many considerations in addition to the network traffic concerns listed above. Data storage, access control, column blocking and hashing, data governance, SAML access, regulatory compliance, and more all deserve attention.
No technology is infallible. When your data pipeline inevitably breaks down, your analysts will require the attention of engineers and other specialists for maintenance and support.
Your data needs will inevitably grow in scale as your use of data matures. You will need access to more hardware and resolve larger, more complicated versions of every problem listed above.
Extracting, loading and transforming data from databases should be painless and involve a minimum of human intervention. A homespun solution to data integration will, even in the best of cases, be painful and highly labor-intensive.
At Fivetran, we offer an automated, off-the-shelf solution to data integration. Consider a free trial or a demo, and see for yourself how you can circumvent all of the complications and challenges inherent in integrating and analyzing database data.