The rapid rise of data products has pushed companies to introduce new best practices and stricter Service Level Agreements (SLAs), because these products now serve critical functions. Whether a data product is internal or customer-facing, downtime caused by data replication issues is a major concern.

In an ideal world, there would be no data replication issues, but in reality they occur for various reasons, which we outline below. When data downtime does occur, what matters most is how quickly the issue can be resolved and the data product restored to full functionality.

Working with several clients who use our platform to replicate data from their databases to their data warehouses to power data products, we were able to reduce database resync times significantly. So much so that when we benchmarked other tools, we found that we offer the fastest resync times on the market.

The following blog details the rise of data products, their architecture, why resync time is so important, and a comparison of resync times across various data replication tools. 

Key Takeaways

  • Understand what data products are and how they are architected
  • Learn about the important role database replication plays in data products
  • Compare the resync times of the top CDC data replication platforms

Understanding Data Products 

A data product is a system or service built to leverage data in a way that directly generates value for end users. Examples include:

  • Customer-facing Dashboards: Visualizing real-time metrics for end users.

  • Recommendation Engines: Delivering product or content suggestions.

  • Monitoring and Alerts: Tracking events and triggering real-time notifications.

  • Data APIs: Exposing up-to-date data to external partners or applications.

A data product's key characteristic is its dependency on up-to-date, accurate data. This requirement means that any delay in the availability of fresh data can reduce the product’s effectiveness and, in customer-facing use cases, damage user trust.

Architecture of a Data Product


The typical architecture of a data product looks like this:

  1. Production Database: The source of truth for live transactional data (e.g., AWS MySQL Aurora). This database often supports business-critical applications, and querying it directly can slow down operations.

  2. Data Warehouse: A high-performance environment (e.g., Snowflake) consolidating data from various sources. Data warehouses are optimized for analytical queries, providing the backbone for data products.

  3. Data Pipeline Platforms: Tools like Integrate.io facilitate replication from sources such as the production database to the data warehouse.

  4. Data Product Layer: This layer consumes data from the warehouse and can include analytics tools, machine learning models, and interactive dashboards.
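
To make the separation of layers concrete, here is a minimal Python sketch of the data product layer (step 4) querying the warehouse (step 2) rather than the production database (step 1). It assumes the snowflake-connector-python package; the account, credentials, and the orders table are illustrative placeholders, not part of any specific setup.

```python
# Minimal sketch: the data product layer reads from the warehouse copy,
# so analytical load never touches the transactional database.
# Assumes snowflake-connector-python; all identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="DATA_PRODUCT_SVC",      # hypothetical service account
    password="...",
    account="my_account",         # hypothetical account identifier
    warehouse="ANALYTICS_WH",
    database="REPLICATED_PROD",   # tables replicated from Aurora via CDC
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date DESC
    LIMIT 30
""")
for order_date, revenue in cur.fetchall():
    print(order_date, revenue)

cur.close()
conn.close()
```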

The Role of Data Replication for Data Products

A data product’s success largely depends on the reliable availability of real-time data. While the data for data products can come from various sources, the most common source is a company’s transactional database.

In an ideal world, the data product could sit directly on top of the transactional database and read the relevant data from it. In practice, this is rarely an option: analytical queries degrade the database’s performance, and transactional databases aren’t built to handle the query patterns a data product demands.

This is where the data warehouse fits in. It provides a highly scalable analytical database that can be queried at massive scale without affecting the transactional workload. The final piece of this data product architecture is replicating the data from the transactional database to the data warehouse. A critical requirement here is replicating the data in near real-time, because the end users of your data product expect it to be up to date.

Change Data Capture (CDC) is the standard process for replicating data from transactional databases to data warehouses. By reading the database’s changelogs and replicating only incremental changes, CDC significantly reduces the performance impact on the transactional database.
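
As an illustration of how log-based CDC works (a generic sketch, not Integrate.io’s implementation), the open-source python-mysql-replication package can tail a MySQL/Aurora binlog and surface each insert, update, and delete as an event. The connection settings and server_id below are placeholders.

```python
# Sketch of log-based CDC against MySQL/Aurora using the open-source
# python-mysql-replication package. Settings are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

MYSQL_SETTINGS = {"host": "aurora-host", "port": 3306,
                  "user": "repl_user", "passwd": "..."}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=4242,          # must be unique among the DB's replicas
    blocking=True,           # keep waiting for new changelog events
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

# Each binlog event describes one incremental change; a CDC pipeline turns
# these into inserts/updates/deletes on the warehouse without ever
# re-reading the source tables themselves.
for event in stream:
    for row in event.rows:
        print(event.schema, event.table, type(event).__name__, row)
```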

While database replication sounds straightforward (“move this data from A to B”), it is a nuanced technology that plays a critical role in data products. That is why many companies today focus solely on providing this service.

The Importance of Data Replication Resync Times

Database replication can be broken into two core parts: the initial sync and the continuous sync. The initial sync is the historical load that moves all existing data from the database to the data warehouse. Depending on the size of the database and the replication method and technology used, this can take anywhere from hours to days. Once the initial sync is complete, the continuous sync begins: it uses CDC to detect changes on the source database and replicate them to the data warehouse.
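
Here is a simplified sketch of the two phases, assuming the pymysql package and a table with an auto-increment id primary key; the warehouse write is stubbed out so the flow stays self-contained. The key detail is recording the changelog position before the historical load begins, so the continuous sync can pick up every change made during it.

```python
# Conceptual two-phase sync sketch. Host, credentials, and the `orders`
# table are placeholders; the warehouse load is a stub.
import pymysql

def write_to_warehouse(rows):
    """Stub: a real pipeline would batch-load these rows into Snowflake."""
    print(f"loaded {len(rows)} rows")

conn = pymysql.connect(host="aurora-host", user="repl_user",
                       password="...", database="prod")

# Record the changelog position *before* the snapshot, so changes made
# while the historical load runs are still captured afterwards.
with conn.cursor() as cur:
    cur.execute("SHOW MASTER STATUS")
    log_file, log_pos = cur.fetchone()[:2]

# Phase 1: initial sync - chunked historical load of all existing rows.
last_id = 0
while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM orders WHERE id > %s ORDER BY id LIMIT 10000",
            (last_id,),
        )
        rows = cur.fetchall()
    if not rows:
        break
    write_to_warehouse(rows)
    last_id = rows[-1][0]   # assumes `id` is the first column

# Phase 2: continuous sync - tail the binlog from (log_file, log_pos),
# e.g. with BinLogStreamReader(..., log_file=log_file, log_pos=log_pos)
# from the sketch above, replicating each change as it happens.
print(f"continuous sync starts at {log_file}:{log_pos}")
```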

Consumers of data products typically expect the data on display to be up to date, so the continuous sync should run with as little latency as possible. Most CDC platforms in the industry replicate every 5 to 15 minutes, while Integrate.io supports intervals as low as every 60 seconds. (This is one of the reasons it’s the replication tool of choice for companies building data products.)

In a perfect world, the continuous sync would keep replicating data in a timely manner and without any issues. While this is true the majority of the time, there are occasions when the source database and the data warehouse fall out of sync and a resync is needed.

Typical reasons for the database and data warehouse to get out of sync include:

  1. Network Latency and Interruptions: Network issues are a leading cause of replication lag and data inconsistency. High latency, intermittent connectivity, or unstable network conditions can delay or prevent the timely capture of changes, leaving the two systems out of sync.

  2. Source Schema Changes: While most modern CDC platforms handle the majority of source schema changes, some scenarios still require a resync. Schema mismatches prevent CDC from capturing updates correctly and often require a full resync to realign source and target.

  3. High Volume in the Change Log: High write volumes or spikes can overwhelm the change log, especially in systems with limited retention settings. If the log overflows or can't keep up with data modifications, CDC might miss updates, necessitating a resync to re-establish accuracy. (A sketch for guarding log retention on Aurora follows this list.)

  4. Resource Constraints: Insufficient CPU, memory, or disk space on either the source or the target can throttle CDC processes. Resource limitations, especially during peak periods, slow or prevent change capture, causing gaps in replication that require a resync.
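
For reason 3 in particular, one concrete safeguard on AWS Aurora MySQL is extending binlog retention so the CDC reader can fall behind during a spike and still catch up without a full resync. The sketch below uses the documented mysql.rds_show_configuration and mysql.rds_set_configuration stored procedures via pymysql; the host and credentials are placeholders, and the 24-hour value should be sized to your own recovery SLA.

```python
# Check and extend changelog (binlog) retention on Aurora MySQL so a
# lagging CDC reader doesn't lose events. Credentials are placeholders.
import pymysql

conn = pymysql.connect(host="aurora-host", user="admin", password="...")

with conn.cursor() as cur:
    # Show current settings; 'binlog retention hours' defaults to NULL,
    # meaning logs may be purged as soon as they are safe to drop.
    cur.execute("CALL mysql.rds_show_configuration")
    for name, value, description in cur.fetchall():
        print(name, "=", value)

    # Keep the changelog for 24 hours so the CDC reader can fall behind
    # during a spike and still catch up. (Retention costs storage.)
    cur.execute(
        "CALL mysql.rds_set_configuration('binlog retention hours', 24)"
    )

conn.commit()
```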

A resync performs a full historical load of the affected data sets, essentially restarting the replication process so the database and data warehouse get back in sync. While the initial sync time when starting a replication process isn’t business critical (your data product isn’t live yet), resync times are: a resync happens while your data product is live and being used by consumers.

Companies using data products for business-critical use cases establish SLAs for them. A core component of these SLAs is the recovery time to get the data product back up with current data should a resync be needed.

For external-facing data products, even a brief period of downtime can lead to customer dissatisfaction or lost revenue. For internal applications, replication issues can disrupt operations, delay decision-making, and create bottlenecks in business processes.

Comparing Resync Times Across the CDC Industry

We benchmarked six CDC platforms using two sample databases to compare resync times. Our goal was to measure the resync times under realistic conditions and to note any differences as the database size increased.

  1. Small Database:

    • Type: AWS MySQL Aurora

    • Size: 10 million rows, 16 columns

    • Connectivity method: Direct

  2. Large Database:

    • Type: AWS MySQL Aurora

    • Size: 300 million rows, 7 columns

    • Connectivity method: Direct

Replication details:

  • Source database: AWS MySQL Aurora

  • Target data warehouse: Snowflake on AWS

  • Replication method: Direct connectivity for both small and large datasets.

  • Measurement: Full resync times were recorded for each platform under identical conditions.

Platforms tested:

  • Integrate.io

  • Fivetran

  • Hevo

  • Matillion

  • Rivery

  • Estuary

Resync Times Comparison Table

Platform     | Small Database | Large Database
-------------|----------------|---------------
Integrate.io | 3m 44s         | 19m 17s
Matillion    | 5m             | 39m
Fivetran     | 5m 29s         | 49m 39s
Rivery       | 21m 42s        | 13h 42m 53s
Hevo         | 22m            | 10h 47m
Estuary      | 42m            | 3h

The results show that Integrate.io outperforms competitors in data replication resync times for both small and large databases, demonstrating superior scalability and efficiency. 

For the small database, Integrate.io completes the resync in 3 minutes 44 seconds, significantly faster than the next closest competitor, Matillion, at 5 minutes. The advantage grows with the large database, where Integrate.io finishes in 19 minutes 17 seconds, compared to Matillion's 39 minutes and Fivetran's 49 minutes 39 seconds. Notably, Rivery, Hevo, and Estuary exhibit much slower resync times, especially on the large database, suggesting they may struggle to scale to larger workloads.

Overall, Integrate.io’s performance shows why it is a leading choice for organizations building mission-critical data products that have strict SLAs for recovery times. 

A detailed overview of the exact tests run and the results for each platform can be found here.

Case Study on Minimizing Resync Times With Integrate.io 

A leading player in the event ticketing industry faced performance challenges in maintaining real-time inventory data for their ticketing platform. With Integrate.io’s solution, they were able to achieve real-time data replication at scale and reduce their resync times significantly. This improvement translated directly into better customer experiences and minimized the risk of lost sales and customer trust due to data lags or downtime.

Data Challenges

The company has been a long-term Integrate.io customer, using real-time database replication to power its data products. However, it still had a number of legacy replication pipelines running on in-house scripts and infrastructure that it wanted to migrate to Integrate.io, because these pipelines consumed too much engineering time to maintain and update.

Given its long-term use of the platform, the team had no concerns about Integrate.io’s ability to handle real-time replication at scale, but the company had recently introduced strict SLA times for resyncs, should one be needed. These strict SLAs were a result of the rise of data products at the company and the critical role those products play in customer-facing experiences.

Initial resync testing on Integrate.io’s CDC product did not meet the times required by their SLAs. 

The Solution

By combining Integrate.io’s ETL and CDC products, companies can achieve industry-leading resync times alongside real-time (60-second) replication. Integrate.io’s ETL product is built for big data processing, which means it is highly scalable and can process massive datasets in parallel. Once the ETL product completes the historical sync, Integrate.io’s CDC platform takes over and replicates ongoing data changes to the data warehouse in real time.
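
To illustrate why a parallel historical load shortens resyncs, here is a conceptual sketch that splits a table into primary-key ranges and copies them concurrently. It shows the general technique only, not Integrate.io’s internals; the table, sizes, and the stubbed warehouse write are placeholders.

```python
# Conceptual parallel historical load: split the key space into ranges
# and copy them concurrently. Names and sizes are placeholders.
from concurrent.futures import ThreadPoolExecutor

import pymysql

def load_chunk(lo, hi):
    """Copy one primary-key range from Aurora to the warehouse (stubbed)."""
    conn = pymysql.connect(host="aurora-host", user="repl_user",
                           password="...", database="prod")
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE id >= %s AND id < %s",
                    (lo, hi))
        rows = cur.fetchall()
    conn.close()
    print(f"chunk [{lo}, {hi}): {len(rows)} rows")  # stub warehouse write
    return len(rows)

# Load ranges in parallel; the CDC continuous sync then resumes from the
# changelog position recorded before the reload began.
MAX_ID, CHUNK = 300_000_000, 10_000_000
ranges = [(lo, lo + CHUNK) for lo in range(0, MAX_ID, CHUNK)]
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(lambda r: load_chunk(*r), ranges))
print(f"historical load complete: {total} rows")
```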

Key Results

Our customer reduced their resync times significantly, beating their SLA targets to the delight of their teams. This allowed them to retire their legacy data pipelines, freeing up many engineering days per month to focus on building their core product.

Conclusion

Integrate.io's resync time optimization is a game-changer for businesses relying on real-time data products. Our industry benchmarks demonstrate that Integrate.io has the fastest resync times, ensuring minimal downtime and faster recovery when replication issues require a resync.

While no company wants to perform resyncs, and data replication providers must do their part to ensure the uptime of their customers' data pipelines, the reality is that, for the reasons outlined above, occasions do arise where a full or partial resync is needed.

If you are one of the increasing number of companies building business-critical data products, you need strict SLAs on your data products' uptime and a data replication partner who can help you achieve the fastest resync times possible. Schedule a time to speak with an Integrate.io Solution Engineer today about your data replication requirements. 

About Integrate.io

Integrate.io is a leading provider of database replication for companies building business-critical data products. With 60-second real-time Change Data Capture (CDC) replication and industry-leading resync speeds, Integrate.io ensures your data products have real-time data and maximum uptime. Companies serious about data products need serious data replication partners.

Sign up now for a 14-day free trial to unlock the full potential of your data product with Integrate.io!

Frequently Asked Questions

1. What is a data product?

A data product is a digital tool or solution that uses data to deliver insights, predictions, or automation. Examples include dashboards, recommendation systems, and predictive models. Designed for specific user needs, data products turn raw data into actionable outcomes, helping businesses make better decisions and drive innovation.

2. What is the typical data product architecture?

A typical data product architecture involves a production database where raw data is generated, a Change Data Capture (CDC) process to replicate data into a centralized data warehouse, and the data product sitting on top of the warehouse. This setup ensures scalable, reliable access to cleaned and processed data for analytics, machine learning, or user-facing applications.

3. Is Integrate.io historical load free?

Yes, Integrate.io offers free historical data loads for all initial syncs and resyncs.