According to a study by Seagate, only 32% of the data available to enterprises is put to work; the remaining 68% goes unleveraged. One of the challenges the study notes is making the different silos of collected data available. Using automation to bring together figures from disparate systems helps leaders make confident, reliable decisions backed by real-time information. This overview discusses how to use change data capture (CDC) to enable real-time analysis.

How Does CDC Work in AWS S3?

Amazon Simple Storage Service (S3) is a cloud-based object storage service that makes your data available from any location. Because it is a cloud service, companies benefit from improved scaling, availability, security, and performance.

The premise of S3 is the concept of buckets. Buckets are containers for objects, sometimes referred to as files. A bucket holds both the files and the metadata about those files. To store information in S3, developers upload files to the appropriate bucket, and they can set permissions for each bucket.

Integrate.io offers integrations that allow you to connect AWS S3 buckets to other data sources through S3 CDC.

Prerequisites for Using AWS S3 as a Target

AWS includes the Database Migration Service (DMS) for using Amazon S3 as a migration target. There are three prerequisites developers must meet before getting started:

Location of S3 Bucket

The S3 bucket you are using as a target must reside in the same AWS Region as the DMS replication instance you are using for the migration.

IAM Role Requirements

Identity and Access Management (IAM) roles are used to assign permissions to accounts and determine what they can access in the system.

Specific IAM requirements include:

  • The account used for the migration must have an IAM role with write and delete permissions on the target bucket
  • The role must have tagging permissions so that any objects written to the target can be tagged
  • The DMS service must be added to the IAM role as a trusted entity
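The requirements above translate into two IAM policy documents: a trust policy naming DMS as a trusted entity, and a permissions policy granting write, delete, and tagging access. Here is a minimal sketch; the bucket name is a placeholder you would replace with your own.

```python
import json

# Placeholder bucket name -- substitute your target bucket.
BUCKET = "my-dms-target-bucket"

# Trust policy: lets the DMS service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "dms.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: write, delete, and tag objects in the target bucket.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

print(json.dumps(permissions_policy, indent=2))
```

You would attach both documents to the role you supply to DMS as its service access role.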

CDC and Transaction Order

Transaction order in S3 change data capture refers to whether the system writes changes to the target in the order the transactions occurred. The two methods are:

CDC Without Transaction Order

By default, AWS DMS does not log changes in transaction order. Instead, it stores all changes for each table in one or more files, creating a folder per table in the target bucket to hold the changes coming from the source.

Capturing Changes With Transaction Order

AWS DMS can be configured to store transactions in order. This approach requires additional S3 endpoint settings that direct DMS to write the changes to .csv files containing all row changes listed in transaction order.
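The transaction-order behavior is controlled through the S3 target endpoint settings, chiefly `PreserveTransactions` and `CdcPath`. Below is a minimal sketch of such a settings document; the role ARN, bucket, and folder names are placeholders.

```python
import json

# Sketch of S3 target endpoint settings that tell DMS to preserve
# transaction order. All names and the ARN below are placeholders.
s3_settings = {
    "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-target-role",
    "BucketName": "my-dms-target-bucket",
    "BucketFolder": "migration-data",
    # Write CDC rows as .csv files under CdcPath, ordered by transaction.
    "CdcPath": "cdc-changes",
    "PreserveTransactions": True,
    "DataFormat": "csv",
}

print(json.dumps(s3_settings, indent=2))
```

With `PreserveTransactions` left at its default of false, DMS instead writes per-table change files as described above.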

Using Integrate.io’s no-code tools, you can quickly build integrations that use several AWS services such as Amazon Aurora, Amazon RDS, and Amazon Redshift.

Using AWS Data Migration Service (DMS) for CDC

AWS S3 change data capture involves four steps:

  1. Schema conversion
  2. Configuring replication instances
  3. Specifying database endpoints 
  4. Creating database migration tasks

As a first step for S3 CDC, log in to your AWS account and search for AWS DMS in the search bar. You will be taken to the DMS console, where you can get started.

  1. Using the schema conversion feature in DMS

A schema represents the logical configuration of a database. The schema of the source database must be converted to that of the target so the database configurations match and the information can be migrated successfully. This can be done in multiple ways.

You can either use the AWS Schema Conversion Tool (AWS SCT), which supports Windows and Linux (including Ubuntu) but, unfortunately, not macOS, or the schema conversion feature in AWS DMS. The DMS feature allows heterogeneous database migrations through a web-based interface and has three components, set up in this order:

  • Instance profile
  • Data providers
  • Migration project

Start by creating the instance profile. Here, S3 is the target destination; provide all the details and make sure you select the correct S3 bucket.


The next step in AWS DMS CDC is to describe your source and destination in the ‘Data providers’ section. Remember, DMS doesn’t store your database credentials.


Click ‘Create data provider’, and you will be directed to the main dashboard. Revisit the section to provide information for the target provider as well. Once that’s done, the next step is to create a ‘Migration project’.


Once you create the migration project, it initially takes about 10 to 15 minutes to provision, and then you are good to go. If you are using AWS SCT instead, install it and follow the steps in its interface.

Note: The above steps apply when you need schema conversion, for example when you are migrating to a different database engine or there are differences between the source and destination schemas.

2. Configure Replication Instance


  • Select ‘Create replication instance’ at the bottom, then specify the instance name, class, and network settings.

3. Specify Database Endpoints

Endpoints specify connection information about a data store, along with its type and location. At least one endpoint must be an AWS service; you can’t use DMS to migrate from one on-premises data store to another on-premises data store.

To configure the endpoints:

  • Click the ‘Endpoints’ menu in the left-hand navigation bar
  • Provide all the information about your source, including the source engine.


  • Once you fill in all the fields, DMS creates the endpoints for the data migration.
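The same endpoints can be created programmatically. Here is a minimal sketch of the request documents you would pass to the DMS `CreateEndpoint` API (for example via boto3's `dms_client.create_endpoint(**target_endpoint)`); every identifier, hostname, and credential below is a placeholder.

```python
import json

# Target endpoint: an S3 bucket, reached through the DMS service role.
target_endpoint = {
    "EndpointIdentifier": "s3-target-endpoint",
    "EndpointType": "target",
    "EngineName": "s3",
    "S3Settings": {
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-target-role",
        "BucketName": "my-dms-target-bucket",
    },
}

# Source endpoint: a hypothetical MySQL database.
source_endpoint = {
    "EndpointIdentifier": "mysql-source-endpoint",
    "EndpointType": "source",
    "EngineName": "mysql",
    "ServerName": "source-db.example.com",  # placeholder hostname
    "Port": 3306,
    "Username": "dms_user",   # placeholder credentials
    "Password": "********",
}

print(json.dumps(target_endpoint, indent=2))
```

Each call returns an endpoint ARN, which the replication task created in the next step refers to.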

4. Create Replication Tasks

Replication tasks migrate the data from the source to the target. You must specify the replication instance the task will use.
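A task ties together the replication instance, the two endpoints, a migration type, and a table-mapping document. The sketch below shows a `CreateReplicationTask` request (for example via boto3's `dms_client.create_replication_task(**task)`); the ARNs are placeholders for the values returned when you created the instance and endpoints.

```python
import json

# Include every table in every schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = {
    "ReplicationTaskIdentifier": "s3-cdc-task",
    # Placeholder ARNs from the previous steps.
    "SourceEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    "TargetEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    "ReplicationInstanceArn": "arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    # "full-load-and-cdc" copies existing rows, then streams ongoing changes;
    # use "cdc" to capture only changes from this point on.
    "MigrationType": "full-load-and-cdc",
    "TableMappings": json.dumps(table_mappings),
}

print(task["MigrationType"])
```

Note that `TableMappings` is passed as a JSON string, not a nested object.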


You have seen an AWS DMS CDC example. Now, let's look at some limitations of AWS DMS full load and CDC.

Limitations of AWS DMS

AWS DMS has the following limitations:

  1. The supported databases are limited.
  2. The schema conversion feature can’t carry out complex transformations; you will need AWS SCT or an ETL tool for those.
  3. Customization for modifying and mapping data is limited.
  4. It is poorly suited to large-scale data migrations, where latency can become significant.

Best Practices for Using AWS Database Migration Service for CDC

Despite the benefits of S3 CDC, it can quickly cause problems if the process isn’t configured properly. Below are a few best practices to follow to minimize issues.

Use Row Filtering When Handling Large Tables

Filtering rows to find updates on large tables can negatively affect performance during the process. To improve performance, break the process into multiple tasks.

Reduce Load on Source Database

An AWS DMS full load task performs a full table scan of the source table. The full load task also runs queries to locate changes to apply to the destination database. Running a table scan and queries could affect performance. To minimize these issues, limit the number of tasks or tables for the migration.

Removing Bottlenecks on Target Database

There may be processes running on the target database that compete with the migration. Turn off unnecessary triggers and secondary indexes during the first full load; you can turn them back on for the ongoing migration.

Frost & Sullivan's research shows that almost a quarter of IT decision-makers say automation is one of the top technologies they use to reduce costs and positively affect the bottom line. Automation empowers leaders to assess new market opportunities and make strategic decisions, and change data capture is a valuable tool in gathering these insights.

You have seen how to carry out S3 CDC and how AWS DMS helps with it. But since DMS falls short for large-scale applications that require data modifications, Integrate.io might be a good fit for you.

How Integrate.io Can Help

The Integrate.io data integration platform enables you to bring together figures from disparate systems to supply key insights into the business. If you’d like to try these integrations firsthand, get in touch with our team and experience the Integrate.io platform for yourself.