Using components: CDC Database Source

Note: CDC functionality is currently implemented for SQL Server only. If you'd like to use it with another database source, please contact Support (support@integrate.io) to have this set up.

Overview

Change Data Capture (CDC) is a feature that allows you to track and process only the data that has changed between pipeline runs. Instead of processing your entire dataset every time, CDC identifies:

  • Upserted records: New rows that were inserted OR existing rows that were updated
  • Deleted records: Rows that existed in the previous run but no longer exist in the source table

This approach significantly reduces processing time and resource usage for large datasets where only a small percentage of data changes between runs.

How It Works

CDC works by maintaining a snapshot table in your database that stores the state of your data from the previous pipeline run. On each subsequent run, the system compares the current data against this snapshot to identify changes.

Pipeline Output

When you configure a CDC source component, it produces two separate outputs:


  1. Upserted records - Contains all new and modified rows
  2. Deleted records - Contains rows that were removed from the source table

You can route these outputs to different destinations or process them with different logic as needed.

Change Detection Methods

CDC offers two methods for detecting changes:

1. Primary Key Method


The Primary Key method uses a unique identifier column to track changes.

How it works:

  1. The system compares rows between the current table and the snapshot using the specified primary key
  2. A row is considered upserted if:
    • It exists in the current table but not in the snapshot (new record)
    • It exists in both tables but any column value has changed (updated record)
  3. A row is considered deleted if:
    • It exists in the snapshot but not in the current table
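The comparison above can be sketched in a few lines of Python. This is an illustrative, in-memory model only (the function name and row dicts are assumptions, not the product's actual implementation); the real comparison runs against the snapshot table stored in your database.

```python
# Illustrative sketch of primary-key change detection using in-memory
# row dicts. Not the product's actual implementation.

def detect_changes_by_pk(current_rows, snapshot_rows, pk):
    snap = {row[pk]: row for row in snapshot_rows}
    cur = {row[pk]: row for row in current_rows}
    # Upserted: key is new, or key exists but some column value changed
    upserted = [row for key, row in cur.items() if snap.get(key) != row]
    # Deleted: key exists in the snapshot but not in the current table
    deleted = [row for key, row in snap.items() if key not in cur]
    return upserted, deleted

snapshot = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
current = [{"id": 2, "name": "Rob"}, {"id": 3, "name": "Cam"}]
ups, dels = detect_changes_by_pk(current, snapshot, "id")
# ups: id 2 (updated) and id 3 (new); dels: id 1 (removed)
```

Note that row 2 lands in the upserted output even though its key already existed, because one of its column values changed.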

Best for:

  • Tables with a reliable unique identifier (e.g., id, customer_id, order_number)
  • Standard database tables with primary key constraints

Configuration:

  1. Select Primary Key as the Change Detection Method
  2. Choose the primary key column from the dropdown (e.g., id)

2. Composite Hash Method


The Composite Hash method creates a hash value from all (or selected) columns to detect changes. This is useful when your table doesn't have a unique identifier.

How it works:

  1. The system generates a hash value (like a fingerprint) for each row based on column values
  2. The hash is stored in the snapshot table alongside the row data
  3. On subsequent runs, hashes are compared to detect changes:
    • A row is upserted if its hash doesn't exist in the snapshot
    • A row is deleted if its hash exists in the snapshot but not in the current data
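A hedged sketch of the hash comparison above, assuming SHA-256 over pipe-joined column values; the product's actual hash algorithm, column ordering, and storage format may differ.

```python
import hashlib

# Illustrative model of hash-based change detection; hash scheme is an
# assumption, not the product's actual algorithm.

def row_hash(row, columns):
    # Fingerprint of the selected column values
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode()).hexdigest()

def detect_changes_by_hash(current_rows, snapshot_hashes, columns):
    current_hashes = {row_hash(r, columns) for r in current_rows}
    upserted = [r for r in current_rows
                if row_hash(r, columns) not in snapshot_hashes]
    deleted_hashes = snapshot_hashes - current_hashes
    return upserted, deleted_hashes

cols = ["sku", "price"]
old_rows = [{"sku": "A1", "price": 10}, {"sku": "B2", "price": 20}]
new_rows = [{"sku": "A1", "price": 12}, {"sku": "B2", "price": 20}]
snap_hashes = {row_hash(r, cols) for r in old_rows}
ups, dels = detect_changes_by_hash(new_rows, snap_hashes, cols)
# A1's price changed: its new row is upserted AND its old hash is deleted
```

Because A1's change produces both an upserted row and a deleted hash, specifying a composite key column or primary key (see the configuration options below) is what lets the system report it as a single update instead.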

Best for:

  • Tables without a primary key
  • Tables where you want to detect changes based on specific columns only
  • Scenarios where the primary key might change between runs

Configuration options:

  1. Select Composite as the Change Detection Method
  2. Use custom columns for composite hash (optional):
    • If unchecked: All columns are used to generate the hash
    • If checked: You can select specific columns for hash generation
  3. Composite key column: Select a column that will help identify whether a changed hash represents an update or a deletion. If not set, rows with changed hashes will appear in both upserted and deleted outputs.
  4. Primary key (optional): In composite mode, you can still specify a primary key to help distinguish between updates and deletions more accurately.
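The effect of a key column can be sketched as follows. This is illustrative only (function names and the hash scheme are assumptions): a snapshot row whose hash changed counts as deleted only when its key has disappeared entirely; if the key survives, the row is reported as an update.

```python
import hashlib

# Illustrative only: shows how a key column lets changed hashes be
# classified as updates rather than delete+insert pairs.

def row_hash(row, columns):
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode()).hexdigest()

def classify(current_rows, snapshot_rows, hash_cols, key):
    cur_keys = {r[key] for r in current_rows}
    cur_hashes = {row_hash(r, hash_cols) for r in current_rows}
    snap_hashes = {row_hash(r, hash_cols) for r in snapshot_rows}
    upserted = [r for r in current_rows
                if row_hash(r, hash_cols) not in snap_hashes]
    # A snapshot row is deleted only if its hash is gone AND its key
    # no longer exists in the current data
    deleted = [r for r in snapshot_rows
               if row_hash(r, hash_cols) not in cur_hashes
               and r[key] not in cur_keys]
    return upserted, deleted

snapshot = [{"sku": "A1", "price": 10}, {"sku": "B2", "price": 20}]
current = [{"sku": "A1", "price": 12}]
ups, dels = classify(current, snapshot, ["sku", "price"], "sku")
# ups: A1 (price changed); dels: only B2, because A1's key still exists
```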

Configuration Steps

  1. Create a Database (CDC) Source component and connect it to your SQL Server database
  2. Select CDC mode: In the component configuration, CDC mode is automatically enabled for the Database CDC Source component type
  3. Choose your source table: Select the schema and table you want to track changes for
  4. Select a Change Detection Method:
    • Primary Key: For tables with unique identifiers
    • Composite: For tables without primary keys or when you need hash-based detection
  5. Configure the detection parameters based on your chosen method
  6. Select the input columns that will be processed by child components
  7. Connect outputs: Route the "Upserted records" and "Deleted records" outputs to your desired destinations

Snapshot Table

The system automatically creates and maintains a snapshot table in your database:

Method          Snapshot Table Name
Primary Key     {table_name}_integrate_io_snapshot
Composite Hash  {table_name}_integrate_io_snapshot_composite
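The naming pattern above can be expressed as a small helper. The function itself is hypothetical (the product derives these names internally); only the suffixes come from the table above.

```python
# Hypothetical helper mirroring the snapshot-table naming pattern above.

def snapshot_table_name(table_name, method):
    base = f"{table_name}_integrate_io_snapshot"
    return base + "_composite" if method == "composite" else base

snapshot_table_name("orders", "primary_key")  # orders_integrate_io_snapshot
snapshot_table_name("orders", "composite")    # orders_integrate_io_snapshot_composite
```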

Important notes:

  • The snapshot table is created automatically on the first run
  • The snapshot is updated after each successful pipeline run
  • Do not modify or delete the snapshot table manually, as this will affect change detection accuracy
  • The snapshot table uses the same schema as your source table

Example Use Cases

Use Case 1: Order Processing

Track new and updated orders to sync with a data warehouse:

  • Source table: orders
  • Method: Primary Key
  • Primary key: order_id
  • Upserted records: Send to data warehouse for processing
  • Deleted records: Mark as cancelled in the warehouse

Use Case 2: Product Catalog Sync

Sync product changes to an e-commerce platform:

  • Source table: products
  • Method: Composite Hash (no reliable primary key)
  • Custom columns: sku, name, price, description
  • Upserted records: Update product listings
  • Deleted records: Remove from catalog

Use Case 3: Customer Data Updates

Track customer information changes for GDPR compliance:

  • Source table: customers
  • Method: Primary Key
  • Primary key: customer_id
  • Upserted records: Log changes for audit trail
  • Deleted records: Process data deletion requests

Best Practices

  1. Choose the right detection method:
    • Use Primary Key when you have a reliable unique identifier
    • Use Composite Hash when no primary key exists or when tracking changes to specific columns
  2. Consider column selection for Composite Hash:
    • Include only columns that matter for change detection
    • Exclude frequently changing but unimportant columns (e.g., last_modified_timestamp if you only care about data changes)
  3. Monitor snapshot table size:
    • The snapshot table grows with your source table
    • Consider periodic maintenance if storage becomes a concern
  4. Handle deleted records appropriately:
    • Decide whether to hard-delete or soft-delete in your destination
    • Consider archiving deleted records for audit purposes
  5. Test with small datasets first:
    • Verify that change detection works as expected before running on production data

Troubleshooting

Q: Why are all records showing as upserted on the first run?

A: This is expected behavior. On the first run, there's no snapshot to compare against, so all current records are treated as new (upserted).

Q: Why are some updated records appearing in both upserted and deleted outputs?

A: In Composite Hash mode without a primary key specified, when a row's hash changes, the system cannot determine if it's an update or a delete/insert combination. Specifying a primary key or composite key column helps resolve this.

Q: The snapshot table wasn't created. What happened?

A: Ensure your database user has CREATE TABLE permissions on the target schema. Check the job logs for any error messages.

Q: Can I use CDC with multiple tables?

A: Yes, add a separate CDC source component for each table you want to track.

Limitations

  • Currently supported for SQL Server only
  • Cannot be used with query-based sources (table selection required)
  • Snapshot table must remain in the same database schema as the source table
  • Large initial snapshots may take time to create on first run