ETL: Snapshot CDC Source

This page is the shared reference for all Snapshot CDC source components. Each database has its own dedicated connector in the pipeline designer; pick the one that matches your connection below, then use this page for the full configuration, storage, change detection, and troubleshooting detail that applies to all of them.

Supported Databases

MySQL

PostgreSQL

SQL Server

Oracle

Oracle ADW

Amazon Redshift

Snowflake

Google Cloud SQL

Google Cloud Postgres

Heroku Postgres

AlloyDB

Vertica

Azure Synapse Analytics

IBM DB2

SAP HANA

Overview

Snapshot Change Data Capture (CDC) is a feature that allows you to track and process only the data that has changed between pipeline runs. Instead of processing your entire dataset every time, CDC identifies:

Upserted records: New rows that were inserted OR existing rows that were updated
Deleted records: Rows that existed in the previous run but no longer exist in the source table

This approach significantly reduces processing time and resource usage for large datasets where only a small percentage of data changes between runs.

CDC Database source component in the pipeline designer

How It Works

Snapshot CDC works by maintaining a snapshot of your data from the previous pipeline run. On each subsequent run, the system compares the current data against this snapshot to identify changes. The snapshot can be stored in one of two ways, depending on the Snapshot Storage method you select:

Database: The snapshot is stored as a table in the same database as your source data. This option is only available for SQL Server connections.
File Based: The snapshot is stored as a Parquet file on cloud storage managed by Integrate.io. This option is available for all database connection types (MySQL, PostgreSQL, SQL Server, Snowflake, and others).

Pipeline Output

When you configure a CDC source component, it produces two separate outputs:

CDC component with upserted and deleted record output paths

Upserted records - Contains all new and modified rows
Deleted records - Contains rows that were removed from the source table

You can route these outputs to different destinations or process them with different logic as needed. See Building a mirror destination from CDC outputs for the recommended pattern when you want a destination table to be a true mirror of the source (including deletes).

Building a mirror destination from CDC outputs

The CDC source emits two streams. What those streams mean for your destination depends on what you want the destination to look like.

Recommended pattern: a true mirror table

To make a destination table track every insert, update, and delete in the source, wire three destination components to the CDC source:

Component	Reads from	Target	Operation	Key
Mirror Upsert	Upserted records	`mirror_table`	Merge	Primary key
Mirror Delete	Deleted records	`mirror_table`	Delete	Primary key
Audit (optional)	Deleted records	`deleted_audit_table`	Append	(none)

After every run, mirror_table is an exact replica of the source table. The optional deleted_audit_table keeps a tombstone log of every removed row.

The Mirror Delete component is only safe when the Deleted records output contains only truly-deleted rows. That is the case for Primary Key mode and for the hybrid Composite + Primary Key mode. Do not add this component when using Composite mode without a Primary Key, because the Deleted records output contains “ghost” rows for every UPDATE, and deleting them from the mirror would undo your updates. See the Composite Hash Method section below.

Common variations

Upsert-only sink (no deletes): wire only the Mirror Upsert component. Deleted rows in the source will accumulate in the destination indefinitely.
Audit log only: wire only the Audit component if you only care about “what disappeared”.
Append-only event log: wire both outputs to append-only tables (no Merge, no Delete). This gives you a full change history.

Snapshot Storage

Snapshot CDC supports two storage methods for maintaining the snapshot between runs. The storage method is selected in the Snapshot Storage toggle in the component configuration.

Database Storage (SQL Server only)

The snapshot is stored as a table directly in your source database. This option is only available for SQL Server connections and requires write access to the source database.

A snapshot table is automatically created in the same schema as the source table
Change detection queries (upsert and delete) are executed entirely within the database using SQL
Requires CREATE TABLE and INSERT/DELETE permissions on the source database

File Based Storage

The snapshot is stored as a Parquet file on cloud storage managed by Integrate.io. This option is useful when you do not have write access to the source database, or when you prefer not to create additional tables in your source system.

No write access to the source database is required; only read access is needed
The snapshot is stored as a Parquet file on Integrate.io’s managed S3 storage
Change detection is performed by loading the current data and the previous snapshot, then comparing them in the pipeline
On the first run, when no previous snapshot exists, all current records are treated as upserted (new)
The snapshot file is automatically overwritten after each successful pipeline run

When to use File Based storage:

You are using a non-SQL Server database (MySQL, PostgreSQL, Snowflake, etc.), File Based is the only snapshot storage option for these connection types
Your database user has read-only access to the source database
You do not want to create snapshot tables in the source database
Corporate policy prohibits writing additional tables to the production database

Availability by Connection Type

Connection Type	Database Storage	File Based Storage
SQL Server	Yes	Yes
MySQL	No	Yes
PostgreSQL	No	Yes
Snowflake	No	Yes
Other databases	No	Yes

Change Detection Methods

Snapshot CDC offers two methods for detecting changes. Both methods work with either snapshot storage option (Database or File Based). Note that Database storage is only available for SQL Server connections; for all other database types, File Based storage is used.

1. Primary Key Method

Primary Key change detection method configuration

The Primary Key method uses a unique identifier column to track changes. How it works:

The system compares rows between the current table and the snapshot using the specified primary key

A row is considered upserted if:

It exists in the current table but not in the snapshot (new record)
It exists in both tables but any column value has changed (updated record)

A row is considered deleted if:

It exists in the snapshot but not in the current table

Best for:

Tables with a reliable unique identifier (e.g., id, customer_id, order_number)
Standard database tables with primary key constraints

Configuration:

Select Primary Key as the Change Detection Method

Choose the primary key column from the dropdown (e.g., id)

With File Based storage: When using File Based snapshot storage with Primary Key method, the system generates an MD5 hash of all non-primary-key columns to efficiently detect which rows have changed. Rows are matched by primary key, and the hash is used to determine whether a row has been updated.

2. Composite Hash Method

Composite Hash change detection method configuration

The Composite Hash method creates a hash value from all (or selected) columns to detect changes. This is useful when your table doesn’t have a unique identifier. How it works:

The system generates a hash value (like a fingerprint) for each row based on column values

The hash is stored in the snapshot alongside the row data

On subsequent runs, hashes are compared to detect changes:

A row is upserted if its hash doesn’t exist in the snapshot
A row is deleted if its hash exists in the snapshot but not in the current data

Best for:

Tables without a primary key
Tables where you want to detect changes based on specific columns only
Scenarios where the primary key might change between runs

Without a Primary Key, every UPDATE in your source appears in BOTH the upserted and deleted outputs.Composite Hash mode without a Primary Key has no way to recognise that an old hash and a new hash belong to the same row. From CDC’s point of view, an UPDATE is a DELETE (old hash) followed by an INSERT (new hash).If you intend to route the Deleted records output to a destination that performs DELETE operations (for example, a Mirror Delete component), you must specify a Primary Key (see step 4 below). Otherwise the Deleted records output will contain a copy of every updated row in addition to the truly-deleted ones, and using it to delete from a mirror table will revert your updates.

Configuration options:

Select Composite as the Change Detection Method

Use custom columns for composite hash (optional):

If unchecked: All columns are used to generate the hash
If checked: You can select specific columns for hash generation. This is useful for ignoring noise updates to bookkeeping columns (e.g. last_updated_at, synced_at) while still detecting business-meaningful changes

Composite key column (Database storage / SQL Server only): Select a column that will help identify whether a changed hash represents an update or a deletion. If not set, rows with changed hashes will appear in both upserted and deleted outputs. This option is not available with File Based snapshot storage.

Primary key (highly recommended when available):

When set, CDC pairs old and new hashes by Primary Key, so updates appear only in the upserted output and the deleted output contains only truly-deleted rows. This is the recommended hybrid configuration for production
When NOT set, every UPDATE produces a row in both the upserted output (new hash) AND the deleted output (old hash). See the warning above

Composite + Primary Key with no custom hash columns is functionally identical to Primary Key mode. The composite hash adds no value when it spans every column. To get value from Composite mode while keeping a Primary Key, also enable Use custom columns for composite hash and pick the business columns that should trigger updates.

With File Based storage: The Composite Hash method works the same way with File Based storage. The hash values and row data are stored in the Parquet snapshot file instead of a database table. All hash computation and comparison is performed in the pipeline. Note that the Composite key column option is not available with File Based storage; use the Primary key field instead to distinguish between updates and deletions.

Choosing a Configuration

Use the table below to pick a configuration based on your source table and what you want from the deleted output:

Your situation	Recommended mode	Why
Have a stable Primary Key and want every column change detected	Primary Key	Simplest config, clean upsert and delete streams
Have a stable Primary Key and want to ignore changes to some columns	Composite + Primary Key + custom hash columns (recommended)	Filters out noise (e.g. `last_synced_at`) while keeping the delete stream clean
No Primary Key exists	Composite (no Primary Key)	Works, but the deleted output will include “ghost” rows for every update. Do not wire it to a Delete-by-key destination
Have a Primary Key but selected Composite without custom hash columns	Equivalent to Primary Key mode	Composite + Primary Key + all-columns hash is functionally identical to Primary Key mode

Configuration Steps

Create a Database (CDC) Source component and connect it to your database

Select CDC mode: In the component configuration, CDC mode is automatically enabled for the Database CDC Source component type

Choose your data source:

Table mode (default): Select the schema and table you want to track changes for. Optionally add a Where clause to filter rows.
Query mode (File Based only): Switch the Source Mode to Query and write a custom SQL query. This allows JOINs across multiple tables, complex filtering, and column aliases. After writing the query, click Refresh Fields to load the available columns.

Select a Change Detection Method:

Primary Key: For tables with unique identifiers
Composite: For tables without primary keys or when you need hash-based detection

Configure the detection parameters based on your chosen method

Select Snapshot Storage:

Database: Stores the snapshot as a table in your source database. Requires write access. Only available for SQL Server connections.
File Based: Stores the snapshot as a Parquet file on Integrate.io managed storage. No write access to the source database is needed. Available for all database connection types.

Select input columns which will be processed by child components

Connect outputs: Route the “Upserted records” and “Deleted records” outputs to your desired destinations

Snapshot Storage Details

Database Snapshot Table (SQL Server only)

When using Database snapshot storage (available only for SQL Server connections), the system automatically creates and maintains a snapshot table in your database:

Method	Snapshot Table Name
Primary Key	`{table_name}_integrate_io_snapshot`
Composite Hash	`{table_name}_integrate_io_snapshot_composite`

Important notes:

The snapshot table is created automatically on the first run
The snapshot is updated after each successful pipeline run
Do not modify or delete the snapshot table manually, as this will affect change detection accuracy
The snapshot table uses the same schema as your source table

File Based Snapshot

When using File Based snapshot storage, the snapshot is stored as a Parquet file on cloud storage managed by Integrate.io. The file is stored at a path unique to your account, package, and source table, so each CDC component maintains its own independent snapshot.

File Based Snapshot: Query Mode

File Based snapshot storage with Query mode enabled

When using File Based snapshot storage, you can switch the Source Mode from Table to Query. This allows you to write a custom SQL query as the data source for CDC, instead of selecting a single table. Why use Query Mode:

You need to JOIN multiple tables and track changes across the combined result
You need complex WHERE clauses, aggregations, or transformations applied before change detection
You want to select only a subset of columns from one or more tables
You need to rename columns using aliases (e.g., c.name AS customer_name)

How it works:

Set the Source Mode toggle to Query

Write your SQL query in the query editor

Click Refresh Fields to load the columns returned by your query

Select the Primary Key column from the returned fields (the key must uniquely identify each row in your query result)

Important:The snapshot is identified by the component ID rather than a table name. This means each CDC component using Query Mode maintains its own independent snapshot, even if two components use the same query.

Important notes:

No tables are created in your source database
The snapshot file is created automatically on the first run
The snapshot file is overwritten after each successful pipeline run with the latest data
Snapshot files are managed automatically and do not require manual maintenance
If you change the source table, schema, or column selection, the existing snapshot will be used for comparison against the new data, which may cause all rows to appear as upserted (and previous rows as deleted) on the first run after the change

Example Use Cases

Use Case 1: Order Processing (SQL Server)

Track new and updated orders to sync with a data warehouse:

Connection: SQL Server
Source table: orders
Method: Primary Key
Primary key: order_id
Snapshot Storage: Database
Upserted records: Send to data warehouse for processing
Deleted records: Mark as cancelled in the warehouse

Use Case 2: Product Catalog Sync (SQL Server)

Sync product changes to an e-commerce platform:

Connection: SQL Server
Source table: products
Method: Composite Hash (no reliable primary key)
Custom columns: sku, name, price, description
Snapshot Storage: Database
Upserted records: Update product listings
Deleted records: Remove from catalog

Use Case 3: Customer Data Updates (MySQL)

Track customer information changes for GDPR compliance:

Connection: MySQL
Source table: customers
Method: Primary Key
Primary key: customer_id
Snapshot Storage: File Based
Upserted records: Log changes for audit trail
Deleted records: Process data deletion requests

Use Case 4: Read-Only SQL Server CDC

Track changes on a SQL Server database where you only have read access:

Connection: SQL Server
Source table: transactions
Method: Primary Key
Primary key: transaction_id
Snapshot Storage: File Based
Upserted records: Load into analytics warehouse
Deleted records: Flag as reversed in the warehouse

Use Case 5: Multi-Table CDC with Query Mode

Track changes across a JOIN of orders, customers, and products:

Connection: MySQL (or any supported database)
Source Mode: Query

Query:

SELECT o.order_id, c.name AS customer_name, p.name AS product_name,
       o.quantity, o.total_price, o.status
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products p ON p.product_id = o.product_id
WHERE o.status IN ('pending', 'shipped')

Method: Primary Key
Primary key: order_id
Snapshot Storage: File Based
Upserted records: Send enriched order data to the warehouse
Deleted records: Archive cancelled or completed orders

Use Case 6: PostgreSQL CDC

Sync data from a PostgreSQL database (File Based storage is required since Database storage is not available for PostgreSQL):

Connection: PostgreSQL
Source table: inventory
Method: Composite Hash
Custom columns: sku, quantity, warehouse_id
Snapshot Storage: File Based
Upserted records: Update inventory management system
Deleted records: Remove discontinued items

Use Case 7: True mirror of a production table

Keep a destination table that exactly mirrors the source, ignoring noise updates to bookkeeping columns. This is the canonical production pattern, combining the hybrid Composite + Primary Key mode with the three-destination wiring described in Building a mirror destination from CDC outputs:

Connection: PostgreSQL (or any supported database)
Source table: orders
Method: Composite (hybrid)
Custom hash columns: status, total, customer_id, fulfilled_at (business fields only)
Primary key: order_id
Snapshot Storage: File Based
Destinations:
- Mirror Upsert: Database Destination on the Upserted records output. Target analytics.orders_mirror, operation Merge, key order_id
- Mirror Delete: Database Destination on the Deleted records output. Target analytics.orders_mirror, operation Delete, key order_id
- Audit (optional): Database Destination on the Deleted records output. Target analytics.orders_deleted_log, operation Append

With this configuration, updates that only touch last_synced_at or other unhashed columns are correctly ignored, while inserts, updates to business columns, and deletes are all reflected in the mirror.

Best Practices

Choose the right detection method:
- Use Primary Key when you have a reliable unique identifier
  - Use Composite Hash when no primary key exists or when tracking changes to specific columns
Consider column selection for Composite Hash:
- Include only columns that matter for change detection
  - Exclude frequently changing but unimportant columns (e.g., last_modified_timestamp if you only care about data changes)
Choose the right snapshot storage:
- Use Database storage when you are using a SQL Server connection, have write access, and want the most efficient change detection (queries run entirely in the database)
  - Use File Based storage when you are using any non-SQL Server database, when you have read-only access, or when you do not want to create additional tables in the source database
Monitor snapshot size:
- For Database storage: The snapshot table grows with your source table. Consider periodic maintenance if storage becomes a concern.
  - For File Based storage: The Parquet snapshot file is managed automatically by Integrate.io.
Handle deleted records appropriately:
- Decide whether to hard-delete or soft-delete in your destination
  - Consider archiving deleted records for audit purposes
Test with small datasets first:
- Verify that change detection works as expected before running on production data
Query Mode best practices:
- Ensure your query result has a column that uniquely identifies each row, and use it as the Primary Key for accurate change detection
  - Do not end your query with a semicolon. The system appends processing logic and a trailing semicolon may cause errors
  - Use column aliases to avoid ambiguous names when joining multiple tables (e.g., c.name AS customer_name)
  - Avoid non-deterministic functions (e.g., NOW(), RAND()) in your query. They produce different values on each run, causing all rows to appear as changed
Avoid changing the Where clause or query between runs:
- Changing the Where clause or rewriting the query alters which rows are included in the comparison. This may cause all rows to appear as upserted and previous rows as deleted on the first run after the change.
Pair Composite mode with a Primary Key whenever one exists:
- Composite-only mode (no Primary Key) populates the Deleted records output with both true deletes and “ghost” rows for every update. The hybrid Composite + Primary Key mode is the recommended production configuration. See Choosing a Configuration.
To make a destination behave as a true mirror, wire two destinations to it:
- One Merge-by-PK destination on Upserted records, and one Delete-by-PK destination on Deleted records, both targeting the same mirror table. See Building a mirror destination from CDC outputs.

Troubleshooting

Why are all records showing as upserted on the first run?

This is expected behavior. On the first run, there’s no snapshot to compare against, so all current records are treated as new (upserted). This applies to both Database and File Based snapshot storage.

Why are some updated records appearing in both upserted and deleted outputs?

You are using Composite Hash mode without a Primary Key. In that mode, every UPDATE in the source appears in both outputs (the new hash in upserted, the old hash in deleted). Specify a Primary Key in the Composite configuration to fix this. See the warning in the Composite Hash Method section for the full explanation.

The snapshot table wasn't created. What happened?

This applies to Database snapshot storage (SQL Server only). Ensure your database user has CREATE TABLE permissions on the target schema. Check the job logs for any error messages. If you do not have write access to the database, or you are using a non-SQL Server connection, use File Based snapshot storage instead.

Can I use CDC with multiple tables?

Yes, add a separate CDC source component for each table you want to track.

Why don't I see the Database snapshot storage option?

Database snapshot storage is only available for SQL Server connections. If you are using MySQL, PostgreSQL, Snowflake, or another database type, only File Based snapshot storage is available.

Can I switch between Database and File Based snapshot storage?

Yes, but switching storage methods will effectively reset the snapshot. On the first run after switching, all current records will appear as upserted because the new storage location has no previous snapshot to compare against.

I changed my Where clause (or query) and now all records show as changed. Why?

Changing the Where clause or the source query alters which rows are included in the comparison. Rows that were previously included in the snapshot but are now excluded will appear as deleted, and rows newly included will appear as upserted. The first run after the change will reflect the difference between the old and new result set.

How do I load the fields for my query in Query Mode?

After writing your SQL query, click the Refresh Fields button next to the Primary Key dropdown. This executes the query against the database and populates the available columns. You can then select the appropriate primary key.

Can I use Query Mode with Database snapshot storage?

No. Query Mode is only available with File Based snapshot storage. If you need Database snapshot storage (SQL Server), use Table mode with a Where clause for filtering.

My query returns duplicate primary key values. What happens?

If your query produces rows with duplicate primary key values, change detection will not work correctly. Ensure that the column selected as the primary key is unique across all rows returned by your query. When joining tables, use the primary key from the “main” table (e.g., order_id from an orders table in an orders-customers-products JOIN).

Does File Based CDC require me to configure an S3 connection?

No. File Based snapshot storage uses Integrate.io’s managed cloud storage. No additional connection configuration is required on your part.

Limitations

Database storage: Only available for SQL Server connections. Snapshot table must remain in the same database schema as the source table. Requires write access to the database.
File Based storage: Available for all database connection types. No write access to the source database is required, but the pipeline performs the comparison (rather than the database), which may use more cluster resources for very large tables.
Large initial snapshots may take time to process on the first run
Query Mode: Only available with File Based snapshot storage. The query result must contain a column suitable for use as a primary key. Non-deterministic functions in the query will cause false positives in change detection.
Changing the Where clause, source query, or schema columns between runs may cause unexpected results on the first run after the change

​Supported Databases

MySQL

PostgreSQL

SQL Server

Oracle

Oracle ADW

Amazon Redshift

Snowflake

Google Cloud SQL

Google Cloud Postgres

Heroku Postgres

AlloyDB

Vertica

Azure Synapse Analytics

IBM DB2

SAP HANA

​Overview

​How It Works

​Pipeline Output

​Building a mirror destination from CDC outputs

​Recommended pattern: a true mirror table

​Common variations

​Snapshot Storage

​Database Storage (SQL Server only)

​File Based Storage

​Availability by Connection Type

​Change Detection Methods

​1. Primary Key Method

​2. Composite Hash Method

​Choosing a Configuration

​Configuration Steps

​Snapshot Storage Details

​Database Snapshot Table (SQL Server only)

​File Based Snapshot

​File Based Snapshot: Query Mode

​Example Use Cases

​Use Case 1: Order Processing (SQL Server)

​Use Case 2: Product Catalog Sync (SQL Server)

​Use Case 3: Customer Data Updates (MySQL)

​Use Case 4: Read-Only SQL Server CDC

​Use Case 5: Multi-Table CDC with Query Mode

​Use Case 6: PostgreSQL CDC

​Use Case 7: True mirror of a production table

​Best Practices

​Troubleshooting

​Limitations

Supported Databases

Overview

How It Works

Pipeline Output

Building a mirror destination from CDC outputs

Recommended pattern: a true mirror table

Common variations

Snapshot Storage

Database Storage (SQL Server only)

File Based Storage

Availability by Connection Type

Change Detection Methods

1. Primary Key Method

2. Composite Hash Method

Choosing a Configuration

Configuration Steps

Snapshot Storage Details

Database Snapshot Table (SQL Server only)

File Based Snapshot

File Based Snapshot: Query Mode

Example Use Cases

Use Case 1: Order Processing (SQL Server)

Use Case 2: Product Catalog Sync (SQL Server)

Use Case 3: Customer Data Updates (MySQL)

Use Case 4: Read-Only SQL Server CDC

Use Case 5: Multi-Table CDC with Query Mode

Use Case 6: PostgreSQL CDC

Use Case 7: True mirror of a production table

Best Practices

Troubleshooting

Limitations