Overview
Requirements
- Billing must be enabled on the Google Cloud project.
- Ensure you have permission to create a service account with the
BigQuery Data Editor
role on the project.
Setting up BigQuery for ELT & CDC
Create service account
- Go to your project's service accounts page and click
+ CREATE SERVICE ACCOUNT
. - Enter the
Service account name
and add a description.
ClickCREATE AND CONTINUE
. - Select the
BigQuery Data Editor
role. - Select the
BigQuery Job User
role.
ClickCONTINUE
. - Click
DONE
. - The service account should have been created successfully. Find it in the service accounts list and click it.
- Go to the
KEYS
tab. - Click
ADD KEY
and selectCREATE NEW KEY
. - Select
JSON
and clickCREATE
. - A JSON file should have been downloaded to your system. Upload it.
We will fetch theclient_email
andprivate_key
properties from the JSON credential file.
Limitations
BigQuery doesn’t support deduplication. Instead, the following columns are added to the destination table for future deduplication on the customer’s side.
__offset - Offset column. Keeps track of the offset in incremental order to determine which existing row is the latest.
__op - Operation column. Tells what kind of operation is performed for a particular row (0 - Created, 1 - Inserted, 2 - Updated, 3 - Deleted)
This query is to list the de-duplicated records:
SELECT customer.*
FROM (SELECT id,
ARRAY_AGG(
t ORDER BY t.__offset DESC LIMIT 1
)[OFFSET(0)] customer
FROM `euphoric-effect-296205.northwind.customer` t
GROUP BY id);
This query is to upsert a staging table that Integrate.io is syncing to, into a production table used for querying:
MERGE INTO `northwind`.customers o
USING (SELECT event.*
FROM (SELECT customer_id,
ARRAY_AGG(
t ORDER BY t.__offset DESC LIMIT 1
)[OFFSET(0)] event
FROM customers_staging_1 t
GROUP BY customer_id)) as s
ON o.customer_id = s.customer_id
WHEN MATCHED AND s.__op = 3 THEN
DELETE
WHEN MATCHED THEN
UPDATE SET o.company_name = s.company_name, o.__op = s.__op
WHEN NOT MATCHED THEN
INSERT ROW
After deleting tables on BigQuery, we have to wait (5+ minutes) to let the change propagate to all the machines in the BigQuery cluster. Otherwise, it might lead to some rows missing. This happens on resync requests. Please contact support@integrate.io to assist with the resync timing of user tables.