Using components: Google BigQuery Destination

Use the Google BigQuery destination component to store the output of a data flow in a BigQuery table. The destination component first stages the data in Google Cloud Storage and then uses BigQuery's API to load it into the target table.
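
For reference, the second step is an ordinary BigQuery load job that reads the staged files from Google Cloud Storage. Below is a minimal sketch with the google-cloud-bigquery Python client; the project, bucket, file format, and table names are hypothetical, and this illustrates the BigQuery API rather than Integrate.io ETL's internal implementation:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project

    # Assume the data flow's output was staged as CSV files in Cloud Storage.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # or pass an explicit schema instead
    )

    load_job = client.load_table_from_uri(
        "gs://my-staging-bucket/output/*.csv",   # hypothetical staging location
        "my-project.my_dataset.target_table",    # hypothetical target table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish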

Connection

Select an existing Google BigQuery connection or create a new one (for more information, see Allowing Integrate.io ETL access to my Google BigQuery dataset).

Destination Properties 

  • Target table - the name of the target table in your BigQuery dataset. By default, if the table doesn't exist, it will be created automatically.
  • Automatically create table if it doesn't exist - if unchecked and the table doesn't exist, the job fails.
  • Automatically add missing columns - when checked, the job checks whether each of the specified columns exists in the table and adds any that are missing (see the sketch after this list). Note: uncheck this if your target is a partitioned table (see below).
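
For reference, BigQuery's load API exposes a schema update option that adds missing columns during a load. A minimal sketch of that option with the Python client, continuing the hypothetical example above; it illustrates the BigQuery API, not necessarily how Integrate.io ETL implements the checkbox:

    from google.cloud import bigquery

    # Allow the load job to add columns that appear in the incoming data
    # but do not yet exist in the target table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    )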

Inserting data into partitioned tables

Data can be inserted into a partitioned table by specifying the target partition with the table$partition notation (e.g. events$20160405).
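
In BigQuery's API this corresponds to putting the partition decorator in the destination table ID of the load job. A short sketch, reusing the hypothetical client and job configuration from above:

    load_job = client.load_table_from_uri(
        "gs://my-staging-bucket/output/*.csv",
        "my-project.my_dataset.events$20160405",  # load into the 2016-04-05 partition
        job_config=job_config,
    )
    load_job.result()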

Operation type

Append (Insert only) - the default behaviour. Data is only appended to the target table.

Overwrite (Truncate and insert) - the existing table is deleted and then recreated with the newly inserted data.
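
For reference, Append and Overwrite roughly correspond to the WRITE_APPEND and WRITE_TRUNCATE write dispositions of a BigQuery load job. A brief sketch with the Python client (an illustration of the API, not Integrate.io ETL's internals):

    from google.cloud import bigquery

    # Append (Insert only): add the incoming rows to whatever is already in the table.
    append_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Overwrite (Truncate and insert): replace the table's contents with the incoming data.
    overwrite_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )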

Merge with existing data - incoming data is merged with existing data in the table. Requires setting the merge keys correctly in field mapping. Merge is done in the following manner:

  • A staging table is created and the data flow's data is stored in it.
  • The target table is copied to a new table (target-copy) and the target table is then cleared.
  • Data that exists in the staging table but not in target-copy is inserted into the target table (new data).
  • Data that exists in both the staging table and target-copy is joined; column values are taken from the staging table unless they are null (old/updated data).
  • Both the staging table and the target-copy table are dropped.
  • In case of an error, the process tries to copy target-copy back onto the original table.

Note: In merge, the incoming data must be unique according to the key fields you selected. You may use the aggregate component or the limit component (partition by the key field(s) and limit to 1 record per partition) to make sure that the key fields are indeed unique.
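
For example, one way to guarantee that uniqueness is to keep a single row per key before the data reaches the destination. The sketch below expresses the limit-per-partition idea as a BigQuery query run with the Python client from the earlier example; the table name, key column id, and ordering column updated_at are hypothetical:

    # Keep only the most recent row for each key value (hypothetical names).
    dedupe_sql = """
    CREATE OR REPLACE TABLE `my-project.my_dataset.staging_dedup` AS
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM `my-project.my_dataset.staging_table`
    )
    WHERE rn = 1
    """
    client.query(dedupe_sql).result()  # staging_dedup now holds one row per id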

Advanced options

  • Maximum errors - If this number of errors occurs in BigQuery while loading data into the table, the job fails.
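
In BigQuery's load API, the analogous setting is the load job's maximum number of bad records; a one-line sketch on the hypothetical job configuration from the examples above:

    job_config.max_bad_records = 10  # the load fails if more than 10 bad records occur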

Schema Mapping

Map the dataflow fields to the target table's columns. Columns defined as keys are used as the sort key when Integrate.io ETL creates the table. If the merge operation is used, you must select one or more fields as keys; these are used to uniquely identify rows in the table for the merge operation.

The data types in Integrate.io ETL are mapped as follows when the table is created automatically. Note that since Integrate.io ETL doesn't have a notion of maximum string length, the string columns are created with the maximum length allowed in BigQuery.

Integrate.io ETL    BigQuery
String              STRING
Integer             INTEGER
Long                INTEGER
Float               FLOAT
Double              FLOAT
DateTime            TIMESTAMP, DATE, or DATETIME
Boolean             BOOLEAN
Map                 JSON
Bag                 STRING REPEATED

Note: Datetime fields can be mapped to TIMESTAMP, DATE, or DATETIME, depending on the System Variable _BQ_DATE_DATATYPE setting.  TIMESTAMP is the default value. Be aware that all datetime fields in the data set will be mapped to the same datatype.