Sources - S3

Description

The S3 Connector periodically syncs files from an Amazon S3 bucket to a variety of destinations. It supports multiple file formats and offers comprehensive configuration options to meet users' data replication needs.

Supported Replication

  • Initial Sync
  • Continuous Sync

Authentication Type

IAM Role Authentication

Configuration

General Configurations

  • Start Date: The initial date from which files should be synced. Useful for historical data imports.
  • File Prefix: A string that files must start with to be considered for syncing. Helps in filtering relevant files.
  • File Regex: A regular expression to match file names. Offers precise control over which files are synced.
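
As an illustration of how these fields narrow the set of candidate files, here is a minimal Python sketch using boto3. The bucket, prefix, pattern, and start date are hypothetical placeholders, and the connector's internal implementation may differ.

    import re
    from datetime import datetime, timezone

    import boto3

    # Hypothetical values mirroring the configuration fields above.
    BUCKET = "example-bucket"                                # Bucket Name
    START_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)   # Start Date
    FILE_PREFIX = "exports/daily_"                           # File Prefix
    FILE_REGEX = re.compile(r".*\.csv(\.gz)?$")              # File Regex

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    candidates = []
    # The prefix is applied server-side by S3; the regex and start date
    # are checked client-side against each returned object.
    for page in paginator.paginate(Bucket=BUCKET, Prefix=FILE_PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= START_DATE and FILE_REGEX.match(obj["Key"]):
                candidates.append(obj["Key"])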

S3 Connection Configuration

  • Bucket Name: The name of the S3 bucket from which files will be synced. This is a key identifier in the AWS ecosystem.
  • Bucket Region: The AWS region the bucket resides in.
  • Role ARN and External ID: These are part of the IAM role setup that permits access to your S3 bucket from an external account. Customers must create a new IAM policy and role to grant this access; a sketch of the setup follows this list. Note that the External ID is unique to each client and cannot be modified.
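
As a rough sketch of the IAM setup described above, the following Python snippet creates a role whose trust policy allows an external account to assume it only when the supplied External ID matches, then attaches a read-only bucket policy. The account ID, External ID, and resource names are placeholders; use the exact values provided during source setup.

    import json

    import boto3

    # Placeholders -- substitute the account ID and External ID shown
    # during source setup. The External ID is unique per client.
    CONNECTOR_ACCOUNT_ID = "123456789012"
    EXTERNAL_ID = "your-unique-external-id"
    BUCKET = "example-bucket"

    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{CONNECTOR_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
        }],
    }

    read_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }],
    }

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName="s3-connector-role",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam.put_role_policy(
        RoleName="s3-connector-role",
        PolicyName="s3-connector-read",
        PolicyDocument=json.dumps(read_policy),
    )
    print("Role ARN:", role["Role"]["Arn"])  # paste into the Role ARN field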

Table Name

  • Table Name: The table name is required and is used to create the destination table for all synced data. The name should follow the conventions and limitations of the destination (such as Redshift or Snowflake) to avoid errors.

Source Schema

The S3 Connector automates schema fetching by sampling files within the S3 bucket. It attempts to infer data types, and users can modify the inferred types.
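
The inference logic itself is internal to the connector, but the general idea can be sketched: sample a bounded number of rows per file and pick the narrowest type that fits every observed value, falling back to string. The helper below is a hypothetical simplification, not the connector's actual algorithm.

    import csv
    import io

    def infer_type(values):
        """Pick the narrowest type that fits all sampled values for a column."""
        def fits(parse):
            return all(parse(v) for v in values if v != "")
        if fits(lambda v: v.lower() in ("true", "false")):
            return "boolean"
        if fits(lambda v: v.lstrip("-").isdigit()):
            return "integer"
        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False
        if fits(is_number):
            return "number"
        return "string"  # the default when nothing narrower fits

    def infer_schema(csv_text, sample_rows=50):
        """Sample up to sample_rows rows (the connector samples 50 per file)."""
        reader = csv.DictReader(io.StringIO(csv_text))
        rows = [row for _, row in zip(range(sample_rows), reader)]
        return {col: infer_type([r[col] for r in rows])
                for col in reader.fieldnames}

    sample = "id,active,score\n1,true,3.5\n2,false,4\n"
    print(infer_schema(sample))
    # {'id': 'integer', 'active': 'boolean', 'score': 'number'}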

Supported Data Types

  • The connector supports the following data types: string, boolean, number, integer, array, object, bigint, date, and datetime, with string being the default.
  • Note: array and object data types are inserted as blobs (see the example after this list).
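
Since array and object values land in the destination as blobs, a reasonable mental model, and it is only an assumption here, is that nested values are serialized to JSON text before loading:

    import json

    # Assumption: nested values are serialized to JSON text ("blobs")
    # rather than expanded into separate destination columns.
    record = {"id": 7, "tags": ["a", "b"], "meta": {"source": "s3"}}
    row = {k: json.dumps(v) if isinstance(v, (list, dict)) else v
           for k, v in record.items()}
    print(row)  # {'id': 7, 'tags': '["a", "b"]', 'meta': '{"source": "s3"}'}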

Schema Considerations

  • Users can adjust inferred data types but must be cautious of mismatches that could lead to pipeline failures.
  • Schema changes after setup are not recommended, as they may require a full re-sync; for schema modifications, create a new source and pipeline instead.
  • The connector generates a custom primary key (rowNum_filename) for the destination table, which facilitates file syncing and versioning.
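
The exact key format is internal to the connector, but given the name rowNum_filename, a plausible construction is simply the row number joined to the file name; the helper below is illustrative only.

    # Illustrative only: a plausible construction of the rowNum_filename key.
    def primary_key(row_num: int, file_name: str) -> str:
        return f"{row_num}_{file_name}"

    print(primary_key(42, "exports/daily_2024-01-01.csv"))
    # 42_exports/daily_2024-01-01.csv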

Data and Schema Consistency

  • Empty strings in CSV, TSV, and TXT files are treated as null unless the column's data type is specified as string.
  • The connector samples 50 rows from up to 5 files to infer the schema.
  • All files within a source must maintain a consistent schema, and the data must be valid and parsable; failing either requirement may result in pipeline failure.
  • Test the source connection (Test connection) before fetching the schema to identify any authorization issues; a rough sketch of such a check follows this list.
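
A check of this kind can be approximated outside the product with two AWS calls: assume the configured role with the External ID, then attempt a minimal list operation. The ARN, External ID, and bucket below are placeholders.

    import boto3
    from botocore.exceptions import ClientError

    # Placeholders mirroring the connection configuration above.
    ROLE_ARN = "arn:aws:iam::111122223333:role/s3-connector-role"
    EXTERNAL_ID = "your-unique-external-id"
    BUCKET = "example-bucket"

    def test_connection():
        """Assume the role and try a minimal list call, as Test connection would."""
        try:
            creds = boto3.client("sts").assume_role(
                RoleArn=ROLE_ARN,
                RoleSessionName="connection-test",
                ExternalId=EXTERNAL_ID,
            )["Credentials"]
            s3 = boto3.client(
                "s3",
                aws_access_key_id=creds["AccessKeyId"],
                aws_secret_access_key=creds["SecretAccessKey"],
                aws_session_token=creds["SessionToken"],
            )
            s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1)
            return True
        except ClientError as err:
            print("Authorization problem:", err)
            return False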

By adhering to these configurations and considerations, users can effectively set up and maintain the S3 Connector, ensuring smooth and accurate data replication processes.

Collections

  • Only one table per pipeline/source is supported on this connector.

Limitations

  • Only files in the formats below, and their gzip- and bzip2-compressed versions, are supported:
    • CSV
    • TSV
    • TXT
    • JSON
    • JSONL
  • Files that have not been modified since the last run are not synced again, which prevents syncing the same file multiple times (see the sketch after this list).
  • Ensure that the bucket name, file prefix, and file regex are configured correctly to avoid missed files during sync.
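
The modification-time gate mentioned above can be pictured as a simple checkpoint comparison: only objects whose LastModified timestamp is newer than the last successful run are picked up. This is a hypothetical illustration; the connector's bookkeeping may differ.

    from datetime import datetime, timezone

    import boto3

    # Assumed checkpoint recorded after the previous successful run.
    LAST_RUN = datetime(2024, 6, 1, tzinfo=timezone.utc)

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="example-bucket", Prefix="exports/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > LAST_RUN:
                print("would sync:", obj["Key"])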