Use the file storage source component to read data stored in a file or multiple files in object stores such as Amazon S3, Google Cloud Storage or Azure Blob Storage or in file servers such as SFTP.
Connection
Select an existing file storage connection or create a new one. For more information, see How to connect to your data sources.
Source Properties
Source location
-
Source bucket - The name of your cloud storage source bucket that contains the folders and objects defined in the path. Only relevant in the case of object stores such as Amazon S3 and Google Cloud Storage object stores.
-
Source path - The path to your input folder or file. Examples:
- Folder: sales/2015/01/ (Note: This is NOT supported for GCS. Use pattern instead to specify multiple files.)
- File or object: sales/2015/01/log.csv
- Pattern: sales/2015/{01,02}/
You can use wild card characters for pattern globbing.
Note - file and directory names that begin with an underscore (_) or a dot (.) are ignored and data contained in them will not be read.
Source file format
-
record delimiter - defines what breaks the data into records.
- New line (\n,\n\r,\r) - each line in your files is a record.
- End of file - each file is treated as a record.
-
record type - defines the format of the record.
- Delimited values - fields are delimited by a delimiter you define such as tab, comma or otherwise (see delimited values parameters below).
- Excel - a MS Excel file (.xlsx)
- Json object - each record is a Json object (enclosed in curly brackets).
- Raw - the record is read in its entirety into a single string/binary field.
- Parquet - process .parquet file(s). Note: we are unable to ingest column with DATETIME/INT96 datatype.
Read more about selecting the right record delimiter and format.
Note: The source data can be compressed (zip, gzip or bzip2) or uncompressed. If the zip file is password protected, enter the password in the zip password field; otherwise, leave it blank.The source data must be utf-8 encoded. If the data is encoded in another format, see this doc for instructions on parsing.
Delimited values parameters
If you selected new line record delimiter and delimited values record type, define the delimiter character used to separate fields in your objects and whether the data is enclosed in quotes.
- In the field delimiter drop-down list, select one of the characters (, tab). You can also type a single character or one of the following escape sequences:
- \b (backspace)
- \f (formfeed)
- \n (newline)
- \r (carriage return)
- \t (tab)
- \' (single quote)
- \" (double quote)
- \\ (backslash)
If some or all of the fields are enclosed in single or double quotes, select ' or " in the string qualifier drop-down list. If the fields may also contain line breaks, select " (newline inside) or ' (newline inside) according to the string qualifier used in the files. Use the "newline inside" option with caution as unbalanced double quotes may have undesired effects on job performance.
You can select an escape character based on your input files if a string qualifier is chosen on the previous step. This is so that the escape characters are omitted and records are ingested accordingly.
Check First row contains column names if there is a header row in each source file and you wish to skip it.
Excel parameters
-
Sheet Name - You need to specify one sheet name from your Excel file.
- Check First row contains column names if there is a header row in the sheet and you wish to skip it.
Json parameters
-
Base record JSONPath Expression - You can use a custom JSONPath expressions to define the base record and extract nested objects and arrays. Select the object preset, to use the keys of the JSON object as the input fields. Select the array preset, to use the keys of the JSON objects within the array as the input fields. Type in a custom JSONPath expression to extract fields of nested objects/arrays (e.g.
$.data[*]/
would use the keys of the JSON objects with the array named "data" in the input JSON).
Source action
-
Process all files directly from source - by default, files are read and data is manipulated within the same process.
-
Process only new files (Incremental load) - Define connection and path to store file manifest (see below) to let Integrate.io ETL read only new files in your source bucket/path. This is not supported for Excel files.
-
Manifest connection - By default set to the source's connection. You may use another connection that has read/write access to store the manifest file.
-
Manifest path - Fill in
bucket/directory/filename.gz
. This is where the manifest file will be stored when the job is executed. The manifest lists all files that have been processed. Each job execution compares the current file listing in the input path to the manifest and only processes new files. Backups of the previous manifests will be stored with each job execution for debugging/rollback purposes.
-
Copy, merge and process all files - Use when the source path contains many small files and you'd like all of them. Integrate.io ETL will first read all files, merge them into larger files and then process them as large files which are faster to read and process.
Note: The process will fail if the input path is for a single file. You may get strange results if your delimited data doesn't end in a line break or if your files contain header lines.
Character Encoding
- You can specify the character encoding of the input file. Only available when source action is set to copy, merge and process all files.
Source Schema
After defining the source location and format, you select the fields to use in the source.
- With delimited values, fields are read by order so make sure you define all fields that exist in the source files by order.
- With JSON input, you may define only the fields that you wish to use in your package.
- With raw input, there is only a single field that contains the entire record data.
- Define how you will refer to these fields (alias) in the other components of your package. If you use illegal characters, we'll let you know before you close the dialog.
- You can also add
file_path
field (Under Meta Data) to get the path to the file as a field in your data.
For each field, define the alias to use for the field in following components and data type. For JSON input, define the key in the source JSON file for each field. Read more about processing JSON data here.