ETL terminology

The main terms you will encounter in the ETL documentation include:

Connection

A connection defines the endpoint and the credentials needed to connect to a data repository.

Package

An ETL package is a dataflow or a workflow definition. Dataflows describe the data to process (location, schema, fields), the data manipulations to perform, and the output destinations (location, schema).

Workflows define dependencies between tasks. For example: after dataflow A finishes successfully, run dataflow B.

Once the package is defined, it is run as a job on a cluster.
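As an illustration only (the actual package format is product-specific, and the names `dataflow_A`, `dataflow_B`, and `warehouse` are hypothetical), a dataflow and a workflow definition might be sketched as:

```python
# Hypothetical sketch of an ETL package; the real definition format
# used by the product is not shown in this documentation.

# A dataflow names the data to read (location, schema, fields), the
# manipulations to perform, and the output destination.
dataflow_a = {
    "name": "dataflow_A",
    "source": {
        "connection": "warehouse",        # assumed connection name
        "table": "raw_events",
        "fields": ["id", "ts", "payload"],
    },
    "transform": [
        {"op": "filter", "field": "ts", "gt": "2024-01-01"},
    ],
    "destination": {"connection": "warehouse", "table": "clean_events"},
}

# A workflow defines dependencies between tasks: run dataflow B only
# after dataflow A finishes successfully.
workflow = {
    "tasks": ["dataflow_A", "dataflow_B"],
    "dependencies": [
        {"run": "dataflow_B", "after": "dataflow_A", "on": "success"},
    ],
}
```

The two kinds of package mirror the definitions above: the dataflow carries the what and where of the data, while the workflow carries only ordering rules between named tasks.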

Cluster

An ETL cluster is a group of machines (nodes) allocated exclusively to your account's users. You can create one or more clusters, and you can run one or more jobs on each cluster. A cluster you've created remains allocated to your account until you request its termination.

Job

An ETL job is the process responsible for running a specific package on a cluster. A job is a batch process: it processes a finite amount of data and then terminates. Several jobs can run the same package simultaneously. When you run a new job, you select the package to execute and the cluster on which to execute it.
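A minimal sketch of the job lifecycle, purely for illustration (the helper `run_job` and the names `dataflow_A` and `prod-cluster` are hypothetical, not part of the product's API): a job binds one package to one cluster, runs as a finite batch, and terminates.

```python
# Hypothetical sketch of the job lifecycle described above; the real
# job-submission API or UI is product-specific.
def run_job(package_name: str, cluster_name: str) -> dict:
    """Run the named package on the named cluster as a batch job."""
    job = {"package": package_name, "cluster": cluster_name,
           "status": "running"}
    # ... the cluster processes a finite amount of data ...
    job["status"] = "finished"  # a batch job terminates when done
    return job

# Several jobs can run the same package simultaneously,
# even on the same cluster.
jobs = [run_job("dataflow_A", "prod-cluster") for _ in range(2)]
```

The key point the sketch captures is the pairing at submission time: each job names exactly one package and one cluster.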

Account and User

An ETL account represents a group of related ETL users (usually a company) connected to a specific provider/region. An account is created when a user signs up for the ETL service. Each account is linked to a region, which is where its clusters are created and its jobs execute; the region should therefore be the one where your data resides.