
Getting Started

Integrate.io ETL’s platform allows organizations to integrate, process, and prepare data for analytics in the cloud. By providing a coding- and jargon-free environment, Integrate.io ETL’s scalable platform ensures businesses can quickly and easily benefit from the opportunities offered by big data without having to invest in hardware, software, or related personnel. With Integrate.io ETL, every company has immediate connectivity to a variety of data stores and a rich set of out-of-the-box data transformation components.

The main terms you will encounter in the Integrate.io ETL documentation include:

Connection

Connections define the data repositories or services your Integrate.io ETL account can read data from or write data to. The connections contain access information that is stored securely and can only be used by your account's members.
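
Before creating a connection, it can save troubleshooting time to confirm, outside Integrate.io ETL, that the credentials and network access you plan to store actually work. The sketch below is not part of the product; it assumes a hypothetical PostgreSQL host (db.example.com), database, and read-only user created for the pipeline.

```python
# Minimal pre-flight check (outside Integrate.io ETL): verify that the host, port, and
# credentials you plan to store in a connection are reachable and valid.
# The host, database, user, and password below are hypothetical placeholders.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="db.example.com",       # hypothetical host
    port=5432,
    dbname="analytics",
    user="integrateio_reader",   # a user granted only the permissions the pipeline needs
    password="********",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT 1;")     # simplest possible query: proves connectivity and authentication
    print(cur.fetchone())        # expected output: (1,)
conn.close()
```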

Package

An Integrate.io ETL package is a data flow definition. Each package is either a dataflow or a workflow.

  • Dataflow: an Integrate.io ETL dataflow describes the data sources, the transformations to perform, and the output destinations (location, schema).
  • Workflow: a package that defines dependencies between tasks, such as executing a SQL query or running a dataflow package. You define the dependencies and the conditions for executing each task; for example, a task can run only after the previous task has completed successfully (illustrated in the sketch below).
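
To make the workflow idea concrete, here is a minimal sketch in plain Python (not Integrate.io ETL syntax, where conditions are configured in the UI) of tasks that run in order, each executed only if the previous one completed successfully. The task names are hypothetical stand-ins for a SQL query task and a dataflow package.

```python
# Conceptual illustration only, not Integrate.io ETL code.
# A workflow is a set of tasks with dependencies and execution conditions,
# e.g. "run the dataflow only if the preceding SQL task succeeded".
from typing import Callable, List, Tuple

def run_workflow(tasks: List[Tuple[str, Callable[[], bool]]]) -> None:
    """Run tasks in order; stop as soon as a task fails (an "on success" condition)."""
    for name, task in tasks:
        ok = task()
        print(f"{name}: {'completed' if ok else 'failed'}")
        if not ok:
            print("Skipping the remaining tasks because a dependency failed.")
            break

# Hypothetical tasks standing in for "execute a SQL query" and "run a dataflow package".
run_workflow([
    ("truncate_staging_table", lambda: True),
    ("run_dataflow_package",   lambda: True),
])
```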

Once you define a package, you can verify it, and, as in any development lifecycle, fix any errors and re-verify until the package is ready to run as a job on a cluster.

You can create your own package from scratch or use one of our pre-defined templates. To create a package from a template, create a new package and select the desired template from the 'Templates' dropdown.

Cluster

An Integrate.io ETL cluster is a group of machines (nodes) that is allocated exclusively to your account's users. You can create one or more clusters, and you can run one or more jobs on each cluster. A cluster that you've created remains allocated to your account until you request to terminate the cluster.

Your account includes a free sandbox cluster for testing jobs on relatively small amounts of data during the package development cycle.

Job

An Integrate.io ETL job is a process that runs a specific package on a cluster. A job is a batch process: it processes a finite amount of data and then terminates. Several jobs can run the same package simultaneously. When you run a new job, you select the package that the job should execute and the cluster on which to run it.

Getting Started

After signing in to your account, you will be taken to Integrate.io ETL’s dashboard, where you can access all components of Integrate.io ETL: connections, packages, jobs, clusters, schedules, and settings.

  1. The first step is to create a package: click the New package button (2) under Packages (1) and select Dataflow (3).
    You can create your own package from scratch or use one of our pre-defined templates. Templates are packages that capture all of the important information from a particular source and push the data into the chosen destination.

    To create a package from a template, create a new package and select the desired template from the Templates dropdown. Follow the instructions in the package notes to change variable values or adjust the template to your needs.



  2. In the dataflow UI, select +Add Component and choose which data source you would like to pull data from.


  3. After choosing your source component, click in the middle of the striped rectangle and configure this component (4).
  4. Click on the dropdown menu and select the connection type. If you haven't created the connection yet, click on +New to create it. 
  5. In the selected connection's form, add the required details. Note that for some connection types, you need to allow Integrate.io ETL access to the service or data repository before creating the connection. This may require setting up firewall rules, starting SSH tunnels, or creating users with the minimum required permissions. Read more about allowing Integrate.io ETL access to your data repositories here. NOTE: If your source isn’t listed in the connections list, you can use the REST API component to connect to most SaaS applications and other data stores that expose a REST API, and you will likely find a template to help you speed up the process (see the sketch after these steps).
  6. After selecting the desired connection, define your source properties and the schema. You will be able to view the detected schema and a preview of the data, which should assist in selecting the desired fields.

  7. Hover over the source component and click the + sign in the blue bar below (5). You can add transformation and destination components to the flow. Use transformations to manipulate, shape, standardize, and enrich your data. Read more about using transformations here.
    For now, let’s choose the destination of interest (6).


  8. Click the destination component and configure the endpoint. After selecting the target connection and defining the destination properties, you will be able to map the input schema to the target table columns. Note that clicking the Auto-fill button in the upper right corner of the schema mapping step auto-detects the schema of the data pipeline, saving you the time and effort of populating the schema manually. When completed, click Save.


  9. In the upper right corner of the package editor, click the checkmark button to Save and validate (8) the package. Validation checks the package for design-time errors and saves the changes made to the package. Read more about validating a package here. After the package passes validation, you can click Save & Run job (9) to save the package and run it as a job on a suitable cluster.


  10. In the Run job dialog, select the desired cluster. You can select a sandbox (which is free in all accounts and meant for development purposes) or a production cluster. If needed, you can create a new cluster. Read more about creating a cluster here.

  11. Then, select the package that should be executed (by default, the current package is selected), and click Run job.
    You can monitor the progress and details of each job and, once it completes, view a sample of the job's outputs. If the job fails, you will be able to see the error messages that caused it to fail.
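
If your source is a SaaS application you reach through the REST API component (step 5), it can help to confirm outside Integrate.io ETL that the endpoint is reachable and returns JSON before you configure the component; the fields in the response are what you will later map to target columns (step 8). The sketch below is only an illustration: the URL, authentication header, and pagination parameter are hypothetical placeholders for your own service.

```python
# Hedged sketch, not Integrate.io ETL code: check that a REST endpoint responds with JSON
# before configuring it as a source. URL, token, and parameters are hypothetical.
import requests  # pip install requests

response = requests.get(
    "https://api.example.com/v1/orders",          # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},  # hypothetical authentication scheme
    params={"page": 1},                           # hypothetical pagination parameter
    timeout=30,
)
response.raise_for_status()                       # raise if the service returned an error
records = response.json()
print(f"Fetched {len(records)} records")          # these are the fields you will map to
                                                  # target columns during schema mapping
```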

Congratulations, you have completed your first job!