The Workflow feature is an advanced offering in the Integrate.io platform. For uses who want to combine various dataflow packages via a user-friendly interface, Workflows allow you to arrange your inter-dependent ETL tasks with some conditional logic.
In this post, we will cover what Workflows are and how they can help you manage inter-dependent ETL tasks.
Table of Contents
What are Integrate.io Workflows?
Conceptually, a "Workflow" is a special type of package that lets you arrange tasks in the desired order of execution. A Workflow "task" can be either a pre-existing dataflow package or a SQL query operation. The drag-and-drop interface lets you arrange your tasks in the desired order of execution, making Integrate.io Workflows simple to configure while having a powerful impact on your data.
Workflow Main Features
1) Trigger Different Execution Paths Based on Execution Status
In a Workflow, you can add and connect two tasks based on the execution relationship between them. In other words, you can execute different paths based on the execution status of the preceding tasks. Here's a simple example:
The execution statuses supported by Integrate.io are "on success," "on failure," or "on completion." When you run the above Workflow,
post_completion_query would be executed once the task completes, and either
post_success_query will be executed depending upon the success or failure of the task.
2) Add Execution Constraints Based on the Execution Status of Preceding Tasks
In Workflows, you can add pre-conditional logic to any task based on the preceding tasks' execution status.
You can add any/all execution constraints (logical ORs and ANDs) to the tasks. In the Workflow defined above,
dataflow_task_2 will only execute if both the
sql_query_1 run successfully. If we had chosen the OR constraint for
dataflow_task_2, it would have run if any of
sql_query_1 was executed successfully.
Features 1 and 2 combined can model complex execution profiles where we want our packages to be executed in a specific order (A before B, B, and C before D, etc.), and their execution is dependent on the success or completion of each other (run B if A fails, run D if both B and C succeed, etc.).
3) Pass Dynamic Variables to be Used in Tasks
Just like dataflow packages, the workflow containing them has their own variable scope as well. If the variable is defined at the dataflow package level, it's value gets used; otherwise, Integrate.io will lookup for the variable at the workflow level. These workflow variables can be provided during Workflow creation and can be overridden when executing them through a job or a schedule.
SQL query tasks in a Workflow can return values (query results), which can be saved to Workflow variables containing a scalar value. This value can then be passed to successor query tasks or dataflow package tasks.
In the Workflow above, the first SQL query task checks for any active promotions that are not yet published. If there's any such promotion, its identifier is stored in the variable
active_promotion_id (by default, the first column value in the first row of returned results is stored). This identifier is then passed to the dataflow package variable to the
publish_promo package. Finally, if the package execution is successful, there's a query task to mark the promotion as published in the database. This query uses the same identifier to locate the promotion in the database.
Example Workflow Use Case
An e-commerce company that uses Integrate.io runs periodic marketing campaigns and promotions to improve their business. To kickstart a campaign, they must run a few preparation tasks, and they may want to run some reconciliation tasks after it is over.
Here's one possible way they could define their Workflow:
- Runs an SQL query in their database to check if there's a campaign scheduled to launch on the current day. Save the campaign identifier to a Workflow variable.
- If there's an active campaign, trigger some ETL tasks like creating tables to collect engagement metrics, updating the banner content in landing page CRM, updating the databases of other dependent services (like PDP, Customers, etc.), preparing and sending bulk notifications, performing cache warming activities for targeted customers, and so on. The campaign identifier variable saved in the last step is passed in these successor tasks.
- These interdependent tasks are arranged to be executed in the right order (Bootstrap data tables -> Update the landing page banners → warm the cache → generate bulk notifications).
- If any of these tasks fail, there might be a recovery task that triggers actions like marking a failure status in the database (an SQL query), re-scheduling the task, or calling APIs that perform diagnostics alerting the concerned team about the situation. A similar set of actions could be defined for the case when all the tasks run successfully.
- There can be another SQL query task in the Workflow to detect if there's a campaign that has recently expired and then schedule tasks that aggregate metrics for later analysis or perform clean-ups.
- Once constructed, the entire Workflow can be scheduled to run periodically (let's say 30 minutes before midnight every day), where different inter-dependent tasks run in coordination without manual intervention.
Apart from the useful Workflow features discussed previously, here are some of the other ways in which Integrate.io helped the e-commerce company:
- The tasks mentioned above can be self-contained dataflow packages in Integrate.io. This leads to a modular and maintainable set up where you can distribute responsibilities efficiently to different teams.
- The Workflow arranges these tasks to be run in the correct order while abstracting the implementation complexities. It provides a bird's eye view of how different systems will interact.
- They are versioned so you can switch to previous versions in case of modeling issues.
- You can monitor the execution status of workflows in the Integrate.io dashboard, get notifications on failures, and change their schedule.
- Maintaining such set up on your own can be very time/resource consuming. It involves managing Database connections, maintaining cron expressions and multiple processes, modeling the dependencies, adding an alert system, etc. Integrate.io offers all this already through an easy-to-use low-code interface that can blend into your existing workflows.
Here's a short video that implements a simplified version of the above use-case:
A simple but powerful feature, workflows allow you to combine various dataflows from within the user-friendly Integrate.io interface. For more information on how to build workflows, refer to our Creating a Workflow document or contact our support team for assistance.