
Integrate.io Best Practices Guide: How to Optimize Your Account and Become a Power User

Unlock the full potential of Integrate.io with best practices for Speed & Performance, Security & Compliance, and Scheduling & Automation.

Our intuitive drag-and-drop interface streamlines pipeline management, reducing manual effort so you can focus on insights and action. Use component previews to test segments in isolation and optimize efficiency.

Need support? Our 24/7 chat team is here, and you can schedule a Success Engineer review for personalized recommendations.

Let’s get started—explore the table of contents to optimize your workflows and elevate your data operations.

Pillar I - Speed and Performance

  1. Start with a small data set.
  2. Don’t forget to save and validate your work.
  3. Only pull the data you need (incremental load).
  4. When and how to use a sandbox cluster.
  5. Offload work to the database or SFDC query (filter).
  6. Labels for easy understanding (naming conventions).
  7. Optimize for large data sets (the power of parallelism).
  8. Become aware of the tools available (setting variables).
  9. Smooth, simple, and systematic error handling (workflows).
  10. Over-communicate so everyone’s on the same page (notes).
  11. Create separate environments for staging/production (SDLC).
  12. One big package vs. several small packages (node load time).
  13. Organize data pipelines by team/medium/purpose (workspaces).
  14. Test in isolation: component preview/expression editor/X-Console.

Pillar II - Security and Compliance

  1. Establish a secure connection to Integrate.io.
  2. Require your team to become security aware (2FA).
  3. Decrease the odds of a compromised password (SSO).
  4. Hash or encrypt data your customers want kept private (PII).
  5. Know the privacy laws in your area or industry (HIPAA/GDPR).
  6. Be careful about granting admin privileges (user access control).

Pillar III - Scheduling and Automation

  1. Clone a similar data pipeline to save time.
  2. Check our templates before building from scratch.
  3. Find the right flow or frequency for you (schedules).
  4. Get notified ASAP if a job or package fails (service hooks).
  5. Plan ahead for API keys or values that change (user variables).

Given Integrate.io's flexibility and wide range of use cases, there’s plenty to explore! We’ll help you get up to speed quickly so you can make the most of Integrate.io. Feel free to share this guide with your team to ensure everyone is set up for success. Let’s get started!

Pillar I - Speed and Performance

A common client request we hear at Integrate.io: “We need to make the pipeline run faster.” Understandable - speed matters! 

But performance isn’t just about speed. Data quality is critical. In datasets with 100,000+ rows, a single error can have a huge impact. That’s why testing and debugging best practices are essential—to keep your pipelines running smoothly and error-free.

Start with a small data set.

First, let’s cover the language of Integrate.io. Connections are source and destination locations; this applies to databases, file storage, and REST APIs. Packages are data pipelines. A dataflow is a single pipeline. A workflow is a series of tasks – both dataflows and SQL queries can be tasks in a workflow.

Integrate.io provides a Component Previewer, which is a fast and easy way to test your data pipeline. Click the yellow “Preview” button and you’ll see a small sample of your data. This is the simplest way to verify your data is structured according to any relevant rules or quality control standards. Note that when using the Component Previewer, it’s best to have a Sandbox cluster running, as this reduces preview runtime.

We recommend starting with a small table that’s designed for the specific purpose of testing. It’s wise to confirm your transformations -- both at the field and table level -- are having the desired effect before applying them to your full data set. 
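
To make the idea concrete, here’s a minimal sketch in plain Python (not Integrate.io’s component syntax) of validating a field-level transformation against a small, purpose-built sample before trusting it on the full table. The table, columns, and sample values are made up for illustration:

  import sqlite3

  # A tiny in-memory table standing in for a purpose-built test dataset.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE customers_test (email TEXT)")
  conn.executemany(
      "INSERT INTO customers_test VALUES (?)",
      [(" Alice@Example.COM ",), ("bob@example.com",), ("  CAROL@test.io",)],
  )

  # Pull only a small sample, the same way you'd preview a limited slice of a pipeline.
  sample = [row[0] for row in conn.execute("SELECT email FROM customers_test LIMIT 100")]

  # The field-level transformation under test: trim whitespace and lowercase.
  cleaned = [email.strip().lower() for email in sample]

  # Confirm the transformation behaves as intended before running it at full scale.
  assert all(" " not in email for email in cleaned), "unexpected whitespace after cleaning"
  print(cleaned)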

For a deeper dive on this topic, note the documentation linked below:

Don’t forget to save and validate your work.

Have you ever spent hours designing a perfect PowerPoint presentation, forgot to save it, and then had to start from scratch? It happened to me pre-autosave and I still have nightmares.

If you’re not careful, you can suffer the same fate with your data pipelines. It’s super important to save often or you’ll risk losing your hard work. This applies at both the micro and macro level.

Components are blocks of code that change or transform your data. Save each component when you’re done editing.  You can click “...” at the top right to save and validate your pipeline.

To learn more about the tests we perform to validate your work, read the doc below:

Only pull the data you need (incremental load).

Incremental load is a life-saver when it comes to optimizing your data pipelines for speed.

You only need one full sync to pull in all of your existing data. Beyond that point, repeating a full sync is an expensive operation that provides no benefit.

What’s the fix? Adjust your package to only pull in new or updated data. This can be achieved with a timestamp that updates anytime a field value changes.

The syntax is simple, especially if you’re familiar with SQL and conditional logic. Add a WHERE clause like this to your database query: updated_at > '$last_updated_at'::timestamp
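
If it helps to see the pattern outside of Integrate.io, here’s a rough plain-Python sketch of the same watermark idea, using an in-memory table with made-up columns. In a real package, the $last_updated_at variable in the WHERE clause above plays the role of the watermark:

  import sqlite3

  # Stand-in for a source table; in practice this lives in your production database.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
  conn.executemany(
      "INSERT INTO orders VALUES (?, ?, ?)",
      [(1, 10.0, "2024-01-01 09:00:00"), (2, 25.5, "2024-03-02 12:30:00")],
  )

  # The watermark from the previous successful run -- the role $last_updated_at plays above.
  last_updated_at = "2024-02-01 00:00:00"

  # Incremental load: fetch only rows created or changed since the watermark.
  changed = conn.execute(
      "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
      (last_updated_at,),
  ).fetchall()

  print(changed)  # only order #2 -- the full table is never re-read
  # After a successful load you'd persist max(updated_at) as the next watermark.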

To get more detailed instructions about how to set up an incremental load, read the docs below:

When and how to use a sandbox cluster.

Integrate.io provides two types of clusters. The production cluster is meant for data pipelines running in production on a schedule. The sandbox cluster should be used during development and testing. The sandbox cluster is also a great place to run a historical load if you need one periodically.

A cluster is a group of nodes assigned to your account. One or more jobs can be run on a single cluster. The main difference is scalability. Production clusters can be scaled. The sandbox can’t.

If you’re testing, stick with the sandbox. After the tests pass, switch to production. Simple.

For more details about ETL testing and both cluster types, read the content linked below:

Offload work to the database or SFDC query (filter).

If you’re working with a database or data warehouse and looking for faster runtimes, here’s a tip. Many of Integrate.io’s transformations can be done via SQL commands in a database. So instead of using a component to join/filter/aggregate data, you can write a SQL query in the database source component that performs those same actions. The work of performing those transformations is then done by the database instead of our infrastructure. The database is faster because it has the data right there. No middleman required.

If you have an abundance of joins and transformations, put those in a SQL query instead. This will make a dent in your runtime. Bear in mind that this strategy requires SQL smarts, and the data must live in the same database.
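
For illustration, here’s a hedged sketch (plain Python with an in-memory SQLite table and made-up table and column names) of the kind of single query -- join, filter, and aggregate in one statement -- you might paste into a database source component instead of chaining separate components:

  import sqlite3

  # Toy tables standing in for real source data.
  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE customers (id INTEGER, region TEXT);
      CREATE TABLE orders (customer_id INTEGER, amount REAL);
      INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
      INSERT INTO orders VALUES (1, 40.0), (1, 60.0), (2, 15.0);
  """)

  # Join, filter, and aggregate in one query so the database -- not the pipeline --
  # does the heavy lifting.
  query = """
      SELECT c.region, SUM(o.amount) AS total_revenue
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
      WHERE c.region = 'EU'
      GROUP BY c.region
  """
  print(conn.execute(query).fetchall())  # [('EU', 100.0)]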

Labels for easy understanding (naming conventions).

If you don’t name a component, Integrate.io will automatically name it for you. That said, you’d be better off with a customized name that fits the component’s purpose. 

Imagine you have a huge data pipeline with a long list of transformations. One of them involves parsing nested JSON data. Which name is easier to understand?

Select_3
Parse_JSON

We encourage you to meet with your team and agree to a labeling or naming convention system that is easy to understand for all parties involved.

You can also use our Notes functionality in the package designer to leave details of certain components for you and your team.

Optimize for large data sets (the power of parallelism).

Two of the most common situations where you might need to optimize for a large volume of data:

  1. Calling an API

  2. Reading from a database source component

Let’s say you want to get data from one endpoint and push that information to another endpoint. A common example would be grabbing an ID and then fetching information associated with it.

Without parallelism, Integrate.io calls that endpoint sequentially, passing in each individual value one at a time. To speed this up, you can enable parallelism so that Integrate.io splits the total number of API calls across five threads per node on the cluster.

A one-node cluster supports five threads; two nodes take the total to ten. In other words, parallelism can multiply the speed of these jobs by roughly five. Nifty, right?
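
To see why this matters, here’s a small plain-Python sketch that simulates the difference between calling an API sequentially and spreading the same calls across five worker threads. The API call is faked with a short sleep, and the numbers are purely illustrative:

  import time
  from concurrent.futures import ThreadPoolExecutor

  ids = list(range(1, 21))  # e.g., 20 record IDs to enrich via an API

  def call_api(record_id):
      # Stand-in for a real HTTP request; each call takes ~0.2 s.
      time.sleep(0.2)
      return {"id": record_id, "status": "ok"}

  # Sequential: roughly 20 x 0.2 s = ~4 s.
  start = time.time()
  results = [call_api(i) for i in ids]
  print(f"sequential: {time.time() - start:.1f}s")

  # Five threads (one node's worth of parallelism): roughly 4 s / 5 = ~0.8 s.
  start = time.time()
  with ThreadPoolExecutor(max_workers=5) as pool:
      results = list(pool.map(call_api, ids))
  print(f"5 threads:  {time.time() - start:.1f}s")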

We can accomplish the same goal in the database source component by defining a key column and increasing the value of max parallel connections up to ten. 

You can also use the sort trick to force parallelism with your REST APIs: add a Sort component to your pipeline, just in front of the Curl call. Sort uses Hadoop to map, reduce, and parallelize your dataflow. Then you can increase default parallelism* by about 5-6 per node.

Always test with one node at first. If your data processing volume reaches a level where you need more nodes than allocated to your account, reach out to your AE or Success Engineer.

* For more info, consult this doc (under the header: “System Variables”).

To see more in-depth instructions on this topic, consult the following Knowledge Base articles:

Become aware of the tools available (setting variables).

Programmer types are familiar with variables. These are a very useful and efficient feature, especially when you’re working with a dynamic value. Integrate.io provides three types of variables:

User variables (set by you)

Pre-defined variables (set by Integrate.io)

System variables (set by you or Integrate.io)

Variable values are expressions, which means you can use functions or methods to assign a dynamic value, such as a timestamp or an ID that changes based on the context.
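
As a rough analogue in plain Python (not Integrate.io’s expression language), here’s what an expression-backed variable buys you: a value such as “yesterday’s date” that resolves at run time instead of being hard-coded. The bucket name and file path below are hypothetical:

  from datetime import datetime, timedelta, timezone

  # The analogue of a variable whose value is an expression rather than a literal:
  # it resolves to "yesterday" each time the job runs.
  run_date = datetime.now(timezone.utc).date()
  load_date = (run_date - timedelta(days=1)).isoformat()

  # The variable can then be referenced anywhere a dynamic value is needed,
  # e.g. in a file path or a WHERE clause.
  output_path = f"s3://my-bucket/exports/{load_date}/orders.csv"  # bucket is hypothetical
  print(load_date, output_path)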

Dive into the documentation below for a deeper understanding of variables in Integrate.io:



Smooth, simple, and systematic error handling (Workflows).

A workflow is a series of dataflows with conditional logic. Workflows are often used to log errors. 

For example: if you’re working with an API and the call is unsuccessful, you can push that info to an error logging table for future investigation. The same is true for failed jobs in general.

Workflows can also be used for complex processes. If you need to push data to one table before accessing that data in another package, you can connect them with a workflow. 

There are three options for your conditional logic here:

  1. On completion (if this job completes, then do this)

  2. On success (if this job succeeds, then do this)

  3. On failure (if this job fails, then do this)
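
Here’s a hedged plain-Python sketch of the “on failure” error-logging idea described above: failed API calls are collected and written to an error log for later investigation. The endpoint URL and field names are placeholders, not a real API:

  import csv
  import requests

  records = [{"id": 1}, {"id": 2}]
  errors = []

  for record in records:
      try:
          # Hypothetical endpoint; any connection problem or non-2xx response raises.
          resp = requests.post("https://api.example.com/sync", json=record, timeout=10)
          resp.raise_for_status()
      except requests.RequestException as exc:
          # "On failure" branch: push the details to an error log for investigation.
          errors.append({"id": record["id"], "error": str(exc)})

  # "On completion" branch: always write the error log, even if it is empty.
  with open("error_log.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=["id", "error"])
      writer.writeheader()
      writer.writerows(errors)

  print(f"{len(errors)} failed records logged")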

Learn more about workflows and how they work at the links below:

Over-communicate so everyone’s on the same page (Notes).

If you’ve ever written code, you understand the importance of comments. No matter how perfect your code might be, that doesn’t mean it will make sense to the rest of your team.

The same is true for data pipelines. If there are essential details that will help your coworkers understand the package structure, click the yellow note button at the top of your dashboard.

Notes can also be used to plan ahead and troubleshoot issues. For example: “Hey team, I am running into an issue with parsing this nested data. Could you check my syntax for issues?”

Another one: “We need to connect the customer and orders table together here. Can we finalize the list of fields that should be pushed to our database? Thanks!”

For more tips to help you be more productive and efficient with Integrate.io, read the doc below:

Create separate environments for staging/production (SDLC).

The software development life cycle is designed to prevent errors and breaking changes.

Developers and engineers understand the importance of testing before deployment. If you push untested code to production, you could make a mistake that breaks the entire app or website.

The same practice should be applied to data pipelines, too. If you push an untested pipeline to production, you run the risk of corrupting your data destination with inaccurate information.

This is a recipe for disaster. It could take hours, days, or weeks to reverse these changes and get your data in good shape. To prevent this nightmare, you need a staging environment.

A staging environment is a great safety net. You could make a hundred mistakes without doing any damage to your production database. This feature is available on our Enterprise plan or can be added as a line item to one of our other plans.

Alternatively, some clients use Workspaces to split their packages into stages – Development, Staging, and Production.

We also provide version control, so you can keep track of changes and revert to a previous version as needed. If you want to learn more about testing and version control, note the content below:

One big package vs. several small packages (node load time).

Should you sacrifice speed for simplicity? It depends. If you’re building and testing a package, we recommend focusing on one pipeline -- or transfer from source to destination -- at a time.

Otherwise, you’ll have a tough time troubleshooting any errors. That said, we encourage you to combine multiple pipelines in a single dataflow package after development to cut job run time.

Set up and tear down of a package is expensive from a usage perspective. After you confirm each pipeline is in good shape, copy/paste the related ones into a package for time savings.

You can also simplify packages and reuse code blocks with workflows as we discussed above.

Organize data pipelines by team/medium/purpose (Workspaces).

If you’re managing a ton of data pipelines with different purposes, your dashboard will get busy. Scrolling through hundreds of packages is a slow process. Let’s prevent the inconvenience.

Workspaces are a file and folder system. You can group packages together based on any criteria that matter to you. For example:

- Technology (Google, Facebook, Shopify)

- Team/department (support, sales, marketing)

- Client (ideal for agencies and similar business models)

No matter what your requirements are, you can use Workspaces to keep your team focused. Read more at the Knowledge Base article below:

Test in isolation: component preview/expression editor/X-Console.

Testing should be done early and often. Imagine building a data pipeline that includes 10-100 components and not testing until the end. Good luck finding the cause of your error message.

Integrate.io makes it easy to avoid this frustration. We’ve already covered the component previewer, which shows you a preview of the data. It’s an easy way to verify your data is in the right shape.

You can use our expression editor and X-Console for a similar purpose. If the expression editor doesn’t like your logic, it won’t let you save the result. Dig into our docs to diagnose what’s up.

X-Console is similar to a terminal or console.log. Copy/paste your expression, press “Enter,” and cross your fingers. If the output doesn’t look right, you can experiment until it does.

Click the angle-bracket icon (<>) at the bottom of the page to open X-Console. This is also a great way to get familiar with the functions and expressions available.

For more information about testing and using expressions in Integrate.io, read the content below:

Pillar II - Security and Compliance

Being featured on the news is fun (unless it’s due to a huge security breach at your company). Who wants to be the star of that headline? Nobody I know. Talk about a hit to your reputation. 

All it takes is one compromised password to start a nightmare from which there’s no escape. This is why we provide several simple ways to safeguard your most confidential information. Implement the following steps to minimize your risk and become known as a security guru.

Establish a secure connection to Integrate.io.

We provide four ways to connect your sources and destinations to Integrate.io:

  1. Direct (Secure)

  2. SSH Tunnel (Super Secure)

  3. Reverse SSH Tunnel (Super Duper Secure)

  4. External SFTP (Send Your Data to a Cloud Server)

We never save or store your data. We’re only involved with the transit of your data.

All four methods include encryption to protect your precious and confidential information.

We’ll iron out this detail during onboarding. Ask your success/solutions engineer for more info. You’re also welcome to read the in-depth documentation below:

Helpful hint: type your source or destination into the Knowledge Base search box to find a doc that walks you through establishing that connection (look for titles beginning with “Allowing Integrate.io access to...”).

Require your team to become security aware (2FA).

First: strong passwords are a must. Shoot for twenty characters and a complex structure. Here’s a helpful hint -- full sentences are easy for you to remember and hard for hackers to guess.

Two-factor authentication (2FA) is one of the most effective ways to enhance your online security. Imagine one of your employees gets hacked because they reused a password.

With 2FA in place, it’s a moot point (unless the hacker also stole your employee’s device -- unlikely!). The login must be verified on a second device. Without that, the hack won’t succeed.

It only takes a second to set up and the benefit lasts forever. For more details, read this content:

Enhance Security and Simplify Access with SSO.

Everyone knows reusing passwords isn’t ideal, but ensuring your team follows best practices can be a challenge.

While password managers help, Single Sign-On (SSO) offers an easier, more secure solution. Here are the top seven benefits of using SSO.

  1. Eliminates password fatigue.

  2. Enforces better password policies.

  3. Centralized control over who can access the system.

  4. Reduces the need for unsafe password management strategies.

  5. Boosts overall productivity thanks to faster logins and fewer lost passwords.

  6. Lowers password-related calls to IT so they can work on more important tasks.

  7. Lowers the threat of data breaches by moving ID/authentication data off-premises.

Check out the content below to learn more about SSO and how it benefits overall security:

Hash or encrypt data your customers want kept private (PII).

Personally identifiable information (PII) is a big deal. If you work with phone numbers and addresses, be aware: people expect you to keep that information private. Fail, and you’ll lose their trust fast.

We provide two ways to secure sensitive information. You can use a hash function to obscure that text and make it unreadable. Hashing is one-way, so the original value can’t be recovered -- and if it’s unreadable, no one can misuse it.

Alternatively, you can encrypt the data while it’s in transit and decrypt it once it reaches a trusted destination. We use the AWS Key Management Service (KMS) to make this possible.
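
As a simple illustration of the hashing option (plain Python, not Integrate.io’s built-in functions), the sketch below tokenizes a phone number with a keyed SHA-256 hash. The secret key is a placeholder you’d keep in a secrets manager, and the phone number is made up:

  import hashlib
  import hmac

  # Secret "pepper" kept outside the data set; this value is only a placeholder.
  SECRET_KEY = b"replace-with-a-real-secret"

  def hash_pii(value: str) -> str:
      # Keyed SHA-256 (HMAC): one-way, so the phone number can't be recovered,
      # but the same input always maps to the same token for joins and deduping.
      return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

  print(hash_pii("+1-555-0100"))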

This is a complex topic, so we recommend reading the content below for a deeper dive:

Know the privacy laws in your geography and/or industry (HIPAA/GDPR).

Those are scary looking acronyms, so let’s break them down to ensure understanding. 

The Health Insurance Portability and Accountability Act (HIPAA) has been an important U.S. federal law in healthcare since 1996. Its goal is to protect sensitive patient information.

Most patient health records are in digital form now. This convenience comes at a steep cost (potential for hacks). Good luck hacking into a filing cabinet. Fail to meet this law’s standards and you could be fined up to $50,000 per violation. 

The General Data Protection Regulation (GDPR) is designed to protect the personal or private data of citizens in the European Union (EU). If you thought HIPAA was harsh, brace yourself. Fines for noncompliance run as high as €20 million or 4 percent of annual global revenue, whichever is higher.

If either of these laws are relevant to your business model, do yourself a favor and read these:

FYI we can also run your data pipelines in Europe for GDPR using the AWS Ireland region. Notify your success and solutions engineer to take advantage of this customization. 

Be careful about granting admin privileges (User Access Control).

Let’s say you’ve hired a junior data engineer who is enthusiastic and ready to go. They have high confidence, but it’s somewhat unearned due to their lack of experience.

You ask them to update one record of your production database. Somehow a mistake is made and the update affects every record in the database. As Scooby Doo would say: “Ruh-roh!”

First, this is a prime example of why you need a testing and staging environment. Don’t force a junior employee to walk across a tightrope with no safety net. If you do, it’s begging for trouble.

Second, you could prevent this situation by only allowing your junior staff to update in staging. Any errors can be rolled back without affecting the production environment. Now it’s merely a learning opportunity (rather than a total disaster). You can always update their permissions later.

If we’ve convinced you of this idea’s merit, consult the content below for more in-depth info:

Pillar III - Scheduling and Automation

Your data pipelines are fast. Your work process is 100% optimized. Your security is tightened. Congrats! You’ve already made a ton of progress. But you’re not a true power user yet…

Don’t feel bad. There’s one step left. And it’s the easiest one. You already did the hardest part. All that’s left? Remove some manual work. Let’s simplify and automate as much as possible. Don’t skip this necessary step, because it could save you hundreds of hours in the long-run.

Clone a similar data pipeline to save time.

There’s no good reason to reinvent the wheel. If you’re using the same transformations in several packages, you might as well copy/paste those code blocks to save some time.

All you have to do is draw a selection marquee around the components you want to copy, copy them with the keyboard shortcut (Ctrl/Cmd + C), and then paste them into a similar package (Ctrl/Cmd + V). Update the field names and ta-da, you’re done.

If there's only one component to copy and you want to use it in the same package: hover over the component you want to copy and then click on the overlapping squares icon that shows up.

If you run an agency-style business and are building similar pipelines for several clients: click the three dots next to the package name in your dashboard and select “Duplicate package.”

Want more information on the benefits of automation? Read the article linked below:

Check our templates before building from scratch.

We’re proud to provide a wealth of data pipeline templates that make our clients’ lives easier. Click “New Package” in the package dashboard and then select a template for your dataflow.

Below is a list of the popular services (and their most common use cases) we have covered:

  • Adroll
  • Asana
  • Close.io
  • Freshdesk
  • Gitlab
  • Google Sheets
  • Hubspot
  • Intercom
  • Jira
  • LinkedIn Ads
  • Mailchimp
  • Marketo
  • Mixpanel
  • Outbrain
  • Pardot
  • Pipedrive
  • Recurly
  • Sendgrid
  • ShipStation
  • Shopify
  • Sparkpost
  • Stripe
  • Taboola
  • Trello
  • Twilio
  • Wrike
  • Xero
  • Yahoo
  • YouTube
  • Zendesk

If you know other data folks who work with these services and would like to help them become more productive and valued team members, feel free to send them our way.

If you’d like to see us produce a template that isn’t listed here, contact your Success Engineer. They’d be happy to pass your feedback along to the dev team. 

We also provide account reviews as a part of your subscription and this is a great opportunity to optimize for speed and security.

Schedule your call here.

Find the right flow or frequency for you (Schedules).

Do you want to run your jobs manually forever? No way. You have other data pipelines to build. Let’s automate your jobs so you can focus on more important tasks.

Schedules can run at whatever frequency you need. You can set them to run every X minutes, hours, days, weeks, or months. We also provide cron expressions for extra specificity.

For example, you could use this expression to run a schedule every morning at 8 UTC: 0 8 * * *. 
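
For reference, the five fields of a standard cron expression are minute, hour, day of month, month, and day of week, read left to right. A few common patterns (standard cron syntax; double-check the scheduling docs for any platform-specific behavior):

  0 8 * * *       every day at 08:00 UTC
  */15 * * * *    every 15 minutes
  0 2 * * 1       every Monday at 02:00 UTC
  0 6 1 * *       the first day of each month at 06:00 UTC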

A schedule may contain one or several packages. It’s totally up to you. You’re also welcome to customize the number of nodes used. We advise starting with one and scaling up as required.

Learn more about scheduling and automating your most important dataflows at the links below:

Get notified ASAP if a job or package fails (Service Hooks).

If you click the gear-shaped icon in your dashboard, you’ll be taken to your account settings. This is where you can set up a service hook to notify you of job- or cluster-related events.

The most common use of a service hook is setting up a notification to alert you of failed jobs. This notification could be sent via Slack, email, Webhook, or PagerDuty.

With a service hook in place, you’ll be empowered to diagnose bugs or issues as they arise. 
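
Under the hood, a webhook notification is just an HTTP POST. Here’s a generic plain-Python sketch of posting a failure alert to a Slack incoming webhook -- the URL, job name, and message format are placeholders, not Integrate.io’s actual payload:

  import requests

  # Placeholder URL -- use your real Slack incoming-webhook (or other endpoint) URL.
  WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

  payload = {
      "text": ":rotating_light: Job 'daily_orders_load' failed at 03:12 UTC -- check the job log."
  }

  # A plain HTTP POST like this is all a webhook notification is under the hood.
  resp = requests.post(WEBHOOK_URL, json=payload, timeout=10)
  print(resp.status_code)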

For more details on how to monitor your work in Integrate.io, check out the documentation below:

Plan ahead for API keys or values that change (User Variables).

User variables are super helpful when you’re working with usernames, passwords, API keys, and dynamic values (such as an ID that updates on subsequent runs of a job or package).
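
As a plain-Python analogue (the endpoint and variable name are hypothetical), this is the pattern a user variable supports: the key lives outside the pipeline definition, so it can change per run or per environment without editing the package. As recommended below, prove the pipeline first with a hard-coded value, then swap in the variable:

  import os
  import requests

  # During testing you might hard-code the key to prove the pipeline works:
  #   API_KEY = "sk_test_12345"
  # Once it works, read it from a variable instead.
  API_KEY = os.environ.get("CRM_API_KEY", "missing-key")  # variable name is hypothetical

  try:
      resp = requests.get(
          "https://api.example.com/v1/contacts",            # placeholder endpoint
          headers={"Authorization": f"Bearer {API_KEY}"},
          timeout=10,
      )
      print(resp.status_code)
  except requests.RequestException as exc:
      print(f"request failed: {exc}")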

We recommend testing with a hard-coded value to confirm your data pipeline is operating correctly. Afterwards, you can replace that value with a variable to simplify and streamline your process. For more information about how to use and set user variables in Integrate.io, consult this doc: