You’ve spent hours tinkering and preparing the perfect dataflow to batch process zillions of web logs. Feeling satisfied, you run the job on one of the clusters and leave your desk. The boss catches you on the way out - he wonders what’s going on with the clicks and impressions report. You promise it will be ready tomorrow and head for the exit.

The next morning, you comńe to work smiling, thinking that the work for the day is already done. You open the output for last night’s job. No results.

This horror story may sound familiar. It’s not easy to get the dataflow right, especially with batch processes for massive amounts of data. That’s why we would like to suggest eight best practices to help improve your data integration debugging and development.

Data Integration Development Best Practices

1. Start Small

Don’t try to tackle the entire peta, tera, or even giga of data right away. Start with a small sample of the dataset for development and debugging purposes. Take just one file, or use a WHERE clause if the data comes from a relational database. Using too much data at this point only lengthens development time. Process the entire dataset further down the line after you have confirmed that your dataflow works correctly.

2. Develop Gradually

Developing a long and complicated dataflow only to see it fail can waste plenty of time, not to mention that it is rather hard to debug. Instead, develop it gradually, part by part. Keep in mind we are working with a small sample of the data, so check the output after each intersection and make sure that the results are correct.

3. Filter Out Useless Data

Certain data processing functions, e.g. regular expressions and string manipulations, are heavy on resources. It’s best to keep it lean and only feed them with the data that you actually need. Select only relevant fields via projection and use filters to keep irrelevant data out of the flow.

4. Join Carefully

Many things can go askew with joins. Just recently we ran a join that went horrendously wrong. What we thought were nulls in the join field were actually empty strings. The join thus ended up as a full multiplication of both sides and ran to eternity and beyond.

Working with a small subset of data during the development process is already a good step forward. It is also highly recommended to take a look at data from the join sources and manually check whether they are joined correctly by checking row counts and value histograms after the join.

The next step you should take is familiarize yourself with the different types of joins and which one is right for your data:

  • Replicated Join - one of the datasets is small enough so that it fits in the memory

  • Skewed Join - one of the keys is much more common than the others, and the data for it is too large to fit in the memory

  • Merge Join - the datasets are both sorted in ascending order by the join key

  • Merge-Sparse Join - both tables are sorted by the join key in ascending order, and the right side table has less than 1% matching keys

  • Default Join - the data doesn’t fit into any of the above cases

Make sure, of course, that you put the relevant data source on the correct side depending on the join type. For example, a small dataset that fits in the memory should go on the left side of replicated join.

5. Store Results as Files

Using the database as immediate output during development is not such a good idea - you will find out about errors, like an invalid schema, only when inserting the crunched data into the DB. Better leave database trouble for later and in the meantime use output files on cloud storage such as Amazon S3. Once you’re done testing the flow, take the data from S3 and store it in the DB to fix any database schema and constraints issues - it’ll be a lot faster now that the bulk of the process is behind you. Finally, don’t forget to modify the original dataflow so that it also stores results in the DB from now on.

6. Split Parallel Dataflows

Dataflows where one data source is split into several parallel flows may work better when split into entirely separate dataflows. It could keep your processes better organized, and you will also be able to run them in parallel independently of each other. However, this isn’t always the case, so test if it works better as one dataflow or as separate dataflows.

7. Split Complex Dataflows

Dataflows that are too big and complex should also be split into several dataflows. This helps to debug each one more easily and make sure everything works correctly. The downside is that you’ll have to manually set output from the previous flow as input to the next one. This can be done automatically in via our API or with variables.

8. Use GZIP

Compressing input and output files saves plenty of time. Yes, it takes more CPU power to compress and decompress data, but that’s nothing compared the time saved transferring bytes over the network.


Data integration development is no easy feat, but certain practices can help to make it smoother: using a small dataset, gradual development, filtering out useless data, using the right joins, writing to files instead of the database, splitting parallel dataflows, splitting complex dataflows, and using GZIP compression. If you develop your dataflows correctly, your boss may be happy in the morning after all. At least about your reports.