This article shows you how to design a data flow that reads data incrementally from file storage, so that only new or changed files are read each time the process executes. You will need:
- A file storage bucket with a directory containing the files to read incrementally, with at least read-only permissions for Integrate.io ETL.
- A file storage bucket with read-write permissions, where Integrate.io ETL will write its metadata.
Start by adding a File Storage source component to your package, then click the new component to edit its properties.
Fill in the bucket and path for the source data (in this example, the bucket is integrate.io ETL.public and the path is /twitter).
Change the source action to “Process only new files (Incremental load)”, select the manifest connection, and add a files manifest path. For the path, use a bucket to which Integrate.io ETL has read-write permissions. In our example, it’s integrate.io ETL.dumpster/manifests/twitter_reader.gz.
That’s essentially it! This tells Integrate.io ETL to list the files in the source path, compare the list to the manifest file, and read only the new or changed files. If the manifest file doesn’t exist, Integrate.io ETL reads all files in the source path; this is what happens when you execute the package for the first time, or if you delete the manifest file. Once the package executes successfully, the files read by the package are added to the manifest file.
Note that your path can contain a pattern or a variable, and incremental reading will still work. However, files that are no longer found in the source path are removed from the manifest. This can be a good thing, since it keeps the manifest file small, but if you later point the source component back at a path you previously read from, those files will be read again.
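The behavior described above can be sketched in a few lines. This is an illustrative model only, not Integrate.io ETL's actual implementation; the file names, the use of ETags for change detection, and the manifest represented as a plain dictionary are all assumptions made for the example:

```python
# Hypothetical sketch of manifest-based incremental loading.
# listing and manifest both map a file path to a version marker
# (e.g. an ETag or last-modified timestamp).

def plan_incremental_read(listing, manifest):
    """Return (files_to_read, new_manifest).

    listing:  files currently in the source path, {path: version}.
    manifest: files recorded after the last successful run.
    """
    # Read files that are absent from the manifest or whose version changed.
    to_read = [path for path, version in listing.items()
               if manifest.get(path) != version]
    # The new manifest mirrors the current listing, so entries for files
    # no longer present in the source path are pruned automatically.
    return sorted(to_read), dict(listing)

# First run: no manifest exists yet, so every file is read.
listing = {"/twitter/a.json": "v1", "/twitter/b.json": "v1"}
reads, manifest = plan_incremental_read(listing, {})

# Second run: b.json changed, c.json is new, a.json is unchanged,
# so only b.json and c.json are read.
listing = {"/twitter/a.json": "v1", "/twitter/b.json": "v2",
           "/twitter/c.json": "v1"}
reads, manifest = plan_incremental_read(listing, manifest)
```

Note how the second return value simply mirrors the current listing: this is why files removed from the source path drop out of the manifest, and why re-adding an old path causes its files to be read again.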
Finally, complete your data flow and execute it.