The right tool for ingesting client files from SFTP into your own database on a schedule is an ETL or data integration platform that supports SFTP as a source, offers configurable cron-based scheduling, and can write to your target database as a destination. Purpose-built integration tools handle file pattern matching, schema mapping, transformation, error alerting, and post-load archiving as part of a single repeatable pipeline.
Setting this up manually with shell scripts or cron jobs works at small scale, but breaks down when clients multiply, file formats drift, or load failures go unnoticed. A structured pipeline gives you visibility, reliability, and the ability to recover without manual intervention.
The Problem
Teams that receive files from external clients over SFTP often start with ad hoc scripts: a cron job that downloads whatever lands in a directory, a one-off script that parses the CSV and inserts rows, and a Slack message when something breaks. This approach accumulates debt fast.
Files arrive at unexpected times. Column names shift without notice. A client sends a malformed row and the entire load fails silently. New clients require new scripts. No one knows which files have already been processed. The result is duplicated data, missed loads, and hours spent debugging instead of analyzing.
A proper scheduled ingestion pipeline solves each of these problems with dedicated configuration rather than custom code.
What You'll Need
- SFTP server hostname, port, username, and either a password or private key
- Read access to the directories where client files are deposited
- A destination database with write credentials (PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, or similar)
- A data integration platform that supports SFTP sources and scheduled pipeline execution
- A defined schema for the destination table or tables
- A directory or path on the SFTP server designated for archiving processed files
How to Ingest Client Files from SFTP into a Database on a Schedule: Step by Step
Step 1: Set Up SFTP Connectivity and Store Credentials Securely
This step establishes a verified, authenticated connection between your integration platform and the SFTP server. Before any file can be read, the platform needs to reach the server and confirm it has the right access level.
What to do:
- In your integration platform, create a new SFTP source connection using the server hostname and port (default is 22 for SFTP).
- Choose your authentication method: password-based authentication works but private key authentication is more secure for automated pipelines.
- If using a private key, paste or upload the private key contents into the credential store rather than embedding them in pipeline configuration files.
- Run a connection test from within the platform before moving on; confirm the platform can list files in the target directory.
- Store all credentials in a secrets manager or the platform's built-in credential vault, not in pipeline code or environment variables that get checked into version control.
Output of this step: A saved, tested SFTP connection that the pipeline can reuse across multiple scheduled runs without re-entering credentials.
Where Integrate.io helps: Integrate.io includes an SFTP connector with built-in credential storage, so private keys and passwords are encrypted at rest and referenced by name in pipeline configuration rather than stored inline.
Step 2: Define Your File Selection Pattern Using Wildcards and Directory Structure
This step tells the pipeline which files to pick up on each run. When multiple clients deposit files into subdirectories, or when files are named with timestamps, a well-constructed file pattern prevents the pipeline from loading the wrong files or re-loading files it already processed.
What to do:
- Map out the directory structure on the SFTP server. A common layout is /clients/{client_name}/incoming/ where each client has its own subdirectory.
- Decide whether the pipeline processes one client directory per pipeline or uses a wildcard pattern to sweep across all client directories in one pass. Separate pipelines per client are easier to debug; wildcard sweeps reduce the number of pipelines to maintain.
- Use glob-style wildcards to match files by name pattern. For example, orders_*.csv picks up any CSV file starting with "orders_", while _2026-.csv targets files with a year in the filename.
- If files arrive with timestamps in the name, use a date-aware pattern or configure the platform to filter by file modification time so only files deposited since the last run are picked up.
- Exclude any files that are already in the archive directory by pointing your file selection at the incoming/ path only.
Output of this step: A file selection pattern that reliably targets the correct files on each scheduled run without manual intervention or directory browsing.
Step 3: Map Source Fields to Your Destination Database Schema
This step connects the columns in the incoming files to the columns in your destination table. Getting this mapping right prevents type mismatches, missing fields, and failed inserts that are hard to trace back to their source.
What to do:
- Load a sample file from the SFTP server into the platform's schema inference tool to auto-detect column names and inferred data types.
- Compare the inferred schema against your destination table definition. Note any columns in the source that do not exist in the destination, and decide whether to include, exclude, or reroute them.
- Map each source column explicitly to its destination column rather than relying on positional or name-match auto-mapping. Clients frequently rename columns between file versions.
- Define the data type for each destination column clearly: dates as DATE or TIMESTAMP, monetary amounts as DECIMAL with appropriate precision, identifiers as VARCHAR or BIGINT depending on format.
- Add a source_filename or ingestion_timestamp column to the destination table so every loaded row carries metadata about where it came from and when it arrived.
Output of this step: A saved field mapping that links every relevant source column to a destination column with the correct data type.
Where Integrate.io helps: Integrate.io's visual field mapper lets you drag source columns to destination columns and set type overrides, which is faster than writing mapping configuration by hand and easier to audit when schemas change.
Step 4: Configure Transformation and Validation Rules
This step adds logic between the source read and the destination write. Raw client files almost always need cleaning: inconsistent date formats, extra whitespace, missing required fields, or values that fall outside acceptable ranges.
What to do:
- Add a date normalization transformation for any date columns where clients might send MM/DD/YYYY, YYYY-MM-DD, or DD-Mon-YYYY interchangeably. Standardize everything to ISO 8601 format before writing to the database.
- Strip leading and trailing whitespace from text columns. A client ID with a trailing space will not match the same client ID in your lookup table.
- Add a null check for columns that are required in the destination schema. If a required field is empty, route the row to a rejection log rather than letting the insert fail and halt the entire load.
- Apply any business-specific transformations: currency conversion, unit normalization, category code mapping, or splitting a full name column into first and last name fields.
- Use a row-level filter to exclude rows that match known junk patterns, such as header rows that appear mid-file or test records flagged with a specific value.
Output of this step: A transformation chain that converts raw client file data into clean, validated rows ready for insertion into the destination database.
Step 5: Set the Ingestion Schedule
This step defines when the pipeline runs automatically. The schedule should align with when clients typically deposit files and when the destination database needs to have fresh data available.
What to do:
- Decide on the appropriate run frequency: hourly polling, a fixed daily time, or multiple times per day. Match this to client SLAs and downstream reporting needs.
- Configure the schedule using cron syntax or your platform's visual scheduler. A daily run at 6:00 AM UTC, for example, would be expressed as: minute 0, hour 6, every day.
- If clients deposit files at unpredictable times, consider a short polling interval (every 15 or 30 minutes) combined with a skip-if-no-new-files setting to avoid unnecessary processing overhead.
- Set a pipeline timeout so a run that hangs on a network issue does not block the next scheduled execution.
- Enable concurrency controls so that if one run is still processing when the next is scheduled to start, the second run waits rather than creating duplicate loads.
- Document the schedule in your team's runbook alongside the expected file arrival window so on-call engineers know what is normal.
Output of this step: A pipeline that runs automatically on a defined cadence and skips or waits gracefully when files are absent or a previous run is still active.
Where Integrate.io helps: Integrate.io supports cron-based scheduling natively, with options to trigger runs on a fixed schedule or via an API call, making it straightforward to align ingestion timing with upstream client delivery windows.
Step 6: Configure Failure Alerts and Archive Processed Files
This step closes the loop on each successful (or failed) run. Without alerting, failed loads go unnoticed until a downstream report surfaces missing data. Without archiving, the pipeline has no way to distinguish files it has already processed from new arrivals.
What to do:
- Configure an alert that fires on pipeline failure and sends a notification to a shared team channel or on-call system. Include the pipeline name, the run timestamp, and the error message in the alert payload.
- Set up a separate alert for partial loads: runs that completed but rejected rows into the rejection log. A partial load is not a failure, but it needs human review.
- After a successful load, move or copy each processed file from the incoming/ directory to an archive/ directory on the SFTP server. Name the archive path clearly, for example /clients/{client_name}/archive/YYYY-MM-DD/, so files are organized by processing date.
- Set a retention policy for the archive directory. Keeping 90 days of raw files is a common standard; adjust based on your compliance requirements.
- If the SFTP server does not support file move operations natively, use a post-load step that writes a .processed marker file alongside each original file, and configure the file selection pattern in Step 2 to skip files that have a matching marker.
- Test the failure alert by intentionally misconfiguring the destination credentials, triggering a run, and confirming the alert fires within the expected window.
Output of this step: A pipeline that alerts on failure or partial success, archives every processed file for auditability, and leaves the incoming directory clean for the next run.
Common Mistakes to Avoid
-
Re-processing already-loaded files: Without an archive step or a processed-file marker, every scheduled run picks up the same files again and loads duplicate rows. Move or mark files immediately after a successful load.
-
Relying on auto-detected data types without review: Schema inference tools guess types based on the first N rows. A column that looks like an integer in the sample might contain text values further down. Always review inferred types against the full file spec.
-
Using a single pipeline for all clients with no per-client logging: When a multi-client wildcard pipeline fails, it is hard to tell which client's file caused the problem. Add a client identifier to every log line or use per-client pipelines for easier isolation.
-
Ignoring partial load alerts: A run that completes with rejected rows can look like a success in summary dashboards. Configure a separate alert for rejection log entries so partial loads get the same attention as full failures.
-
Hardcoding file path assumptions: Client file names and directory structures change. Build file paths from configurable parameters rather than hardcoded strings so updates do not require pipeline edits and redeployment.
-
Not testing the archive step before go-live: Teams often test the load path thoroughly but skip validating the archive behavior. Run end-to-end tests that confirm files land in the archive directory and do not reappear in the next scheduled run.
Conclusion
Ingesting client files from SFTP into a database on a schedule requires more than a cron job and a download script. A complete pipeline covers file selection, schema mapping, transformation and validation, scheduling, failure alerting, and post-load archiving. With all six components in place, the pipeline runs reliably without manual intervention, and failures surface quickly enough to fix before downstream processes are affected.
Readers who follow this guide now have a repeatable, auditable workflow that handles multiple clients, changing file formats, and partial load scenarios without custom code for each case. Integrate.io supports this end-to-end flow with native SFTP connectivity, visual field mapping, and cron-based scheduling, which reduces the time from initial setup to a working scheduled pipeline.
The next step is to extend this pattern to additional file formats or destinations as client needs evolve, since a well-structured pipeline configuration is easy to adapt once the foundation is in place.