What is Data Transfer?

Data transfer is the process of copying data from one location to another. The transferred data may be transformed in transit, or arrive at its destination as-is.

When the transfer leaves two live copies of the data, it is known as data replication. If the original source is to be retired once the transfer is complete, the process is called data migration.

How Is a Data Transfer Performed?

A data transfer involves at least two steps. First, data is obtained from the original source, which is called extraction. After that, the data is written to the target destination, a process known as loading. These steps can be performed manually or automatically.
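
As a minimal sketch of these two steps, the Python below extracts rows from CSV text and loads them into an in-memory SQLite table. The `customers` table, its columns, and the sample data are hypothetical and chosen only for illustration; a real source and destination would differ.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a query result.
SOURCE_CSV = "id,name\n1,Ada\n2,Grace\n"


def extract(csv_text):
    """Extraction: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load(rows, conn):
    """Loading: write the extracted rows to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(int(row["id"]), row["name"]) for row in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the destination database
rows = extract(SOURCE_CSV)
load(rows, conn)
loaded = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

Here `loaded` holds the rows as they arrived at the destination, so it can be compared against the source to confirm that nothing was dropped.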

Manual Data Transfer

For one-off jobs, data owners may choose to do a manual data transfer. The process for doing so depends on the nature of both the source and destination. Some options include:

API call: Many systems have a set of APIs that allow data retrieval. Data is usually exported as a file, such as a JSON, XML, or CSV file.
Manual export: Some legacy systems might only allow data export through a built-in export function. The output will typically be a flat file, such as CSV.
Coding: In some instances, there might be a need to write a small application to pull data from a data source. This application will often be written in Python or R.
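
To illustrate the coding option, here is a small Python sketch that takes a JSON payload of the kind an API might return and writes it out as CSV text for export. The payload and field names are hypothetical, and no real API is called; an actual script would fetch the payload over HTTP and write the result to a file.

```python
import csv
import io
import json

# Hypothetical API response body; a real script would fetch this over HTTP.
API_RESPONSE = (
    '[{"id": 1, "email": "a@example.com"},'
    ' {"id": 2, "email": "b@example.com"}]'
)


def json_to_csv(payload, fieldnames):
    """Convert a JSON array of objects into CSV text for export."""
    records = json.loads(payload)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


export_text = json_to_csv(API_RESPONSE, ["id", "email"])
```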

The output file is transferred to a location where it is accessible by the destination database. If the export file is to leave the organization’s security perimeter, this transfer must be done in a way that complies with security best practices.

Manual transfers can be partly automated using batch files and cron jobs. Full automation generally requires an ETL (Extract, Transform, Load) platform.

Automated Data Transfer

A data pipeline is a software process that automatically transfers data from source to destination. ETL platforms are often used to implement data pipelines.

The data pipeline is integrated with the data sources, often using the ETL platform’s built-in library of integrations. Extracted data is passed through a transformation layer, ensuring that the transported data is compatible with the destination structure. Transformation can also remove invalid data from the transfer.
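
A transformation layer can be sketched as a function that reshapes each record for the destination and rejects records that fail validation. The field names, the date format, and the validation rules below are assumptions made for the example, not the behavior of any particular ETL platform.

```python
from datetime import datetime


def transform(record):
    """Return a record shaped for the destination, or None if it is invalid."""
    try:
        email = record["email"].strip().lower()
        if "@" not in email:
            raise ValueError("invalid email")
        return {
            "customer_id": int(record["id"]),
            "email": email,
            # Assumed source format DD/MM/YYYY, converted to ISO 8601.
            "signup_date": datetime.strptime(record["signup"], "%d/%m/%Y")
            .date()
            .isoformat(),
        }
    except (KeyError, ValueError, TypeError, AttributeError):
        return None


extracted = [
    {"id": "1", "email": " Ada@Example.com ", "signup": "03/02/2024"},
    {"id": "x", "email": "bad", "signup": "03/02/2024"},           # invalid id and email
    {"id": "3", "email": "c@example.com", "signup": "2024-02-03"},  # wrong date format
]

# Keep only the records that passed validation.
clean = [row for row in map(transform, extracted) if row is not None]
```

Only the first record survives. A production pipeline would normally log or quarantine the rejected records rather than silently dropping them.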

Finally, data is loaded to the destination. This can be done in two ways.

Asynchronous transfer: Data transfer happens on a regular schedule. Usually, the transfer job is set to run at night or whenever the network is at its least busy. This is the most resource-efficient approach, but it means that data is not always in sync between source and destination.
Synchronous transfer: Data is transferred whenever the source is updated. The two databases are synced in real time, so the destination always holds current data. This method can be more resource-intensive.

An ETL-driven data pipeline may have a mix of synchronous and asynchronous transfers, with different schedules for each source.
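
The difference between the two loading modes can be shown with a toy model. In this sketch, which uses plain Python dictionaries as stand-ins for databases, one destination is refreshed only when a scheduled job runs and the other is updated on every write to the source. The `Source` class and all names are hypothetical.

```python
class Source:
    """A toy data source that notifies subscribers when a record changes."""

    def __init__(self):
        self.data = {}
        self.subscribers = []

    def write(self, key, value):
        self.data[key] = value
        for callback in self.subscribers:
            callback(key, value)  # transfer triggered by the update


def scheduled_sync(source, destination):
    """Scheduled transfer: copy the current state whenever the job runs."""
    destination.clear()
    destination.update(source.data)


source = Source()
nightly_copy = {}   # destination refreshed by a scheduled job
realtime_copy = {}  # destination updated on every write


def push_update(key, value):
    realtime_copy[key] = value


source.subscribers.append(push_update)

source.write("order-1", "pending")
scheduled_sync(source, nightly_copy)  # the scheduled job runs
source.write("order-1", "shipped")    # the source changes after the job

stale = nightly_copy["order-1"]   # still "pending" until the next run
fresh = realtime_copy["order-1"]  # already "shipped"
```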

What Are the Most Important Considerations in Data Transfer?

Any data transfer comes with a certain degree of risk. There is the risk of data loss, the risk of data corruption, and potential exposure to third parties.
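
One common way to detect loss or corruption is to compare a checksum of the data before and after the transfer. The sketch below uses SHA-256 from Python's standard library; the payload is hypothetical, and in practice the sender would publish the digest and the receiver would compute it on the received copy.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_transfer(sent: bytes, received: bytes) -> bool:
    """True if the received bytes match what was sent."""
    return sha256_hex(sent) == sha256_hex(received)


original = b"id,name\n1,Ada\n2,Grace\n"
intact = verify_transfer(original, original)
truncated = verify_transfer(original, original[:-7])  # simulate a cut-off file
```

A checksum detects accidental damage but does not protect against exposure to third parties, which is a separate concern.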

When planning a data transfer, the organization must consider the following:

Security

Data is at its most vulnerable when it is in transit between locations, especially if it is traveling outside of the organization’s security perimeter. The file could be intercepted by a third party, who could extract sensitive information from the export.

In a manual transfer, the export file should always be stored in a secure, access-controlled location, such as encrypted cloud storage, and moved over an encrypted channel. Automated transfers, such as those done by an ETL platform, typically encrypt data in transit, which reduces the risk of exposure.

Availability

Data must be available to all users and processes when required. This means that the destination must be updated according to a schedule that suits business needs. The source must also remain available while in use.

When planning a data transfer, the data team must consider the user requirements at both source and destination. Asynchronous transfers generally have the least impact on the performance and availability of source data. However, if the users at the destination need real-time data, then synchronous transfer might be used instead.

Reliability

Any kind of regular data transfer must follow the schedule reliably. If the data is being used in production, a schedule disruption might cause a system failure. If the data is being archived, a disruption might result in data loss at the destination repository.

For this reason, automated data transfers are generally preferred for regular jobs. A data pipeline powered by ETL will run in the background according to schedule and send a report if issues arise. Manual transfers are more likely to go wrong and cause data loss.
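
The run-and-report behavior described above can be sketched as a retry wrapper. This is a simplified illustration, not how any specific ETL platform works; the function names, the report format, and the flaky job are all hypothetical.

```python
import time


def run_with_retries(job, attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run job(), retrying with exponential backoff, and return a report."""
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            result = job()
            return {"ok": True, "result": result, "attempts": attempt, "errors": errors}
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return {"ok": False, "result": None, "attempts": attempts, "errors": errors}


# Hypothetical flaky job: fails twice, then succeeds.
calls = {"count": 0}


def flaky_transfer():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network unavailable")
    return "transfer complete"


report = run_with_retries(flaky_transfer)
```

The returned report records how many attempts were needed and what went wrong along the way, which is the information an operator would want if the job ultimately failed.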

Efficiency

Every data transfer incurs a cost in terms of resources. When using a cloud service such as AWS, there's also a financial cost for data transfers between services. Best practice is to keep this cost as low as possible by transferring data in the most efficient way available.

Automation can help make data transfers more efficient, and the right mix of synchronous and asynchronous jobs can help maximize resources further. The challenge is to find the most efficient solution while also maintaining security, reliability, and availability.
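
One widely used efficiency technique is incremental transfer: moving only the rows that changed since the last successful run instead of the whole dataset. The sketch below assumes each row carries an `updated_at` timestamp in ISO 8601 format, which is a hypothetical schema chosen for the example.

```python
def changed_since(rows, watermark):
    """Select only rows modified after the last successful transfer."""
    # ISO 8601 strings in the same format compare correctly as text.
    return [row for row in rows if row["updated_at"] > watermark]


source_rows = [
    {"id": 1, "updated_at": "2024-01-01T09:00:00"},
    {"id": 2, "updated_at": "2024-01-02T09:00:00"},
    {"id": 3, "updated_at": "2024-01-03T09:00:00"},
]

last_transfer = "2024-01-01T12:00:00"
to_transfer = changed_since(source_rows, last_transfer)

# Record the new watermark so the next run starts where this one ended.
new_watermark = max(row["updated_at"] for row in to_transfer)
```

Here only two of the three rows are transferred. The saving grows with the size of the dataset and shrinks as the fraction of changed rows increases.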

Latency

Latency can be an unpredictable factor in database architecture, as data transfer speeds can vary according to factors such as network conditions. The impact of latency can be mitigated with careful design and attention to infrastructure issues, such as low bandwidth.

Latency can be an even bigger issue when working with Big Data. It's important to use an architecture that minimizes the transfer distance and reduces the number of network hops required, so that data can move as quickly as possible.
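
A rough way to reason about this is a simplified model in which transfer time is the number of round trips multiplied by the round-trip time, plus the payload size divided by the bandwidth. The sketch below applies that model; it ignores protocol overhead, congestion, and parallelism, and the figures are illustrative assumptions rather than measurements.

```python
def estimate_transfer_seconds(size_bytes, bandwidth_bytes_per_s, rtt_s, round_trips=1):
    """Simplified estimate: latency cost plus time to push the bytes."""
    return round_trips * rtt_s + size_bytes / bandwidth_bytes_per_s


SIZE = 100_000_000      # 100 MB payload (assumed)
BANDWIDTH = 12_500_000  # 100 Mbit/s expressed in bytes per second (assumed)
RTT = 0.05              # 50 ms round-trip time (assumed)

# The same payload sent as one batch versus 1,000 separate requests.
one_batch = estimate_transfer_seconds(SIZE, BANDWIDTH, RTT, round_trips=1)
many_requests = estimate_transfer_seconds(SIZE, BANDWIDTH, RTT, round_trips=1000)
```

Under these assumptions the single batch takes about 8 seconds and the 1,000-request version about 58 seconds, even though the same number of bytes is moved. The difference is entirely due to latency.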

Redundancy

Data transfer may create two persistent copies of data. In some cases, this might be a requirement – for example, when archiving production data, or when sharing data between systems that aren’t otherwise integrated. However, this can be inefficient if there is no requirement for a second copy of the data.

This is an issue of good data governance. The project stakeholders should have a clear understanding of the data requirements on both sides of the pipeline. If the destination doesn’t need a complete copy of the source data, then only a partial transfer is required. If one version of the data becomes obsolete, it should be deleted promptly.
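
A partial transfer can be as simple as sending only the fields the destination actually needs. In this sketch the record layout and the list of required fields are hypothetical.

```python
def select_fields(rows, required):
    """Keep only the fields the destination needs from each row."""
    return [{field: row[field] for field in required} for row in rows]


source_rows = [
    {"id": 1, "email": "a@example.com", "notes": "internal only", "score": 0.9},
    {"id": 2, "email": "b@example.com", "notes": "internal only", "score": 0.4},
]

# The destination only needs identifiers and contact details.
partial = select_fields(source_rows, ["id", "email"])
```

Leaving out unneeded fields reduces the volume transferred and avoids creating a second copy of data that the destination has no use for.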

Compliance Requirements

Transferring data can have compliance implications, especially when transferring personal information. This kind of data is covered by laws such as CCPA and GDPR, which govern how data can be processed and transferred. You may not be able to transfer personal data outside of your network or across international borders.

Transfers might sometimes involve an intermediate stage that can have compliance implications. For example, if you transfer EU data via an ETL platform based outside of Europe, you might be breaching GDPR. Make sure your provider is compliant with all relevant laws. Integrate.io operates in accordance with GDPR, CCPA, and most privacy laws that may impact U.S.-based businesses.

FAQ

How is a data transfer performed?

A data transfer involves at least two steps: extraction, where data is obtained from the original source, and loading, where it is written to the target destination. These steps can be done manually (through API calls, exports, or coding) or automatically with a data pipeline powered by ETL.

What is the difference between synchronous and asynchronous data transfer?

Asynchronous transfer moves data on a regular schedule, often overnight, which is resource-efficient but leaves source and destination out of sync between runs. Synchronous transfer moves data whenever the source updates, keeping the destination up to date in real time but using more resources.

What are the key considerations in data transfer?

Important considerations include security, since data is most vulnerable in transit; availability of both source and destination; reliability of scheduled transfers; efficiency and cost of moving data; latency, which can vary with network conditions; redundancy from extra copies; and compliance with laws such as GDPR and CCPA.
