ETL (extract, transform, load) is the backbone of modern data integration pipelines and has been around in some form since the 1970s. Organizations from tiny startups to massive multinationals depend on ETL pipelines to enjoy fresh, accurate insights that help them beat their competitors and better serve their customers.
Traditionally, companies have relied on batch processing to run their ETL workloads. In the recent past, however, we’ve seen more and more businesses switch to streaming or real-time ETL. So what’s behind this evolution from batch ETL to streaming, and how can you build your own streaming ETL pipeline?
Table of Contents
- What is Batch ETL?
- What is Streaming ETL?
- Real-Time ETL: Evolving from Batch ETL to Streaming Pipelines
- Do You Really Need a Streaming ETL Pipeline?
What is Batch ETL?
Batch ETL is the traditional way of performing ETL: in batches. Here, the term “batch” refers to any collection of data sourced within a given period of time.
First, batches of data are extracted from sources (e.g. files, databases, websites, or SaaS applications) at regular intervals. Second, the data in each batch is transformed according to various business rules, preparing it for storage in the target location. Finally, the data is loaded into the target data warehouse or data lake.
A chain of fast-food restaurants, for example, might perform batch ETL once each restaurant has closed for the night in order to calculate the daily revenue at each location. Streaming ETL would be less useful in this use case, since you likely don’t need to constantly monitor a restaurant’s revenue at every point throughout the day.
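To make the three steps concrete, here is a minimal sketch of a nightly batch job for the restaurant example. Everything in it is an assumption made for illustration: the CSV layout, the `daily_revenue` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical export of one day's sales; a real job would read files
# or query the source systems instead.
RAW_SALES_CSV = """location,amount
downtown,12.50
airport,8.00
downtown,7.25
airport,
"""

def extract(raw_csv):
    """Extract: read the whole day's batch at once."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: skip rows with no amount, then total revenue per location."""
    totals = defaultdict(float)
    for row in rows:
        if row["amount"]:
            totals[row["location"]] += float(row["amount"])
    return dict(totals)

def load(totals, conn, day):
    """Load: write one row per location into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue "
        "(day TEXT, location TEXT, revenue REAL)"
    )
    conn.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [(day, location, revenue) for location, revenue in totals.items()],
    )
    conn.commit()

def run_batch(raw_csv, conn, day):
    """Run the three steps in order for one batch."""
    load(transform(extract(raw_csv)), conn, day)

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
run_batch(RAW_SALES_CSV, conn, "2024-01-01")
```

In practice a scheduler would call something like `run_batch` once per night after closing time; nothing is computed between runs.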
The benefits of batch ETL include:
- Simplicity: Batch ETL is typically simpler to implement and execute, which makes it preferable when you don’t require up-to-the-minute insights.
- Legacy support: Many legacy systems are not compatible with streaming ETL and are only able to work with batch processing. If you have legacy technology in your IT environment that you need to support, then batch processing will need to be at least part of your ETL pipeline.
What is Streaming ETL?
Streaming ETL, sometimes called real-time ETL or stream processing, is an ETL alternative in which information is ingested as soon as it’s made available by a data source.
The architecture of a streaming ETL process differs from that of batch ETL, in which information is extracted from one or more sources, transformed, and then loaded into the destination (sometimes called a sink). Streaming ETL instead relies on a stream processing platform that sits between the data sources and their final destination and handles each piece of data as it arrives.
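The intermediary role described above can be sketched in a few lines. This is only an illustration under stated assumptions: a standard-library `queue.Queue` stands in for the stream processing platform, a plain list stands in for the destination, and the sensor readings and function names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of this demo stream

def source(stream, readings):
    """Data source: publishes each reading to the stream as it is produced."""
    for reading in readings:
        stream.put(reading)
    stream.put(STOP)

def transform(reading):
    """Transform a single event: convert Celsius to Fahrenheit."""
    return {"sensor": reading["sensor"], "temp_f": reading["temp_c"] * 9 / 5 + 32}

def consumer(stream, sink):
    """Stream processor: handles events one at a time and loads them into the sink."""
    while True:
        event = stream.get()
        if event is STOP:
            break
        sink.append(transform(event))

stream = queue.Queue()  # stand-in for the stream processing platform
sink = []               # stand-in for the destination
readings = [{"sensor": "a", "temp_c": 20.0}, {"sensor": "b", "temp_c": 100.0}]

worker = threading.Thread(target=consumer, args=(stream, sink))
worker.start()
source(stream, readings)
worker.join()
```

The point of the design is that the source and the consumer never talk to each other directly; each one only knows about the stream in the middle.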
A growing number of companies consider streaming ETL to be a better fit for the modern business landscape. If you rely heavily on sources such as social media and Internet of Things (IoT) sensors, you truly are working with never-ending streams of information—users are constantly making new social media posts, and IoT devices are constantly transmitting new data. In this case, batch ETL might be too slow and unwieldy to deal with the massive onslaught of information.
Fraud detection is one example where streaming ETL is a much smarter choice than batch processing. Whenever a credit card owner makes a transaction, the credit card company needs to analyze it for suspected fraud and alert the owner within minutes if the transaction is suspicious: for instance, if a purchase was made at an unusual store, in a different state or foreign country, or at an unusual time of day. Since batch ETL might be conducted on a timescale of hours or even days, it would be far too slow to catch fraudsters in the act and prevent further abuse.
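As a rough illustration of per-transaction checks like these, the sketch below flags a transaction the moment it is seen rather than waiting for a nightly job. The rules, the `HOME_STATE` profile, and the `USUAL_HOURS` range are hypothetical placeholders; a real fraud system would use far richer signals.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    card_id: str
    state: str
    hour: int  # local hour of day, 0-23

# Hypothetical per-card profile; a real system would derive this from history.
HOME_STATE = {"card-1": "NY"}
USUAL_HOURS = range(7, 23)

def suspicious_reasons(tx):
    """Return the list of rules this transaction trips (empty if none)."""
    reasons = []
    home = HOME_STATE.get(tx.card_id)
    if home is not None and tx.state != home:
        reasons.append("different state")
    if tx.hour not in USUAL_HOURS:
        reasons.append("unusual time of day")
    return reasons

def process_stream(transactions):
    """Yield an alert for each suspicious transaction as soon as it is seen."""
    for tx in transactions:
        reasons = suspicious_reasons(tx)
        if reasons:
            yield tx, reasons

incoming = [Transaction("card-1", "NY", 12), Transaction("card-1", "CA", 3)]
alerts = list(process_stream(incoming))
```

Because `process_stream` is a generator, each alert is available as soon as its transaction has been examined, without waiting for the rest of the input.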
The benefits of streaming ETL include:
- Speed: With streaming ETL, you can enjoy access to fresh new data as soon as it arrives. Real-time ETL is usually the better choice when being able to react to new information is more important than ensuring that this information is 100 percent accurate.
- Cloud-based: Many streaming ETL tools leverage cloud technologies, which have their own advantages over traditional on-premises ETL: cloud-based ETL is easier to scale, often more secure, and often more cost-effective.
On a final note, the term “event-driven ETL” is essentially synonymous with streaming ETL. In event-driven ETL, the ETL process is initiated in response to some event (e.g. the arrival of a new server log or sensor reading), and the new data is ingested as soon as possible.
Real-Time ETL: Evolving from Batch ETL to Streaming Pipelines
As mentioned above, ETL has its origins in the 1970s, when businesses first began to store information in multiple databases. In order to view all of this data within a single pane of glass, these organizations needed a way to perform data integration efficiently and repeatedly—and thus ETL was born.
For most of the history of ETL, technical limitations have prevented businesses from performing real-time analytics. Now, however, the tide seems to be turning, with technical limitations stopping businesses from getting the most from their batch ETL workflows. Some organizations, for example, have so much data on hand that they simply aren’t able to process it fast enough with batch ETL. Observing the continued upward trajectory of big data, some analysts have even gone so far as to announce the “death” of batch ETL and hailed the arrival of their new stream processing overlords.
The trends that have enabled the evolution from batch to streaming ETL are as follows:
- Single-server databases are on their way out; in their place, we see the proliferation of many different types of data platforms, scattered across the enterprise. This multi-pronged architecture complicates the relative simplicity of batch ETL.
- In the same vein, we see a multiplicity of data sources, from the web, social media, and mobile to logs and IoT sensors. Batch processing is often incapable of keeping up with the heightened volume and variety of big data that modern enterprises handle.
- Businesses are increasingly seeing the appeal of having real-time analytics at their fingertips. A Harvard Business Review survey, for example, found that 60 percent of businesses agree that it is “extremely important” to deliver real-time customer interactions across different touchpoints and devices—something only possible with streaming ETL.
Rumors of the death of batch ETL have been greatly exaggerated, but it’s true that real-time stream processing is quickly gaining in adoption and popularity. Still, there’s no reason that you have to choose one or the other exclusively. You can employ a combination of batch and streaming ETL for different use cases within your IT environment, depending on which best fits your business requirements. In the next section, we’ll look at whether real-time ETL is really necessary for your organization.
Do You Really Need a Streaming ETL Pipeline?
Given the various advantages of streaming ETL, it’s no surprise that you might be considering a move to real-time ETL yourself. But is streaming ETL right for you, or will you just be adding more complexity to your IT environment for relatively little benefit?
First, if you're looking to build your own streaming ETL pipeline, what are the options available to you? There’s no shortage of real-time streaming platforms that you can use as the backbone for your real-time ETL pipeline. Open-source tools such as Apache Kafka, which includes the Kafka Streams library for building streaming ETL pipelines, are excellent (and cost-effective) options. However, they come with their own drawbacks:
- Open-source tools aren’t always the most intuitive or user-friendly of offerings, which means that you’ll need to hire in-house developers with experience using them. This represents a large cost that you might not immediately think of when you hear the words “free and open-source.”
- Finding help with support and maintenance for open-source tools, despite the healthy user community, will be more difficult than working with a third-party ETL provider.
There are also a few more issues to consider before you decide that a move to real-time ETL is right for you. For example, while speed is often cited as a benefit of streaming ETL, it can also be a drawback, because it leaves less time to catch and correct problems in the data. According to a survey by Kaggle, the top 5 most common data science challenges include "dirty data" (data that is incomplete, duplicate, or inaccurate) as well as challenges with the availability of or access to the data you need. Any enterprise-grade streaming ETL solution must be able to deal with this dirty data and deliver insights to the people who need them quickly and efficiently, all while handling massive quantities of information in real time.
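One common way to cope with dirty data in a stream is to validate each record as it arrives and set aside the ones that fail, instead of halting the pipeline. The sketch below checks for the three kinds of dirty data named above; the record fields (`id`, `amount`) and the rule that a negative amount is inaccurate are assumptions made purely for illustration.

```python
def validate(record, seen_ids):
    """Return a reason string if the record is dirty, else None."""
    if record.get("id") is None or record.get("amount") is None:
        return "incomplete"
    if record["id"] in seen_ids:
        return "duplicate"
    if record["amount"] < 0:  # assumed business rule for this example
        return "inaccurate"
    return None

def clean_stream(records):
    """Split incoming records into clean ones and rejected (reason, record) pairs."""
    seen_ids = set()
    clean, rejected = [], []
    for record in records:
        reason = validate(record, seen_ids)
        if reason is None:
            seen_ids.add(record["id"])
            clean.append(record)
        else:
            rejected.append((reason, record))
    return clean, rejected

records = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate
    {"id": 2, "amount": None},   # incomplete
    {"id": 3, "amount": -5.0},   # inaccurate under the assumed rule
    {"id": 4, "amount": 2.5},
]
clean, rejected = clean_stream(records)
```

Note that `seen_ids` grows without bound here; on a truly never-ending stream you would need to cap it, for example by only remembering recent identifiers.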
In her talk "Personalizing Netflix with Streaming Datasets," Netflix senior data engineer Shriya Arora cautioned would-be users of real-time ETL against the impulse to "stream all the things." The potential downsides of using streaming ETL include:
- Uncharted ground: Batch ETL has been around for decades, while real-time ETL is relatively uncharted territory. This means that using real-time ETL for certain use cases and applications (e.g. machine learning) will require a lot more effort and experimentation than batch ETL.
- Fault-tolerance requirements: Real-time ETL systems collect data 24/7, so they have to be constantly available in order to avoid irretrievable data loss. This requires a highly fault-tolerant infrastructure with constant monitoring and alerts, and it makes any outages or performance issues much more urgent with streaming ETL than with batch ETL.
- Accuracy and quality: Using batch ETL gives you more time to ensure that you're working with high-quality, highly accurate data that has already been reconciled among your various sources. Streaming data, on the other hand, must be processed almost instantaneously, which means that you need a way to guarantee its quality and accuracy in spite of this speed.
- Recovery and repair: Recovering from crashes with batch ETL is relatively easy: you simply need to rerun the job, as the data can still be pulled from the original sources. With streaming ETL, however, the data may no longer be available unless you've made a backup somewhere.
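The recovery problem in the last bullet is often addressed by writing every raw event to a durable log before processing it, and recording how far processing has got. The sketch below illustrates that idea under simplifying assumptions: a JSON-lines file stands in for the durable log, a dictionary stands in for the saved offset, and the `EventLog` class, `process` function, and `fail_at` parameter are all hypothetical names.

```python
import json
import os
import tempfile

class EventLog:
    """Append-only JSON-lines file standing in for a durable, replayable log."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def read_from(self, offset):
        """Yield (offset, event) pairs starting at the given line offset."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= offset:
                    yield i, json.loads(line)

def process(log, state, sink, fail_at=None):
    """Process events from the saved offset onward, optionally simulating a crash."""
    for offset, event in log.read_from(state["offset"]):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        sink.append(event["value"] * 2)
        state["offset"] = offset + 1  # advance only after the event is loaded

path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
log = EventLog(path)
for value in (1, 2, 3):
    log.append({"value": value})

state = {"offset": 0}
sink = []
try:
    process(log, state, sink, fail_at=1)  # crashes on the second event
except RuntimeError:
    pass
process(log, state, sink)  # resumes from the saved offset
```

After the restart, processing picks up at the event that failed, so nothing is lost and nothing already loaded is processed twice in this simple case.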
Despite the disadvantages of streaming ETL, there are still compelling reasons to use it, especially when the data you're working with really is arriving in real time. You might therefore look for a compromise between batch and streaming ETL. One option is to run both side by side, so that each approach's strengths compensate for the other's weaknesses.
Another solution is to use "micro-batch" ETL in an attempt to strike a happy medium between streaming and batch ETL. In micro-batch processing, data is collected more frequently than in traditional batch processing (e.g. anywhere from every few minutes to every few hours) and in smaller quantities known as micro-batches.
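A minimal sketch of micro-batching follows, assuming events arrive as `(timestamp, value)` pairs already sorted by timestamp. The `micro_batches` function and the five-minute window are hypothetical choices for illustration, not a reference to any particular tool.

```python
from itertools import groupby

def micro_batches(events, window_seconds):
    """Group time-ordered (timestamp, value) events into fixed-size time windows."""
    def window_start(event):
        timestamp, _ = event
        return timestamp - (timestamp % window_seconds)

    # groupby only merges adjacent items, which is why the input must be sorted.
    for start, group in groupby(events, key=window_start):
        yield start, [value for _, value in group]

# Timestamps are seconds since the start of the demo.
events = [(0, "a"), (120, "b"), (299, "c"), (300, "d"), (650, "e")]
batches = list(micro_batches(events, window_seconds=300))
```

Each yielded batch can then be handed to an ordinary batch-style transform and load step, just on a much shorter schedule.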
Micro-batching can be an effective solution when you want results sooner than you're currently getting them, but the use case doesn't necessarily justify a move to streaming ETL. One good use case for micro-batch ETL might be an e-commerce website that wants to test the effects of a major change to its website (e.g. a user interface upgrade). Although fetching data in real time isn't strictly necessary, getting fresh insights is still important: if something goes wrong, it will have an impact on user behavior very quickly, and developers will need to know about it so that the changes can be rolled back before the business loses significant revenue.
No matter which solution you choose to go with, it's clear that batch ETL has its uses and is here to stay. That's why Integrate.io has built a cloud-based data integration platform for batch ETL. The Integrate.io platform comes with a simple drag-and-drop interface and more than 100 pre-built connectors, making it easy for anyone to build robust ETL pipelines to get the insights they need.
When it comes to the evolution from batch ETL to streaming, here’s what you need to know:
- Batch ETL refers to the processing of data in batches at scheduled intervals, while streaming ETL handles data in real time (or nearly so).
- Although batch ETL is more straightforward and compatible with legacy systems, streaming ETL gives you up-to-the-minute information and is necessary for use cases such as fraud detection and payment processing.
- Various technological developments have increased the popularity of streaming ETL, although batch ETL is still the right choice in many circumstances.
- In most cases, a move exclusively to real-time ETL isn't necessary: you can keep getting the insights you need with batch ETL, or with a combination of batch and streaming.
Looking for a mature, feature-rich, intuitive batch ETL solution? Give Integrate.io a try. Our powerful, user-friendly data integration platform makes it easy to unite your data sources and build robust pipelines to your choice of a data warehouse or data lake, including Amazon Redshift, Google BigQuery, Microsoft Azure, Snowflake, and more. Schedule a call with our team of ETL experts today for a chat about your business needs and objectives, or to sign up for your free trial of the Integrate.io platform.