ETL is a foundational technology in enterprise database structures. It's used by millions of businesses every day. But that doesn't stop people from asking "Is ETL dead?"
The reason for the concern stems from the proliferation of more flexible Big Data tools. ETL is synonymous with data warehouse technology, but some industry players prefer more flexible models, such as the wild, unstructured data lake.
So, is ETL dead? Can businesses pull down the shutters on their data warehouses?
Not at all.
In this post, you'll find out why ETL is more relevant than ever, and what ETL can do for you.
- What is ETL?
- Why Do People Keep Asking "Is ETL Dead?"
- Is ETL Dead? No.
- ETL In Action
- The Future of ETL
- ETL with Integrate.io
What is ETL?
For those who are unfamiliar, ETL stands for Extract, Transform, Load. It is a software process used to fill data warehouses with information in three simple steps:
- Extract: The process pulls data from multiple sources.
- Transform: The incoming data passes through a transformation layer.
- Load: The ETL process pushes data to its destination.
The basis of modern ETL comes from data warehousing methodologies that date back to the 1960s. ETL is the process of gathering raw data, like that from production systems. Once collected, the data is transformed into a more readable, understandable format that conforms to business terminology and entities. The transformed and cleaned data is then loaded into a data repository of some sort, usually a relational database.
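The three steps above can be sketched in a few lines of code. This is a minimal illustration, not a real pipeline: the source lists, field names, and the in-memory "warehouse" list are all hypothetical stand-ins for production systems.

```python
def extract(sources):
    """Extract: pull raw rows from multiple sources into one list."""
    rows = []
    for source in sources:
        rows.extend(source)
    return rows

def transform(rows):
    """Transform: normalize field names and clean up values."""
    return [
        {"customer": r["name"].strip().title(), "amount": round(float(r["amt"]), 2)}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: push the transformed rows into the destination."""
    warehouse.extend(rows)
    return warehouse

# Two hypothetical source systems with inconsistent formatting.
crm = [{"name": "  alice smith ", "amt": "19.991"}]
erp = [{"name": "BOB JONES", "amt": "42"}]

warehouse = []
load(transform(extract([crm, erp])), warehouse)
# warehouse now holds clean, uniform rows from both sources.
```

The key point is that every row, whatever its source, passes through the same transformation before it reaches the destination.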
This is how the data architecture world has worked for the past 40 years, since the advent of the relational database and data warehousing: raw data transformed by ETL processes and moved to the data warehouse for consumption.
Why Do People Keep Asking "Is ETL Dead?"
Technology has something of a "recency bias." It's like the opposite of the "if it ain't broke, don't fix it" philosophy. When something in tech is familiar and established, we automatically assume that it's next on the chopping block of disruption. As a 40-year-old technology, that is the most likely reason that some people talk as if ETL is obsolete.
The truth is, ETL is thriving. However, there are also some serious alternatives to the ETL approach that, while they aren't full substitutes, can accommodate specific needs.
Big data-friendly file systems like Hadoop allow us to store large amounts of data on fairly inexpensive storage platforms. There are definitely advantages to this approach. Hadoop, Spark, and similar systems are easy to scale up, making them ideal for massive data repositories. The ability to store structured and unstructured data side-by-side is also very convenient.
The drawback of systems like Hadoop is that they are inefficient. Not only that, but they are inefficient by design. Hadoop runs on very cheap hardware, which means it has a ton of redundancy in case any of its clusters fail. The thinking is that the savings on data transformation offset the inefficiency of a data lake.
Dynamic Business Requirements
Our organizational requirements around data are always evolving. We can't build static, inflexible pipelines that go straight from production databases to a data warehouse. Customers change every day. The world changes every day. That means that our data needs also change every day.
The data lake approach means that you don't have to think about this right now. You will have to think about it, eventually. But ETL forces you to consider your requirements before you start moving data.
Relational databases have a schema, which is the structure that holds all data. The magic of ETL is that it transforms data from one schema to another. This is useful when you have a bunch of source databases that you want to integrate into a single format. There's just one problem: you have to be able to model the data. The bigger and more complex your data, the harder it is to perform decent data modeling.
That's why it is sometimes easier to just dump everything in a data lake and sort it out later. This is often referred to as schema-on-read, as opposed to ETL's schema-on-write.
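The schema-on-write idea can be illustrated with a small sketch: two source systems store the same entity under different schemas, and the transform maps both onto one target schema before loading. The field names and mappings here are hypothetical.

```python
def to_target_schema(row, mapping):
    """Remap a source row's keys onto the target schema, dropping extras."""
    return {target: row[source] for target, source in mapping.items()}

# Each source database gets its own mapping onto the shared target schema.
crm_mapping = {"customer_id": "id", "email": "email_address"}
billing_mapping = {"customer_id": "cust_no", "email": "contact_email"}

crm_row = {"id": 7, "email_address": "a@example.com", "phone": "555-0100"}
billing_row = {"cust_no": 7, "contact_email": "a@example.com"}

unified = [
    to_target_schema(crm_row, crm_mapping),
    to_target_schema(billing_row, billing_mapping),
]
# Both rows now share one schema and can live in the same warehouse table.
```

The hard part in practice is not the remapping itself but deciding on the target schema, which is exactly the data modeling effort described above.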
Is ETL Dead? No.
You might still ask, "is ETL dead?" especially if you've used older ETL tools.
Modern ETL platforms like Integrate.io are nothing like their ancestors. These are powerful engines of digital success, powering data pipelines with their extensive integration libraries and AI-powered interfaces. Let's look at some of the main features.
More Efficient Storage Usage
Data lakes are cheap, but they're not free. If you dump things in there without filtration, you will eventually end up with what's known as a data swamp.
Data swamps hold tons of stagnant data that offers no value, but for which you still have to pay hosting costs. Sorting the useful data from the garbage is almost impossible in a data swamp. In the long term, this may impact performance.
The ETL model offers a much more efficient long-term storage. Everything in your data warehouse is clean, reliable, and relevant to your business. More importantly, it's available when you need it.
Fast and Accurate Analytics
Data analytics is the killer app of Big Data. The insights you glean from analytics can help you win customers, identify inefficiency, and even identify potentially lucrative new markets.
But as in all things, there are some trade-offs. Data lakes allow you to perform analytics on vast quantities of data, but all of this has to pass through some kind of sorting process, such as MapReduce. At scale, this can have a significant resource cost.
Data warehouses guarantee that your BI and analytics are as fast and accurate as possible. Your analysts are working with data that's cleansed and sorted, thanks to ETL.
No-Code and Low-Code Configuration
The concept of ETL might be 40 years old, but today's ETL tools are right on the cutting edge. We now live in the age of no-code/low-code ETL, where even the most complex transformations only need a few clicks.
This technology means businesses can adapt and evolve at the same speed as their data. More importantly, it means that ETL configuration isn't the sole preserve of coders and SQL geeks. Other business users can get involved and make sure that the data pipeline is fully optimized.
ETL tools also allow you to build complex and more dynamic pipelines. You can create rules and exceptions that cover every possible input, ensuring that everything keeps running smoothly without needing manual intervention.
Privacy, Security, and Compliance
Data isn't just a pile of ones and zeroes. Data describes real-world entities, many of which are very sensitive, such as your financial records and employee files. Most importantly, data describes your customers, who are trusting you to keep that data safe.
ETL gives you a secure point-to-point pipeline for important data. An automated ETL platform will move your data across networks without ever exposing it to prying eyes. Other processes can expose your data in several ways, including through the creation of unencrypted admin logs with metadata.
The transformation layer in ETL can also perform vital compliance tasks, such as data obfuscation. This essentially means hiding any sensitive information in your database, so that it's not exposed when it arrives at its destination. Obfuscation is useful when creating a test database or exporting data for analytics purposes.
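Here is a minimal sketch of obfuscation in the transform layer: sensitive fields are hashed before the row reaches its destination. The field names and the choice of masking rule are assumptions for illustration only.

```python
import hashlib

# Hypothetical list of fields that must never leave the pipeline in the clear.
SENSITIVE = {"ssn", "card_number"}

def obfuscate(row):
    """Replace sensitive values with a truncated one-way hash."""
    out = {}
    for key, value in row.items():
        if key in SENSITIVE:
            # A hash keeps the value usable as a consistent join key
            # without exposing the original data at the destination.
            out[key] = hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

record = {"name": "Alice", "ssn": "123-45-6789"}
safe = obfuscate(record)
```

Because the hash is deterministic, the same customer still matches across exported tables, which is what makes this approach workable for test databases and analytics exports.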
ETL in Action
To show how much ETL matters today, let's take a look at some sample use cases for a modern ETL platform.
ETL Use Case #1: Integrating Customer Data
ABC Ltd. has a great relationship with their customers. The marketing team uses marketing automation tools to acquire leads. Salespeople use their tools to convert those leads with targeted offers. The fulfillment team relies on ERP to get orders out quickly. If the customer calls with a query, the agent will look up their details on the CRM. The customer can log onto the website at any time and change their address or phone number.
But ABC Ltd. faces a challenge: Their processes have resulted in six different databases, each with a unique customer record. What happens if there is a discrepancy between these systems? How does someone obtain all of the company's data on one particular customer?
ETL is an elegant solution here. The ETL platform connects to all six sources and imports customer data. The transformation amends this data according to the master data, and it also resolves duplicates or inconsistencies according to business rules. Finally, the ETL pushes the definitive client record to an appropriate location, such as a data warehouse.
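The duplicate-resolution step can be sketched as follows. The "most recently updated non-empty value wins" rule is a hypothetical business rule; a real ABC Ltd. would define its own precedence.

```python
def merge_records(records):
    """Merge duplicate customer records; newest non-empty value wins."""
    merged = {}
    # Apply records oldest-first so later (newer) values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["updated"]):
        for key, value in rec.items():
            if key != "updated" and value:
                merged[key] = value
    return merged

# The same customer as seen by two of the six hypothetical systems.
from_crm = {"email": "a@old.com", "phone": "", "updated": "2023-01-01"}
from_web = {"email": "a@new.com", "phone": "555-0100", "updated": "2024-06-01"}

master = merge_records([from_crm, from_web])
# The master record keeps the newer email and the only known phone number.
```

In a full pipeline, this merge runs inside the transformation layer, after extraction from all six sources and before the load into the warehouse.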
ETL Use Case #2: Disaster Recovery Backups
As part of their disaster recovery planning, LMNCorp has decided that they require regular backups of all mission-critical systems. Each system moves at a different pace: some systems only update every other day; others update every minute.
LMNCorp wants robust, intelligent automation for their backup system, rather than relying on batch files, cron jobs, or manual backups. They also want to store their backup data on a cloud server.
ETL can connect to all of the most crucial production databases and perform backups according to a dynamic schedule. LMNCorp might choose to perform a minimal transformation on backup data so that they have a snapshot if they need it. Alternatively, they might choose to remove unnecessary data in order to speed up the recovery process.
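A dynamic schedule like this boils down to per-system intervals. In this sketch, each system declares how often it needs backing up and the ETL run selects only the systems that are due. The system names and intervals are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical per-system backup policy: fast-moving systems back up
# every minute, slow-moving ones every other day.
systems = {
    "orders_db": {"interval": timedelta(minutes=1), "last_backup": None},
    "hr_db": {"interval": timedelta(days=2), "last_backup": None},
}

def due_for_backup(systems, now):
    """Return the names of systems whose backup interval has elapsed."""
    due = []
    for name, cfg in systems.items():
        last = cfg["last_backup"]
        if last is None or now - last >= cfg["interval"]:
            due.append(name)
    return due

now = datetime(2024, 1, 1, 12, 0)
systems["orders_db"]["last_backup"] = now - timedelta(minutes=5)
systems["hr_db"]["last_backup"] = now - timedelta(hours=6)
# Only orders_db is due: five minutes exceeds its one-minute interval,
# while hr_db's two-day interval has not yet elapsed.
```

A scheduler loops over this check, and each due system then flows through the normal extract-transform-load path to cloud storage.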
ETL Use Case #3: Business Intelligence Dashboards
XYZ Inc. is launching a new product. They have a number of data-rich systems that can help them measure the product's success. These systems include sales systems, a CRM that logs customer interactions, website analytics, online survey tools, and social media sentiment analysis tools.
Each of these tools produces an insightful report, but each report runs separately. What's more, these reports are not in a standardized format. Analysts often find themselves making apples-to-oranges comparisons when looking at different reports.
ETL solves this problem by integrating data into a single source. This source is available to BI tools, which can perform complex, multi-dimensional queries. The results are available in a single dashboard that shows everything that XYZ Inc. needs to know about their business.
The Future of ETL
ETL has been evolving for 40 years and it shows no signs of stopping. Today's ETL is cloud-based, automated, and attuned to the needs of your business. Future trends will help to keep ETL alive in a new era of big data. These trends include:
- Real-time streaming ETL: ETL often relies on batch processes that run at certain intervals, such as hourly or daily. Real-time streaming ETL is a responsive process that transmits data live. So, when the target system updates, the ETL process streams it to the destination server. This allows for more accurate analytics and tighter system integration.
- ETLT: ETL or ELT? Why not both? ETLT combines both strategies into four steps: extract, basic transformation, load, and then a more detailed transformation. The ETLT process is all about prioritizing what's important, so it takes care of the most essential transformations first and then makes the data available for use. Afterward, an additional transformation process can do the final tidy-up required for long-term storage.
- Data lakehouses: We've talked a lot about the difference between data lakes and data warehouses. A data lakehouse is a best-of-both-worlds approach. The repository is a data lake, but it supports virtual warehouse structures that hold integrated data. All of this relies on AI and multiple concurrent ETL processes.
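The ETLT pattern in the list above can be sketched in miniature: a quick first transform makes data usable immediately, and a second pass does the heavier clean-up for long-term storage. The two transforms shown are hypothetical examples.

```python
def light_transform(row):
    """Pass one (the 'T' before load): just drop obviously bad rows."""
    return row if row.get("amount") is not None else None

def heavy_transform(row):
    """Pass two (the 'T' after load): detailed tidy-up for storage."""
    return {
        "amount": round(float(row["amount"]), 2),
        "currency": row.get("currency", "usd").upper(),
    }

raw = [{"amount": "10.5", "currency": "usd"}, {"amount": None}]

# Extract -> light Transform -> Load into a staging area...
staged = [r for r in (light_transform(r) for r in raw) if r]
# ...then the second Transform produces the final warehouse rows.
final = [heavy_transform(r) for r in staged]
```

The essential transformation happens before the load, so the data is available for use quickly; the detailed transformation can run afterward without blocking anyone.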
ETL with Integrate.io
"Is ETL dead?" is entirely the wrong question. What you should really be asking is: "What can ETL do for me?"
Or, more accurately, "what can Integrate.io do for me?"
ETL is only a concept; Integrate.io is the most sophisticated implementation on the market today. With a massive integration library, no-code and low-code configurations, and support for the best BI tools, Integrate.io is a whole new life for ETL. To see it for yourself, get in touch with our support team today to discuss your needs and schedule a demo.