As a business analyst or DBA, you know your organization's data front to back. You've got customer orders stored, issues tracked, and the browsing habits of website visitors logged. But you've got to scale up, and the constraints of a warehouse are sometimes too tight. The schema restrictions and the tight coupling of compute and storage are just a couple of potential sources of frustration. But a data lake might be too far in the other direction. Yes, it gives data scientists and their predictive models a place to run wild, but at the cost of meaningful data for decision-makers. Is there a middle ground that spares you from maintaining (and paying the inflated capital expense of) two or more full-featured solutions?
The Warehouse vs The Lake
If your data is in a warehouse, it's structured, and rigidly so. Your data is normalized and no field contains anything too unwieldy. Because of the structure, data is immediately readable to those who know the business and usable to applications that know the connection info.
In a data lake, data is as raw and loosely bound as the water in an actual lake. Although it can contain structured data, it is more likely to be unstructured or semi-structured. Along with losing structural rigidity, moving to a lake means you also lose the ACID compliance of warehouses. For those who may not know, ACID stands for:
- A - Atomicity. A transaction either completely succeeds or completely fails (no partial successes).
- C - Consistency. Every transaction leaves the database in a valid state, with all of its constraints satisfied.
- I - Isolation. Concurrent transactions act independently of one another and do not interfere with each other's results.
- D - Durability. When a transaction is committed, it remains so, even through any subsequent system crashes.
That's just one of the reasons why, according to Databricks.com, "many of the promises of the data lakes have not materialized, and in many cases leading to a loss of many of the benefits of data warehouses."
If you're unfamiliar with one or both terms, you can get up to speed quickly by reading seven key differences between the two.
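To make the ACID list above a little more concrete, here is a minimal sketch of atomicity and consistency using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are all hypothetical, chosen only for illustration; a real warehouse would apply the same guarantees at a much larger scale.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            # Credit first, so a failed debit leaves a partial change to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits one account before the constraint rejects the debit, yet neither change survives. A plain file drop into a lake offers no comparable all-or-nothing behavior on its own.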
Enter the Data Lakehouse
A data lakehouse is a new (or at least newly popular) trend that, as the name implies, sees providers attempt to satisfy the desire for a data store with the best of both storage patterns. That means developers and data analysts get the reliability and structure found in a data warehouse with the scalability and agility of a data lake. The unstructured data is still there for AI and other data science purposes, but structured, schema-on-write data allows for quick reads. If your data flows into a lakehouse, your business functions are covered, whether you're a developer or a data scientist, performing ETL or ELT.
How do the two patterns get reconciled? Essentially, a warehouse layer exists on top of the lake, responsible for enforcing schemas for quality control and providing a foundation for BI and reporting. That layer also offers versioning, metadata controls, processing, and validation. As for the actual data underneath that layer, however, anything goes: it can be a mix of structured, semi-structured, and unstructured data.
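The layering described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular product works: the Lakehouse class, the ORDERS_SCHEMA columns, and the ingest method are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema for a curated "orders" table: column name -> required type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "total": float}


@dataclass
class Lakehouse:
    raw: list = field(default_factory=list)       # lake storage: anything goes
    curated: list = field(default_factory=list)   # warehouse layer: schema enforced
    rejected: list = field(default_factory=list)  # kept in raw, excluded from BI

    def ingest(self, record: Any) -> bool:
        """Always land the record in the lake; promote it only if it fits the schema."""
        self.raw.append(record)
        if self._conforms(record):
            self.curated.append(record)
            return True
        self.rejected.append(record)
        return False

    @staticmethod
    def _conforms(record: Any) -> bool:
        # Schema-on-write check: exact column set and matching types.
        return (
            isinstance(record, dict)
            and record.keys() == ORDERS_SCHEMA.keys()
            and all(isinstance(record[col], typ) for col, typ in ORDERS_SCHEMA.items())
        )


lake = Lakehouse()
results = [
    lake.ingest({"order_id": 1, "customer": "acme", "total": 19.99}),  # structured
    lake.ingest("2021-03-01 GET /pricing 200"),                        # unstructured
    lake.ingest({"order_id": "2", "customer": "acme", "total": 5.0}),  # wrong type
]
print(results, len(lake.raw), len(lake.curated))
```

Every record stays available in the raw store for data science work, while reporting tools would read only from the curated side, where the schema has already been checked.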
Specific information on lakehouse trends (such as who is currently using them) is scarce. However, based on what we've already discussed, companies both mature enough to have BI needs and cutting-edge enough to invest in machine learning could benefit. Furthermore, with AI looming as a disrupter across industries, sectors from finance to healthcare to transportation can stand to benefit from this new approach to storage.
There are several options for building a lakehouse, so data managers have a choice when it comes to their preferred solution. There's Google BigQuery, Azure Synapse, and Apache Drill. AWS even offers one through Athena, its existing service for querying data in S3.
In a data lakehouse, ACID transactions remain in place, and BI tools such as Tableau can still pull from the data, so critical business decisions can get made. From a technical perspective, it also means a smaller list of hostnames to maintain. The data all reside on one platform, which means employees don't have to keep track of varying connection information for multiple platforms. Additionally, the security overhead, bloated infrastructure expenses, and duplicated data woes that come when maintaining both a lake and a warehouse get reduced.
Lakehouses are also simple to install and easy to manage. Because of this, smaller companies with small data and ops teams can benefit from the added capabilities without noticing the compromises; the level of maturity where those would become a pain point is still past the horizon for them.
The main drawback, simply put, is that the hybrid is still in its early stages, with the term itself first appearing around 2017. According to Advancing Analytics, the lakehouse offerings - and as we've talked about, there are a few - haven't yet evolved to the level of functionality BI professionals expect. Although data lake platforms are adding warehouse-style capabilities at the enterprise level, it could be at least a few years before a lakehouse can stand up to the mature databases.
And it goes the other way too: Azure Synapse, which started on the warehouse side as SQL Data Warehouse before being rebranded, now brings in Apache Spark, but only as select feature previews for now. As it stands, no single tool is ready to serve all job functions of an enterprise organization in the necessary capacity, and simplicity alone won't outweigh the need for performance.
Integrate.io can mitigate some of this through its solution, reducing mental overhead on the organization's part while still keeping an org's lake and warehouse logically separate.
Related reading: Data Lake Vs Data Warehouse: 7 Critical Differences
In this article, we've defined a data lakehouse and discussed its advantages and disadvantages. While a lakehouse might offer simplicity and flexibility, no single tool currently in the marketplace can handle the capacity needs of the enterprise quite yet. For startups and small companies that need both a lake and a warehouse, however, a lakehouse might be worth looking into. For the rest in the enterprise world, it's worth keeping an ear to the ground.
At Integrate.io, we can integrate data warehouses with data lakes to give you the benefits of a data lakehouse. See the integrations we offer and get started today!