This is a guest post for Integrate.io written by Bill Inmon, an American computer scientist recognized as the "father of the data warehouse." Inmon wrote the first book and first magazine column about data warehousing, held the first conference about this topic, and was the first person to teach data warehousing classes.
Five things to know about this topic:
- Data architecture has evolved over the years. Before data warehouses, simple applications handled data.
- Data warehouses revolutionized data architecture by creating large volumes of structured data.
- Textual ETL allows businesses to handle unstructured data.
- Data lakehouses and machine-generated data also transformed data architecture.
- Integrate.io is a data warehousing solution that can handle your data integration requirements.
An evolution of data architecture has been occurring. First, there were simple applications. Then, there were data warehouses. Then, people added text to data warehouses. Now, there are data lakehouses. Each of these transformations has had its own similarities and peculiarities. This article examines the architectural transformations occurring today.
Data architecture began innocently enough in the 1960s with the advent of the first application, and it has been evolving ever since. Most evolutions occur at glacial speed. The evolution of data architecture has proceeded at the speed of light. This article describes that evolution and the state of affairs in today’s world.
Table of Contents
- Before the Data Warehouse
- The Rise of the Data Warehouse
- The Evolution of the Warehouse
- Challenges of Text
- How Textual ETL Helps
- Combining Textual Data and Structured Data
- Machine-Generated Data
- Challenges of Machine-Generated Data
- Data Warehouse vs. Data Lakehouse
Integrate.io has seen the evolution of both the data warehouse and data lakehouse. This data warehouse integration tool can integrate warehouses with lakes to give your business all the benefits of a lakehouse. Integrate.io also streamlines the data integration process with ETL, Reverse ETL, and super-fast Change Data Capture (CDC) tools. Why not try Integrate.io yourself with a 14-day free trial?
Before the Data Warehouse
In the beginning, there were applications (Jarke, et al., 2000). These applications greatly relieved burdensome work (Gould, et al., 1991). Applications were created by studying end-user requirements, then tailoring the application to the requirements. To expedite the development process, the requirements gathered were very specific to the end user’s immediate needs (Gould, et al., 1991). Soon, there were lots of applications in organizations (DeLone, 1988; Vanlommel and De Brabander, 1975).
Then one day, someone wanted to find data not for a given application but across the entire organization. There was no shortage of data. Data was everywhere (McCune, 1998). The problem was that the same element of data appeared in multiple places, each with a different value (Codd, 1970; Date, 1986; Ullman, 1982). Analytic processing was extremely difficult because no one was ever sure which value was right and which was wrong. And this made analytical results very questionable (Chen, et al., 2017).
In hindsight, if gathering requirements included an enterprise perspective, then there might not have been such a logjam (Inmon, 2005). But, people repeatedly narrowed requirements because of the need to meet a deadline for the development of an application.
The Rise of the Data Warehouse
Into this world of questionable values of data came the data warehouse. Data warehouses vetted and integrated data, and the result was a corporate understanding of data. This corporate understanding of data created an analytical environment where management could rely on the data being analyzed (Inmon, 2005).
The data warehouse was (and still is) defined as a subject-oriented, integrated, non-volatile, time-variant collection of data that supports managerial decisions.
In addition to containing the vetted data for the corporation, the data warehouse included a lengthy historical record of data. Typically, the data warehouse holds 5-10 years' worth of data (Inmon, 2005).
Designers typically build data warehouses using a transformation process called ETL (Extract, Transform, Load). Data in the applications world transforms into a corporate mold (Golfarelli and Rizzi, 2009). The application designer can select any interpretation of data that they wish. However, the corporate understanding of data requires a single interpretation across all of the corporation (Inmon, 2005).
As a simple example of the transformation of application data to corporate data, suppose three applications each record gender, measurement, and currency in their own way:
- Application ABC: gender as male/female, measurement in inches, dollars Australian
- Application BCD: gender as m/f, measurement in cms, dollars Canadian
- Application CDE: gender as 1/0, measurement in inches, dollars US
When the application data passes through ETL, data converts to a single corporate interpretation:
- Gender – m/f
- Measurement – inches
- Dollars - US
ETL produces data that has a single corporate interpretation.
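The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's ETL implementation. The function `to_corporate`, the lookup tables, and the exchange rates are hypothetical, and the reading of Application CDE's 1/0 as male/female is an assumption, since the example does not say which value means which.

```python
from dataclasses import dataclass

CM_PER_INCH = 2.54  # exact by definition
# Placeholder exchange rates to US dollars (assumed values, for illustration only).
USD_PER_UNIT = {"AUD": 0.70, "CAD": 0.75, "USD": 1.00}

# How each application encodes gender, mapped to the corporate m/f form.
# Assumption: Application CDE uses 1 for male and 0 for female.
GENDER_MAPS = {
    "ABC": {"male": "m", "female": "f"},
    "BCD": {"m": "m", "f": "f"},
    "CDE": {"1": "m", "0": "f"},
}
LENGTH_UNIT = {"ABC": "in", "BCD": "cm", "CDE": "in"}
CURRENCY = {"ABC": "AUD", "BCD": "CAD", "CDE": "USD"}


@dataclass
class CorporateRecord:
    gender: str        # "m" or "f"
    length_in: float   # inches
    amount_usd: float  # US dollars


def to_corporate(app, gender, length, amount):
    """Transform one application record into the single corporate interpretation."""
    g = GENDER_MAPS[app][str(gender).lower()]
    inches = length / CM_PER_INCH if LENGTH_UNIT[app] == "cm" else length
    usd = amount * USD_PER_UNIT[CURRENCY[app]]
    return CorporateRecord(g, round(inches, 2), round(usd, 2))


records = [
    to_corporate("ABC", "female", 10.0, 100.0),
    to_corporate("BCD", "m", 25.4, 100.0),
    to_corporate("CDE", 1, 12.0, 100.0),
]
for record in records:
    print(record)
```

Whatever the source application, every record comes out with gender as m/f, measurement in inches, and dollars in US currency. A real pipeline would look up exchange rates by date rather than hard-code them.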
Integrate.io helps you move data to a supported warehouse via native out-of-the-box no-code connectors. You can ETL data from a source to a target destination without the hassle. Integrate.io also performs ELT, Reverse ETL, and super-fast CDC. Try Integrate.io for yourself with a 14-day free trial.
The Evolution of the Warehouse
In the early days of data warehouses, programmers manually produced ETL programs (Inmon, 2005). But quickly, there grew an industry of ETL software that automatically produced data warehouses (Vassiliadis, 2009).
Data warehouses revolutionized the business intelligence industry. Doing business intelligence before the data warehouse was a hit-and-miss proposition. But with the advent of the data warehouse, business intelligence had a foundation on which to thrive (Almeida, et al., 1999).
Data warehouses were ubiquitous in that they applied to all industries and organizations (Ariyachandra and Watson, 2008). Data warehousing applies to retailers, manufacturers, insurance companies, banks and finance, government agencies, hospitality organizations, airlines, sports organizations, entertainment, and many others (Chaudhuri and Dayal, 1997). Data warehouses are found on all continents of the earth (Salinesi and Gam, 2006; Gould, et al., 1991).
One characteristic of data warehousing is that it created volumes of data never before imagined. Data warehousing stored historical data — something rarely done for transaction-based systems (Kimball, et al., 2008).
Data warehousing was a concept and was not owned by any vendor. Vendors supplied different aspects of a data warehouse, but at no time did any vendor own a warehouse.
For all of the advantages of the data warehouse, there were some limitations. One limitation was that the data warehouse handled structured data exclusively. Structured data was typically transaction-based, meaning it could be gathered and stored in a highly structured manner (Inmon, 2005).
The problem was that a lot of other data exists in an unstructured format. Principally, there is text, and text is notoriously unstructured. In addition, text is very complex. Compounding the problem is the fact that there is a tremendous business opportunity found in text. There are medical records, call center conversations, contracts, emails, and many other places where text-based data sets exist. And these places have only been scantily explored and exploited in the past (Blumberg and Atre, 2003).
Into this equation comes the technology known as textual ETL. Textual ETL performs the job of disambiguating text. Once text becomes disambiguated, it can go through analysis.
Challenges of Text
There are many challenges in the disambiguation of text (Cembalo, et al., 2012). The first and foremost challenge is that of identifying the context of text. You cannot do a serious job of disambiguation unless you address both text and context. Textual ETL does exactly that (Inmon, 2018).
But there are other challenges. One is that of properly managing predictable text. Text can exist in two classes: predictable text and unpredictable text (O’Brien and Myers, 1985). Most text is unpredictable. But some text is predictable. And identifying the context of predictable text is very different from the identification of the context of unpredictable text (Inmon, 2015).
Taxonomies and ontologies are useful in managing the disambiguation of unpredictable text, whereas inline contextualization is useful in managing the disambiguation of predictable text (Inmon, 2017). And taxonomies and ontologies are as different from inline contextualization as chalk is from cheese. There simply is little or no similarity between ontologies and inline contextualization.
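To make the contrast concrete, here is a toy Python sketch of the two approaches. It is only an illustration under stated assumptions, not how any textual ETL product works internally. The taxonomy entries, the label patterns, and both function names are hypothetical. The taxonomy function looks up each word in an external classification. The inline function takes context from a fixed label that sits next to the value, which is one simple reading of inline contextualization.

```python
from __future__ import annotations

import re

# Hypothetical mini-taxonomy: word -> broader category that supplies its context.
TAXONOMY = {
    "honda": "car",
    "porsche": "car",
    "aspirin": "medication",
    "ibuprofen": "medication",
}

# Hypothetical inline rules for predictable text: the value that follows a
# fixed label takes its context from that label.
INLINE_RULES = {
    "contract_number": re.compile(r"Contract No:\s*(\w+)"),
    "effective_date": re.compile(r"Effective Date:\s*(\d{4}-\d{2}-\d{2})"),
}


def contextualize_with_taxonomy(text: str) -> list[tuple[str, str]]:
    """Return (word, context) pairs for words found in the taxonomy."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(w, TAXONOMY[w]) for w in words if w in TAXONOMY]


def contextualize_inline(text: str) -> list[tuple[str, str]]:
    """Return (value, context) pairs found by the label-based rules."""
    pairs = []
    for context, pattern in INLINE_RULES.items():
        for match in pattern.finditer(text):
            pairs.append((match.group(1), context))
    return pairs


note = "Patient drove a Honda to the clinic and was given ibuprofen."
form = "Contract No: A1234\nEffective Date: 2021-03-01"

print(contextualize_with_taxonomy(note))
# [('honda', 'car'), ('ibuprofen', 'medication')]
print(contextualize_inline(form))
# [('A1234', 'contract_number'), ('2021-03-01', 'effective_date')]
```

The free-form note depends on outside knowledge to find context, while the form carries its context in its own layout. That difference is why the two techniques share so little.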
And there are other struggles when it comes to text. Merely accessing the text is a challenge (Akilan, 2015). The internet has its own set of considerations (Chen, et al., 2015). Emails have a different set of considerations (Jlailaty, et al., 2018). Voice-to-text transcription has its own set of considerations (Botzenhardt, et al., 2011). Optical character recognition has yet another set of considerations (Hamad and Mehmet, 2016). Each of the different media on which text exists has its own set of unique considerations that must be taken into account before any disambiguation can occur.
Further complicating things is the fact that text comes in different languages. And each language has its own set of considerations. There is the alphabet. There are idioms. There are dialects. There is language structure. There is the direction the text is written in. In short, dealing with text is a complex task.
How Textual ETL Helps
There is technology that takes into account all of the above considerations. That technology is textual ETL, which reads unstructured text and turns it into a structured database format. Textual ETL considers text and context, taxonomies and ontologies, language differences, alphabet differences, etc. Textual ETL results in a neatly structured database as output (Inmon and Nesavich, 2007).
Once the text comes out as a database, it can then go through analysis with standard analytical tools. Textual ETL then produces output that allows text – in the form of a database – to enter a data warehouse. Doing this can dramatically increase the range of opportunities afforded by a data warehouse (Inmon and Nesavich, 2007).
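As a rough sketch of what "text in the form of a database" makes possible, the Python example below loads a few contextualized words into an in-memory SQLite table and queries them with ordinary SQL. The table name `text_facts`, its columns, and the sample rows are hypothetical. They stand in for whatever a textual ETL step would actually emit.

```python
import sqlite3

# Hypothetical output of a textual ETL step: one row per contextualized word.
rows = [
    ("call-001", "refund", "billing"),
    ("call-001", "late", "delivery"),
    ("call-002", "refund", "billing"),
    ("call-003", "broken", "product quality"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text_facts (doc_id TEXT, word TEXT, context TEXT)")
conn.executemany("INSERT INTO text_facts VALUES (?, ?, ?)", rows)

# Standard SQL now works on what started as free-form text.
counts = conn.execute(
    "SELECT context, COUNT(*) FROM text_facts GROUP BY context ORDER BY context"
).fetchall()
conn.close()

print(counts)
# [('billing', 2), ('delivery', 1), ('product quality', 1)]
```

Once the text sits in rows and columns, any standard analytical tool that speaks SQL can work with it.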
Combining Textual Data and Structured Data
There are some issues with combining textual data with classical structured data. The issues center around finding a common set of attributes to do analytics around. Most text (conversations, articles, etc.) does not have the key structure information found in structured data. So, in many cases, comparing textual data to structured data is difficult, even when the textual data can be rendered into a database format (Inmon and Krishnan, 2011).
Nevertheless, the ability to add textual data in a format for analysis enhances the range of possibilities for a data warehouse (Inmon and Krishnan, 2011).
Machine-Generated Data
But there is yet another type of data found in the corporation. That data is machine-generated data, which is data created and transmitted mechanically.
There are many different kinds of machine-generated data, such as:
- Manufacturing control machines that measure all sorts of things about manufactured material
- Transportation machines that measure the activity taking place on specified property
- Drones that keep track of parking at retail stores
- Telemetry information coming from aeronautics, such as airplane black boxes
Machine-generated data has some unique properties. Much of it has little or no value. But some machine-generated data has great value (Taleb, et al., 2018).
Another characteristic of machine-generated data is that huge amounts of data can be generated. The amount of data generated by a machine eclipses the amount of data generated by both text and structured data. The sheer volume of data from a machine presents its own challenges (Eberendu, 2016).
In the early days, machine-generated data went into a data lake. A data lake was merely a holding place for the machine-generated data. To use the data in the data lake, the data lake needed an infrastructure placed over it (Armbrust, et al., 2021; Inmon, Building the Data Lakehouse).
The purpose of this infrastructure is to expedite the analytical usage of data found in the data lake. Some of the elements of the infrastructure include:
- Data relationships
- Summarization algorithms
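As one illustration of what a summarization algorithm might look like, the Python sketch below collapses raw machine readings into one summary row per machine and hour. The reading layout, the machine names, and the `summarize` function are assumptions made for the example, not part of any particular lakehouse product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (machine_id, hour_of_day, temperature).
readings = [
    ("press-1", 9, 70.0),
    ("press-1", 9, 74.0),
    ("press-1", 10, 90.0),
    ("press-2", 9, 65.0),
]


def summarize(readings):
    """Collapse raw readings into one summary per (machine, hour)."""
    groups = defaultdict(list)
    for machine, hour, value in readings:
        groups[(machine, hour)].append(value)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for key, values in groups.items()
    }


summary = summarize(readings)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

An analyst can then start from a handful of summary rows and go back to the raw readings only when a summary looks unusual.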
Challenges of Machine-Generated Data
One of the biggest challenges the data scientist faces is separating the useful data from the extraneous data in machine-generated data (Inmon, 2016). To understand the issues facing the data scientist, consider a surveillance camera that looks at people entering and leaving a store doorway. The camera is turned on 24 hours a day and takes an image every tenth of a second.
One day, the store manager needs to look at the surveillance data because of a break-in. The manager searches for someone entering the building who should not be there. The manager must go through a huge number of images to find the one image they are looking for. There might be some very important data in the images, but there is a huge amount of data hiding the one important image.
This same problem — large amounts of data where only a small amount is important — recurs in almost every case where there is machine-generated data (Inmon, 2016).
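The scale of the surveillance example is easy to work out: ten images per second for 24 hours comes to 864,000 images per day. The Python sketch below computes that figure and then shows one simple way to thin the pile, by keeping only frames that differ noticeably from the frame before them. The frame layout, the threshold value, and the `changed_frames` function are illustrative assumptions. Real video analytics would use far more robust methods.

```python
SECONDS_PER_DAY = 24 * 60 * 60
FRAMES_PER_SECOND = 10  # one image every tenth of a second
frames_per_day = SECONDS_PER_DAY * FRAMES_PER_SECOND
print(frames_per_day)  # 864000


def changed_frames(frames, threshold):
    """Yield (index, frame) for frames that differ enough from the previous one."""
    previous = None
    for index, frame in enumerate(frames):
        if previous is not None:
            # Mean absolute difference between corresponding pixel values.
            diff = sum(abs(a - b) for a, b in zip(frame, previous)) / len(frame)
            if diff > threshold:
                yield index, frame
        previous = frame


# Tiny stand-in for a camera feed: each frame is a flat list of pixel values.
feed = [
    [10, 10, 10, 10],
    [10, 11, 10, 10],  # sensor noise only
    [10, 90, 95, 10],  # something enters the doorway
    [10, 90, 95, 10],  # it stays still
    [10, 10, 10, 10],  # it leaves
]

interesting = [index for index, _ in changed_frames(feed, threshold=5)]
print(interesting)  # [2, 4]
```

Only two of the five frames survive the filter, which are the moments when the scene changed. The same idea, applied to a full day of images, leaves the manager a far smaller set to review.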
When you place the analytical structure over the data lake, you can call the end result the data lakehouse. Once you create the analytical infrastructure, the data lakehouse becomes the place to turn to for end-user access and analysis (L’Esteve, 2021; Shiyal, 2021).
Can you put data other than machine-generated data in a data lakehouse? Of course. You can place textual data, structured data, and other data types (Shiyal, 2021). And in many cases, it is advantageous to do so.
Once you create the analytical infrastructure for the data lakehouse, you can blend the data found in the lakehouse with data in the data warehouse (Shiyal, 2021). In doing so, you can create a powerful new kind of analytics.
Data Warehouse vs. Data Lakehouse
One interesting question is: "Is a data warehouse the same thing as a data lakehouse?" To understand this question, consider two cousins sitting in a room. One is a young man named Jack, and the other is a young lady named Edith. Are Jack and Edith the same thing? Certainly, they have a lot of similarities. Their faces look similar. Their body types are similar. Their ethnic origins are similar. But there are differences as well. Jack is a male, and Edith is a female. They certainly share a common DNA, but they do not have the same DNA.
So, despite their similarities, Jack and Edith are not the same person.
The same principle applies when comparing the data warehouse with the data lakehouse.
Evolutions do not end. And the data lakehouse will not be the end of the evolution of data architecture. But the data lakehouse represents the evolution of data architecture as people know it today.
Bill Inmon, the father of the data warehouse, has authored 65 books. Computerworld named him one of the ten most influential people in the history of computing. Inmon's Castle Rock, Colorado-based company Forest Rim Technology helps companies hear the voice of their customers. See more at www.forestrimtech.com.
Data warehouses, data lakes, and databases are core components of your technical architecture. Integrate.io's philosophy is to simplify data integration, allowing you to move data from sources to a supported target system without advanced coding or data engineering. Then you can generate unparalleled insights about your business for better decision-making. Schedule a demo now.
- Akilan, A. (2015, February). Text mining: Challenges and future directions. In 2015 2nd International Conference on Electronics and Communication Systems (ICECS) (pp. 1679-1684). IEEE.
- Almeida, M. S., Ishikawa, M., Reinschmidt, J., & Roeber, T. (1999). Getting started with data warehouse and business intelligence. IBM Redbooks.
- Ariyachandra, T., & Watson, H. J. (2008). Technical opinion: Which data warehouse architecture is best? Communications of the ACM, 51(10), 146-147.
- Armbrust, M., Ghodsi, A., Xin, R., & Zaharia, M. (2021). Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR.
- Blumberg, R., & Atre, S. (2003). The problem with unstructured data. Dm Review, 13(42-49), 62.
- Botzenhardt, A., Witt, A., & Maedche, A. (2011). A Text Mining Application for Exploring the Voice of the Customer.
- Cembalo, A., Pisano, F. M., & Romano, G. (2012, July). An approach to document warehousing system lifecycle from textual ETL to multidimensional queries: A proof-of-concept prototype. In 2012 Sixth International Conference on Complex, Intelligent, and Software Intensive Systems (pp. 828-835). IEEE.
- Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM Sigmod record, 26(1), 65-74.
- Chen, F., Deng, P., Wan, J., Zhang, D., Vasilakos, A. V., & Rong, X. (2015). Data mining for the internet of things: literature review and challenges. International Journal of Distributed Sensor Networks, 11(8), 431047.
- Chen, Q., Zobel, J., & Verspoor, K. (2017). Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database, 2017.
- Codd, E.F. (1970). A relational model for large shared data banks. Communications of the ACM, 13, 377-387.
- Date, C. J. (1986). An introduction to database systems (4th ed., vol. 1). Reading, MA: Addison-Wesley.
- DeLone, W. H. (1988). Determinants of success for computer usage in small business. MIS quarterly, 51-61.
- Eberendu, A. C. (2016). Unstructured Data: an overview of the data of Big Data. International Journal of Computer Trends and Technology, 38(1), 46-50.
- Gould, J. D., Boies, S. J., & Lewis, C. (1991). Making usable, useful, productivity-enhancing computer applications. Communications of the ACM, 34(1), 74-85.
- Hamad, K. A., & Mehmet, K. A. Y. A. (2016). A detailed analysis of optical character recognition technology. International Journal of Applied Mathematics Electronics and Computers, (Special Issue-1), 244-249.
- Inmon, W. H. (2005). Building the data warehouse. John Wiley & Sons.
- Inmon, B. (2016). Data Lake Architecture: Designing the Data Lake and avoiding the garbage dump. Technics Publications.
- Inmon, B. (2017). Turning text into gold: Taxonomies and textual analytics. Technics Publications.
- Inmon, W. H. (2019). Class notes, Practical Textual Analytics. Forest Rim Technology, Denver, Colorado.
- Inmon, B., & Krishnan, K. (2011). Building the Unstructured Data Warehouse: Architecture, Analysis, and Design. Technics publications.
- Inmon, W. H., & Nesavich, A. (2007). Tapping into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence. Pearson Education.
- Jarke M., Lenzerini M., Vassiliou Y., Vassiliadis P. (2000) Data Warehouse Practice: An Overview. In: Fundamentals of Data Warehouses. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04138-3_1
- Jlailaty, D., Grigori, D., & Belhajjame, K. (2018, May). Email business activities extraction and annotation. In International Workshop on Information Search, Integration, and Personalization (pp. 69-86). Springer, Cham.
- Kimball, R., Ross, M., Thornthwaite, W., Becker, B., & Mundy, J. (2008). The Data Warehouse Lifecycle Toolkit. John Wiley & Sons.
- L’Esteve, R. C. (2021). Delta Lake. In The Definitive Guide to Azure Data Engineering (pp. 321-346). Apress, Berkeley, CA.
- McCune, J. C. (1998). Data, data, everywhere. Management Review, 87(10), 10.
- O'Brien, E. J., & Myers, J. L. (1985). When comprehension difficulty improves Memory for Text. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(1), 12.
- Salinesi, C., & Gam, I. (2006, June). A requirement-driven approach for designing data warehouses. In Requirements Engineering: Foundations for Software Quality (REFSQ'06) (p. 1).
- Golfarelli, M., & Rizzi, S. (2009). Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill.
- Shiyal, B. (2021). Modern Data Warehouses and Data Lakehouses. In Beginning Azure Synapse Analytics (pp. 21-48). Apress, Berkeley, CA.
- Taleb, I., Serhani, M. A., & Dssouli, R. (2018, November). Big data quality assessment model for unstructured data. In 2018 International Conference on Innovations in Information Technology (IIT) (pp. 69-74). IEEE.
- Ullman, J.D. (1982). Principles of database systems. Rockville, Maryland: Computer Sciences Press.
- Vassiliadis, P. (2009). A survey of extract–transform–load technology. International Journal of Data Warehousing and Mining (IJDWM), 5(3), 1-27.
- Vanlommel, E., & De Brabander, B. (1975). The organization of electronic data processing (EDP) activities and computer use. The Journal of Business, 48(3), 391-410.
- Weinberg, G. (1971). The Psychology of Computer Programming.