From system logs to web scraping, there are many good reasons why you might have extremely large numbers of small data files at hand. But how can you efficiently process and analyze these files to uncover the hidden insights that they contain?
You might think that you could process these small data files using a solution like Apache Hadoop, which has been specifically designed for handling large datasets. However, Hadoop has an infamous technical quirk known as the “small file problem.” This makes Hadoop much better suited to handling a single large file than to handling the same data split up into many smaller files.
The good news is that you can still use Hadoop to process your small data—you just might have to get a little creative. Below, we’ll go over 5 different ways that you can process small data files with Hadoop.
Table of Contents
- What is Small Data?
- What is Hadoop?
- Why Process Small Data with Hadoop? The "Small Files Problem"
- 5 Ways to Process Small Data with Hadoop
What is Small Data?
You’ve heard about “big data,” so what is small data? Big data is, by definition, too massive in terms of its volume, velocity, variety, and/or veracity to be used and managed by human beings in its raw state. Working with big data thus requires automated systems that are specifically designed for collecting, processing, and analyzing large datasets.
On the other hand, small data is data that is of a small enough size to be accessible and actionable, i.e. able to be used, managed, and understood by human beings. “Anything that can fit in an Excel file” might be a workable definition of small data. Examples of small data include sports scores, quarterly reports, weather forecasts, and text articles scraped from Wikipedia.
Despite their opposite names, big data and small data don’t need to be at odds with each other. When small data consists of many different files, each one of a small size, it can become large and complex enough that it requires the use of big data tools to handle—in fact, that’s precisely the situation we’ll be addressing here.
What is Hadoop?
Apache Hadoop is a collection of open-source software that has been purpose-built to process extremely large quantities of data. Created by Doug Cutting and Mike Cafarella and developed extensively at Yahoo, Hadoop is now released and maintained by the Apache Software Foundation, a non-profit organization for building and distributing open-source software. Companies such as Cloudera have also built on top of the Hadoop framework to offer their own enterprise version of Hadoop.
One of the greatest benefits of Hadoop is that it can run on commodity hardware—i.e. IT equipment that is relatively inexpensive and widely available—instead of requiring massively powerful supercomputers. More specifically, Hadoop works by sharing portions of the data across many different computers that are organized into clusters; each cluster operates as a single unit that contains multiple storage and processing units in a single location. Hadoop, therefore, has tremendous scalability, with potentially thousands of computers working together to process very large data sets.
Another advantage of Hadoop is that it treats hardware failure as highly likely if not inevitable, which is a consequence of working with commodity hardware. As such, Hadoop automatically replicates the data for redundancy, fault tolerance, and high availability. Each block of data is replicated across multiple nodes (i.e. computers), and typically across multiple racks, within the cluster.
“Hadoop” is actually a catchall term for the five different modules (i.e. components) that make up the Hadoop open-source project:
- Hadoop Common: The main libraries and utilities that are essential for the functioning of the Hadoop program and the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system written in the Java programming language for storing data across different computers in multiple locations. HDFS has high throughput (i.e. the amount of work done per unit time) thanks to its "write once, read many" data storage model.
- Hadoop MapReduce: A Hadoop implementation of the MapReduce paradigm for parallel processing of large datasets. In MapReduce, data is first split into independent chunks to which some function is applied (the map stage); the intermediate results are then grouped by key and aggregated into a smaller set of outputs (the reduce stage). Hadoop can run MapReduce jobs written in multiple programming languages, including Java, Ruby, Python, and C++.
- Hadoop YARN ("Yet Another Resource Negotiator"): A tool for Hadoop cluster management, resource management, and job scheduling.
- Hadoop Ozone: A scalable distributed object store that is intended for handling both large and small files (unlike HDFS, which is optimized for large files only).
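The map and reduce stages described in the MapReduce bullet above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the paradigm, not Hadoop's API: the function names (`map_stage`, `shuffle`, `reduce_stage`) and the sample chunks are made up for the example.

```python
from collections import defaultdict

def map_stage(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as a framework would between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Two independent chunks stand in for the splits of a large dataset.
chunks = ["small files small data", "big data big files"]

mapped = [pair for chunk in chunks for pair in map_stage(chunk)]
word_counts = dict(reduce_stage(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

In a real cluster, each chunk would be mapped on a different node and the grouping step would move data across the network, which is where the parallelism comes from.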
Although these five modules are the core of the Hadoop project, the Hadoop ecosystem is often considered to include other adjacent projects—Apache HBase, Apache Pig, Apache Sqoop, Apache Flume, Apache ZooKeeper, Apache Nutch, Apache Oozie, etc.—that work directly with Hadoop. These days, Hadoop is widely applied to use cases such as big data analytics and machine learning—anything that can benefit from massively distributed computing.
Hadoop's main "competitors" include Apache Spark, which extends the functionality of Hadoop MapReduce to include real-time stream processing and interactive queries (see our article “Spark vs. Hadoop MapReduce”). We’ve also written extensively on Hadoop in articles like “Hadoop vs. Redshift,” “Integrating Relational Databases with Apache Hadoop,” and "12 SQL-on-Hadoop Tools," so check those out if you’re curious.
For information on Integrate.io's native Hadoop HDFS connector, visit our Integration page.
Why Process Small Data with Hadoop? The “Small Files Problem”
As you can imagine, Hadoop is quite a powerful tool for crunching extremely large amounts of data. But what about using Hadoop to process small data? It turns out that this is much less efficient, due to certain technical limitations of the Hadoop platform.
In Hadoop, a “small file” is defined as one that is smaller than the block size in HDFS, which is typically 64 or 128 megabytes. (Yes, we’re speaking in relative terms here.) But HDFS was explicitly built for working with very large files, and that intention expresses itself in one crucial technical detail.
The NameNode in HDFS is the master server that manages the file system namespace and stores metadata, while DataNodes are the worker nodes that store the data itself. Every file, directory, and block in HDFS is an object that occupies roughly 150 bytes in the NameNode’s memory. Doing a little back-of-the-envelope calculation, if you have on the order of 10 million small files, each one fitting in a single block, you’ll have 10 million file inodes and 10 million blocks, for a total of 20 million objects * 150 bytes = 3 gigabytes. Having a large number of files quickly becomes untenable in Hadoop due to memory constraints, even on upscale commodity hardware.
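The back-of-the-envelope calculation above is easy to reproduce. The sketch below uses the rough 150-bytes-per-object figure from the text; the comparison case (every small file being 1 MB, merged into 128 MB single-block files) is an assumption added purely for illustration.

```python
BYTES_PER_OBJECT = 150  # rough per-object NameNode cost quoted in the text

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory: one inode per file plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million small files, each occupying a single block.
small_files_bytes = namenode_memory_bytes(10_000_000)
print(f"small files: {small_files_bytes / 1e9:.1f} GB")

# Assumption for comparison: each small file is 1 MB, and the same data is
# merged into files of 128 MB that each fill exactly one block.
merged_file_count = 10_000_000 // 128
merged_bytes = namenode_memory_bytes(merged_file_count)
print(f"merged files: {merged_bytes / 1e6:.1f} MB")
```

Under these assumptions the same data costs the NameNode a few tens of megabytes instead of several gigabytes, which is the motivation for the merging techniques that follow.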
5 Ways to Process Small Data with Hadoop
Due to the Hadoop “small files problem,” it’s not advisable or realistic for data scientists to handle many millions of small files using Hadoop. The good news is that there are still multiple options for processing small data with Hadoop—5 of which we’ll outline below.
1. Concatenating text files
Perhaps the simplest solution for processing small data with Hadoop is to concatenate all of the many small data files. Website logs, emails, or any other data that is stored in text format can be concatenated from many small data files into a single large file.
The easiest way to concatenate text files is with the “cat” terminal command in Unix-based operating systems (macOS and Linux), or with the copy command in Windows. Because Hadoop's default text input format processes data line by line, the records will be handled the same way before and after concatenation, provided that each file ends with a newline.
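If you would rather script the concatenation, a minimal Python sketch might look like the following. The helper name `concatenate_text_files`, the `*.log` pattern, and the file names are hypothetical; the one design choice worth noting is that a newline is added after any file that lacks one, so the last record of one file does not fuse with the first record of the next.

```python
import tempfile
from pathlib import Path

def concatenate_text_files(source_dir, pattern, output_path):
    """Merge every matching file into one output file, in sorted name order."""
    sources = sorted(Path(source_dir).glob(pattern))
    with open(output_path, "wb") as out:
        for source in sources:
            data = source.read_bytes()  # fine here because each file is small
            out.write(data)
            # Keep records separate if a file has no trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(sources)

# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.log").write_bytes(b"line 1\nline 2\n")
    (tmp_dir / "b.log").write_bytes(b"line 3")  # no trailing newline
    (tmp_dir / "c.log").write_bytes(b"line 4\n")
    merged_count = concatenate_text_files(tmp_dir, "*.log", tmp_dir / "combined.txt")
    merged_text = (tmp_dir / "combined.txt").read_bytes().decode()

print(merged_count, repr(merged_text))
```

The merged file can then be uploaded to HDFS as a single large file instead of many small ones.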
2. Hadoop archives
Concatenation works well for processing small text files with Hadoop, but what about binary data (e.g. images and videos)? In this case, you might be able to use Hadoop archives, which are specially formatted archives with the *.har extension that map to a file system directory.
Using Hadoop archives, you can combine small files of any format into a single archive via the command line. HAR files operate as another file system layer on top of HDFS, so the archived files can also be accessed directly using har:// URLs.
Below is an example of how to create a Hadoop archive using the command line:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
- “-archiveName” specifies the name of the archive (e.g. “foo.har”)
- “-p” specifies the parent path that the directories to archive are relative to (e.g. “/user/hadoop”)
- “dir1,” “dir2,” etc. specify the directories to place in the archive (in this case, “/user/hadoop/dir1” and “/user/hadoop/dir2”)
- “/user/zoo” specifies the location where the archive will be created (e.g. “/user/zoo/foo.har”)
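When archives are created from a script, it can help to assemble the command from its parts. The sketch below only builds the argument list for the command shown above; `build_har_command` is a hypothetical helper, not part of Hadoop, and the listing command assumes the `har:///` URL form with the cluster's default file system.

```python
def build_har_command(archive_name, parent_path, sources, destination):
    """Return the argv list for `hadoop archive`, using the flags shown above."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    if not sources:
        raise ValueError("at least one source directory is required")
    return ["hadoop", "archive", "-archiveName", archive_name,
            "-p", parent_path, *sources, destination]

har_command = build_har_command(
    "foo.har", "/user/hadoop", ["dir1", "dir2"], "/user/zoo"
)
print(" ".join(har_command))

# Assumed follow-up: list the archive's contents through the har:// layer.
list_command = ["hdfs", "dfs", "-ls", "har:///user/zoo/foo.har"]
print(" ".join(list_command))
```

On a machine where the Hadoop command-line tools are installed, either list could be passed to `subprocess.run(..., check=True)` to execute it.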
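3. Parquet
Parquet is an open-source columnar storage format that is available for any project in the Hadoop ecosystem, including related projects such as Apache Hive and Apache Impala. Rewriting many small files as a smaller number of large Parquet files reduces the number of files and blocks that the NameNode has to track.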
Instead of a traditional row-based format such as CSV or Avro, Parquet is column-oriented: the values of each table column are physically stored next to each other on disk. This approach has advantages such as more efficient use of storage and improved performance for queries on specific column values. While other file formats such as RCFile and ORC also use columnar storage, Parquet claims to have better performance due to efficient compression and encoding schemes.
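To make the row-versus-column distinction concrete, here is a small pure-Python sketch. It does not read or write Parquet files; it only shows the change of layout, and the sample records and the `to_columns` helper are invented for the example (every record is assumed to have the same fields).

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"host": "web-1", "status": 200, "bytes": 512},
    {"host": "web-2", "status": 404, "bytes": 128},
    {"host": "web-1", "status": 200, "bytes": 2048},
]

def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column-oriented layout: all values of one column sit next to each other.
columns = to_columns(rows)

# A query on a single column only needs to touch that column's values.
total_bytes = sum(columns["bytes"])
print(columns["status"], total_bytes)
```

Storing similar values side by side is also what makes columnar formats compress well: a column such as `status` tends to contain long runs of repeated values.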
4. Hadoop Ozone
If you like taking risks and trying new things, Hadoop Ozone might be the ideal solution for processing small files with Hadoop. Released in September 2020, Hadoop Ozone is the Hadoop project’s answer to the infamous small files limitation in HDFS. Ozone is intended to gracefully manage both small and large files, and can supposedly contain more than 100 billion objects in a single cluster.
5. Integrate.io
Last but certainly not least, we’d be remiss if we didn’t mention that you can also use Integrate.io to process small data with Hadoop. Integrate.io runs Hadoop under the hood and includes automatic optimization for small files in the cloud.
Integrate.io makes it easy for users to process small data with Hadoop. First, Integrate.io copies files to the cluster’s local HDFS and optimizes them during the copy process. The files are automatically deleted once the cluster has finished processing the data.
To take advantage of this feature, use Integrate.io’s file storage source component for reading multiple files in object stores (e.g. Amazon S3 or Google Cloud Storage). You’ll either need to create a new file storage connection or select an existing connection. In the connection properties, look for “Source action” and select “Copy, merge and process all files,” which is intended for working with many small files. When this option is selected, Integrate.io will first read all of your files, then merge them into larger files and process them.
As you can see, there's no shortage of options for processing small data with Hadoop, including using Integrate.io. But being able to perform Hadoop small data processing is just one of Integrate.io’s many nice qualities. Integrate.io is a feature-rich and user-friendly ETL solution that makes it easier than ever for businesses to integrate their enterprise data. Thanks to Integrate.io’s drag-and-drop visual interface and more than 100 built-in connections, even non-technical users can build robust data pipelines to their cloud data warehouse.
Ready to learn how Integrate.io can benefit your organization? Schedule a call with the Integrate.io team for a chat about your business needs and objectives, or to start your 14-day risk-free trial of the Integrate.io platform.