Storing Apache Hadoop Data on the Cloud - HDFS vs. S3

History is full of great rivalries: France versus England, Red Sox versus Yankees, Sherlock Holmes versus Moriarty, Ken versus Ryu in Street Fighter... When it comes to Apache Hadoop data storage in the cloud, though, the biggest rivalry lies between the Hadoop Distributed File System (HDFS) and Amazon's Simple Storage Service (S3).

While Apache Hadoop has traditionally worked with HDFS, S3 also meets Hadoop's file system requirements. Companies such as Netflix have used this compatibility to build Hadoop data warehouses that store information in S3, rather than HDFS.

So what's all the hype about with S3, and is S3 better than HDFS for Hadoop cloud data storage? To understand the pros and cons of HDFS and S3, let's resolve this tech rivalry... in battle!

Before we get started, we'll provide a general overview of S3 and HDFS and the points of distinction between them. The main differences between HDFS and S3 are:

Difference #1: S3 is more scalable than HDFS.
Difference #2: When it comes to durability, S3 has the edge over HDFS.
Difference #3: Data in S3 is always persistent, unlike data in HDFS.
Difference #4: S3 is more cost-efficient and likely cheaper than HDFS.
Difference #5: HDFS excels when it comes to performance, outshining S3.

What Is HDFS?

HDFS (Hadoop Distributed File System) was built to be the primary data storage system for Hadoop applications. A project of the Apache Software Foundation, HDFS seeks to provide a distributed, fault-tolerant file system that can run on commodity hardware.

The HDFS layer of a cluster comprises a master node (also called a NameNode) that manages one or more slave nodes, each of which runs a DataNode instance. The NameNode keeps track of the data's location, while the DataNodes are tasked with storing and retrieving this data. Because files in HDFS are automatically stored across multiple machines, HDFS has built-in redundancy that protects against node failures and data loss.
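
To make the division of labor concrete, here is a minimal sketch of the bookkeeping described above: a file is cut into fixed-size blocks, and each block is assigned to several DataNodes. The function names and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy, and the 64-megabyte block size simply mirrors the figure quoted in the durability round of this article.

```python
import itertools

BLOCK_SIZE_MB = 64   # block size quoted in this article; configurable in practice
REPLICATION = 3      # HDFS default replication factor


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a NameNode-style map: block index -> DataNodes holding a replica.

    Hypothetical round-robin placement, not the real HDFS policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    ring = itertools.cycle(datanodes)
    # Consecutive picks from the ring are distinct nodes because
    # replication <= len(datanodes).
    return {block: [next(ring) for _ in range(replication)]
            for block in range(num_blocks)}


def readable_after_failure(block_map, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(any(node != failed_node for node in nodes)
               for nodes in block_map.values())


if __name__ == "__main__":
    block_map = place_blocks(200, ["dn1", "dn2", "dn3", "dn4"])
    print(block_map)
    print(readable_after_failure(block_map, "dn2"))
```

In this toy model, losing any single DataNode leaves every block readable, which is the redundancy property the paragraph above describes.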

What Is Amazon S3?

Amazon S3 (Simple Storage Service) is a cloud IaaS (infrastructure as a service) solution from Amazon Web Services for object storage via a convenient web-based interface. According to Amazon, the benefits of S3 include "industry-leading scalability, data availability, security, and performance."

The basic storage unit of Amazon S3 is the object, which comprises a file with an associated key (its unique identifier within a bucket) and metadata. These objects are stored in buckets, which function similarly to folders or directories and which live within the AWS region of your choice.
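
The object model can be sketched as a small in-memory data structure. This is an illustration only, not the AWS SDK: the `S3Object` and `Bucket` classes and their methods are hypothetical names. It also shows that the "folders" people see inside a bucket are really just shared key prefixes.

```python
from dataclasses import dataclass, field


@dataclass
class S3Object:
    key: str                                   # identifies the object within its bucket
    data: bytes                                # the stored file contents
    metadata: dict = field(default_factory=dict)


@dataclass
class Bucket:
    name: str
    region: str
    objects: dict = field(default_factory=dict)

    def put(self, key, data, **metadata):
        """Store (or overwrite) an object under the given key."""
        self.objects[key] = S3Object(key, data, dict(metadata))

    def list_keys(self, prefix=""):
        """List keys that start with a prefix, which is how 'folders' are emulated."""
        return sorted(k for k in self.objects if k.startswith(prefix))


if __name__ == "__main__":
    bucket = Bucket(name="example-bucket", region="us-east-1")
    bucket.put("logs/2024/a.txt", b"first", content_type="text/plain")
    bucket.put("logs/2024/b.txt", b"second")
    bucket.put("images/c.png", b"\x89PNG")
    print(bucket.list_keys("logs/"))
```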

Round 1: HDFS Versus S3: Scalability

The showdown over scalability comes down to the question of manual versus automatic scaling.

HDFS relies on local storage, so you have to scale it yourself. If you want to increase your storage space, you'll either have to add larger hard drives to existing nodes (vertical scaling) or add more machines to the cluster (horizontal scaling). This is feasible but more costly and complicated than S3.

S3 scales automatically according to your current data usage, without any need for action on your part. Even better, Amazon doesn't have predetermined limits on total storage, so you have a practically unlimited amount of space available.

Bottom line: The first round goes to S3, thanks to its greater scalability, flexibility, and elasticity.

Round 2: HDFS Versus S3: Durability

Data "durability" refers to the ability to keep your information intact long-term in cloud data storage, without suffering bit rot or corruption. So for durability, which is better: S3 or HDFS?

A statistical model for HDFS data durability suggests that the probability of losing a block of data (64 megabytes in the model; newer Hadoop releases default to 128 megabytes) on a large 4,000-node cluster (16 petabytes total storage, 250,736,598 block replicas) is 0.00000057 (5.7 x 10^-7) in the next 24 hours and 0.00021 (2.1 x 10^-4) in the next 365 days. However, most clusters contain only a few dozen instances, and so the probability of losing data can be much higher.

S3 is designed for 99.999999999 percent durability of objects over a given year. This means that if you store 10,000,000 objects, you can expect to lose a single object once every 10,000 years on average (see the S3 FAQ).
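
The arithmetic behind that figure is easy to check. The sketch below assumes each object is lost independently with an annual probability of one minus the durability figure; the function names are ours, chosen for illustration.

```python
from fractions import Fraction

# 99.999999999 percent ("eleven nines") expressed as an exact fraction
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(num_objects, durability=ANNUAL_DURABILITY):
    """Expected number of objects lost per year across num_objects."""
    return num_objects * (1 - durability)


def years_per_lost_object(num_objects, durability=ANNUAL_DURABILITY):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(num_objects, durability)


if __name__ == "__main__":
    print(expected_annual_losses(10_000_000))   # 1/10000 objects per year
    print(years_per_lost_object(10_000_000))    # 10000 years
```

Using exact fractions avoids floating-point rounding, which matters when the loss probability is as small as one in a hundred billion.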

The news gets even better for S3 users. One of my colleagues at Integrat.io recently took an AWS workshop, and Amazon representatives reportedly claimed that they hadn’t actually lost a single object in the default S3 storage over the entire history of the service. (The cheaper Reduced Redundancy Storage (RRS) option, with the durability of only 99.99 percent, is also available.)

Bottom line: S3 wins again. Large clusters may have excellent durability, but in most cases, S3 is more durable than HDFS.

Round 3: HDFS Versus S3: Persistence

In the world of cloud data storage, "persistence" refers to the survival of data after the process that creates it has finished.

With HDFS running on instance (ephemeral) storage, data doesn't persist once you stop or terminate the EC2 or EMR instances. However, you can attach EBS volumes, at extra cost, in order to persist the data on EC2.

On the other hand, data is always persistent in S3—simple as that.

Bottom line: S3 comes out on top this round: it offers out-of-the-box data persistence, and HDFS doesn't.

Round 4: HDFS Versus S3: Price

In order to preserve data integrity, HDFS stores three copies of each block of data by default. This means exactly what it sounds like: HDFS requires triple the amount of storage space for your data—and therefore triple the cost. While you don't have to enable data replication in triplicate, storing just one copy is highly risky, putting you in danger of data loss.
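
As a quick worked example of that overhead, the sketch below computes the raw capacity HDFS needs for a given amount of logical data. The function names are illustrative, and the price passed to `monthly_cost` is a made-up placeholder, not a quoted rate from any provider.

```python
def hdfs_raw_capacity_tb(logical_tb, replication_factor=3):
    """Raw disk capacity needed to hold logical_tb of data in HDFS."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_tb * replication_factor


def monthly_cost(raw_tb, price_per_tb_month):
    """Storage cost for raw_tb at a given (hypothetical) price per TB-month."""
    return raw_tb * price_per_tb_month


if __name__ == "__main__":
    logical_tb = 100
    raw_tb = hdfs_raw_capacity_tb(logical_tb)          # default: three replicas
    print(raw_tb)                                       # 300
    # 20 is a placeholder price per TB-month, used only to show the 3x ratio
    print(monthly_cost(raw_tb, 20) / monthly_cost(logical_tb, 20))
```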

Amazon handles data backups on S3 itself, so you pay for only the storage that you actually need. S3 also supports storing compressed files, which can help slash your storage costs.

Another benefit of S3 is that there are multiple storage classes at different prices, depending on how you want to preserve and access your data. For example, the “Standard” class is intended for general-purpose storage, while the cheaper “Glacier” class is intended for long-term backups and archives that won’t be accessed frequently.

Bottom line: S3 is the clear winner for this one, thanks to the lower storage overhead costs.

Round 5: HDFS Versus S3: Performance

So far, the comparison between HDFS and S3 hasn't even been a competition—S3 comes out on top for scalability, durability, persistence, and price. But what about the question of performance?

The good news is that HDFS performance is excellent. Because data is stored and processed on the same machines, access and processing speed are lightning-fast.

Unfortunately, S3 doesn't perform as well as HDFS: latency is higher and data throughput is lower. However, Hadoop jobs are usually chains of MapReduce jobs whose intermediate data is stored in HDFS and on the local file system. So apart from the initial read from Amazon S3 and the final write back to it, you still get the throughput of the nodes' local disks.

We recently ran some tests with TestDFSIO, a read/write test utility for Hadoop, on a cluster of m1.xlarge instances with four ephemeral disk devices per node. The results confirmed that HDFS performs better.

Bottom line: HDFS finally wins this round thanks to its strong all-around performance.

Round 6: HDFS Versus S3: Security

Some people think that HDFS isn't secure, but that's a common misconception. Hadoop provides user authentication via Kerberos and authorization via file system permissions. Hadoop 2 takes this even further with a feature called HDFS Federation, which divides a cluster into several namespaces so that users can be restricted to only the data to which they should have access. In addition, data can be uploaded to Amazon instances securely via SSL.

S3 also has built-in security. It supports user authentication to control data access. By default, only the bucket and object owners have access to data. Further permissions can be granted to users and groups via bucket policies and Access Control Lists (ACLs). S3 also allows you to encrypt and upload data securely via SSL.
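
For a sense of what a bucket policy looks like, here is a minimal sketch that builds a read-only policy document as JSON. The bucket name, account ID, and user name are placeholders, and the helper function is hypothetical; only the policy structure (`Version`, `Statement`, `Effect`, `Principal`, `Action`, `Resource`) follows the standard AWS policy format.

```python
import json


def read_only_policy(bucket, principal_arn):
    """Build a bucket policy that lets one principal download objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOnlyAccess",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # The /* suffix applies the rule to every object in the bucket
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


if __name__ == "__main__":
    policy = read_only_policy(
        bucket="example-bucket",                                          # placeholder
        principal_arn="arn:aws:iam::123456789012:user/example-analyst",   # placeholder
    )
    print(json.dumps(policy, indent=2))
```

Anyone not named in a policy or ACL stays locked out, which matches the owner-only starting point described above.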

Bottom line: It’s a tie: both HDFS and S3 have robust security measures.

Round 7: HDFS Versus S3: Limitations

Even though HDFS can store files of any size, it has well-documented issues storing tiny files, which should be concatenated or combined into Hadoop Archives. (We've written more on the "small file problem" in our article "5 Ways to Process Small Data with Hadoop.") In addition, data saved on a certain cluster in HDFS is only available to machines on that cluster, and cannot be used by instances outside the cluster.
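
A rough calculation shows why tiny files hurt. The NameNode keeps an in-memory record for every file and every block, so the sketch below counts those records for two ways of storing the same amount of data. The figure of about 150 bytes per record is a commonly quoted rule of thumb that we assume here, not a number from this article, and the function names are illustrative.

```python
BYTES_PER_NAMENODE_OBJECT = 150  # assumption: widely quoted rule of thumb


def namenode_objects(file_sizes_mb, block_size_mb=64):
    """Count the file and block records the NameNode must hold in memory."""
    total = 0
    for size in file_sizes_mb:
        blocks = max(1, -(-size // block_size_mb))  # ceiling division, at least 1
        total += 1 + blocks                         # one file record plus its blocks
    return total


def namenode_heap_bytes(file_sizes_mb, block_size_mb=64):
    """Approximate NameNode memory needed for the given files."""
    return namenode_objects(file_sizes_mb, block_size_mb) * BYTES_PER_NAMENODE_OBJECT


if __name__ == "__main__":
    many_small = [1] * 1000      # 1,000 files of 1 MB each
    one_large = [1000]           # the same 1,000 MB in a single file
    print(namenode_objects(many_small), namenode_heap_bytes(many_small))
    print(namenode_objects(one_large), namenode_heap_bytes(one_large))
```

Under these assumptions the same data costs the NameNode 2,000 records as small files but only 17 as one large file, which is why concatenating small files helps.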

That's not the case with S3: data is independent of Hadoop clusters and can be processed by many clusters simultaneously. However, files on S3 have several limitations of their own. A single upload request can carry at most 5 gigabytes, so larger files have to be sent as multipart uploads. In addition, S3 is an object store rather than a file system, so seek-heavy Hadoop storage formats (such as Parquet or ORC) cannot be read from local disk blocks as they are on HDFS. Hadoop needs to access particular bytes in these files, and on S3 each such access becomes a ranged HTTP request, which adds latency.

Bottom line: Another tie: both options have limitations that you should know before choosing between HDFS and S3.

HDFS vs. S3: Who Wins?

With better scalability, higher durability, built-in persistence, and lower prices, S3 is tonight's winner! Still, HDFS comes away with a couple of important consolation prizes. For better performance and fewer restrictions on how large files and seek-heavy storage formats are handled, HDFS is the way to go.

But whether you use HDFS or Amazon S3, you need a mature, feature-rich data integration platform that can help you integrate your data storage with your data destinations. The good news is that you don’t have to look far. Integrat.io is a powerful, highly secure ETL solution that makes it easy to build data pipelines between your sources and your cloud data warehouse or data lake.

Thanks to Integrat.io’s no-code, drag-and-drop user interface, it’s never been simpler for anyone in your organization to design and deploy sophisticated data integration pipelines. What’s more, Integrat.io includes more than 140 pre-built connectors to join virtually any source and destination.

Ready to learn how Integrat.io can help your organization get better, smarter use from your data? Get in touch with our team today for a chat about your business needs and objectives, or to start your 7-day pilot of the Integrat.io platform. You can also read about more ETL and data integration topics on the Integrat.io blog.
