We are deep in the Big Data jungle. With so much hype about so many technologies, it's easy to get lost. It takes an expert with a great big machete to cut through the vines and navigate this wild territory, one who knows what all these platforms do, not to mention how to use them. So we brought in an expert. Big Data consultant David Gruzman answered some of our burning questions about which Big Data platform to use, whether streaming is a must, and what the biggest issues with the cloud are.

Spark, Impala, Tez, Hive: which ones should be used for which use cases?

Since they have a lot in common, I will try to identify the best use cases for each platform.

For Spark, the best use cases are interactive data processing and ad hoc analysis of moderate-sized data sets (up to roughly the size of the cluster's RAM). Spark's ability to reuse data in memory really shines in these use cases.
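
To make that concrete, here is a minimal PySpark sketch of in-memory reuse; the log path and filter conditions are hypothetical, not something David mentioned. The data set is cached once, and every later action runs against the in-memory copy instead of re-reading from disk.

    # Minimal PySpark sketch: cache a data set once, query it repeatedly.
    from pyspark import SparkContext

    sc = SparkContext(appName="adhoc-analysis")

    # A moderate-sized data set that fits in the cluster's RAM (hypothetical path).
    lines = sc.textFile("hdfs:///logs/2015/01/*.log")
    errors = lines.filter(lambda line: "ERROR" in line).cache()  # keep in memory

    # Repeated actions reuse the cached RDD instead of re-reading from disk.
    print(errors.count())
    print(errors.filter(lambda line: "payment" in line).count())

    sc.stop()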

Impala is the only native open-source SQL engine in the Hadoop family, so it is best used for SQL queries over big volumes. It is also capable of delivering results interactively, with excellent speed and over bigger volumes than other in-Hadoop query engines.
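
For a sense of what interactive SQL against Impala looks like from a client, here is a hedged sketch using the impyla Python package; the host, port, and page_views table are assumptions for illustration.

    # Hedged sketch: an interactive aggregation query against Impala via impyla.
    from impala.dbapi import connect

    conn = connect(host="impala-host.example.com", port=21050)
    cur = conn.cursor()

    # Impala parallelizes the scan and aggregation across the cluster
    # and returns the result interactively.
    cur.execute("""
        SELECT country, COUNT(*) AS views
        FROM page_views
        WHERE view_date = '2015-01-01'
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()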

Tez can be considered a good base for implementing computing engines like Hive and Pig. It has low-level APIs and allows many of the optimizations needed for data processing.

Hive is the most mature platform of the ones above, as well as the slowest. It can still be a good choice for heavy ETL tasks where reliability is important, for example, hourly log aggregations for advertising companies.
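
As a sketch of the kind of hourly aggregation Hive handles well, the snippet below drives a HiveQL ETL step from Python through the standard hive -e command line; the table and column names are hypothetical.

    # Hedged sketch of an hourly log-aggregation ETL step submitted to Hive.
    import subprocess

    HOURLY_AGG = """
    INSERT OVERWRITE TABLE ad_revenue_hourly PARTITION (dt='2015-01-01', hr='13')
    SELECT campaign_id, SUM(revenue) AS revenue, COUNT(*) AS impressions
    FROM raw_ad_logs
    WHERE dt = '2015-01-01' AND hr = '13'
    GROUP BY campaign_id
    """

    # Hive turns this into MapReduce jobs: slower than Spark or Impala,
    # but a good fit for reliable, batch-style ETL.
    subprocess.run(["hive", "-e", HOURLY_AGG], check=True)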

Spark seems to attract the most hype at the moment. Do you think that it’s justified, or will another platform become the new Big Data standard?

I think that it's justified. Spark has a really easy and usable interface without sacrificing performance, in terms of both latency and throughput. I believe that it will replace MapReduce in many use cases. Nonetheless, there is a lot of MapReduce code out there that isn't compatible with Spark, so the transition will take time.

Everyone wants streaming these days. Is it really necessary to have streaming Big Data processing? How complicated is it to set it up? And are organizations missing something important when they want to get rid of their batch processing in favor of streaming?

I am not sure; it's hard for me to imagine that real-time data has become that big. One thing that has become important is accessing real-time and historical data simultaneously to get a better picture of the situation: comparing what's happening now to how things looked one hour or one week ago, for example, when tracking ad revenue.

I doubt that it will be possible to get rid of batch processing in favor of real time, because batch processing is inherently more efficient. Some shallow analyses can be done in real time, while deeper analysis is left to offline batch processing, especially since it may require far more data than can be handled in real time.
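
A minimal sketch of that split, assuming a hypothetical text stream arriving on a socket: a shallow error count runs in real time with Spark Streaming, while the deeper, full-history analysis stays in separate offline batch jobs.

    # Shallow real-time analysis with Spark Streaming; deep analysis stays in batch.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="shallow-realtime")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Hypothetical source: log lines arriving on a socket.
    lines = ssc.socketTextStream("stream-host.example.com", 9999)

    # Shallow, real-time: count error lines in each micro-batch.
    lines.filter(lambda line: "ERROR" in line).count().pprint()

    # Deeper analysis (joins, full-history aggregates) is left to offline
    # batch jobs over the complete historical data set.
    ssc.start()
    ssc.awaitTermination()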

Could you tell us about ImpalaToGo and why it was created? How is it different from Presto, Vertica, and Redshift?

ImpalaToGo is a fork of Cloudera Impala, so it is also an SQL engine. It lets users enjoy Impala’s engine performance without having to manage an entire Hadoop stack. ImpalaToGo differs from other massive parallel processing (MPP) databases by being optimized for the cloud. While other databases use local drives as the main storage, ImpalaToGo uses Amazon S3 as the main storage and local drives for caching. Depending on the query, this improves performance 3–15 times.
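
As a toy illustration of that storage pattern, and not ImpalaToGo's actual implementation, the sketch below fetches an object from S3 only on a cache miss and serves warm reads from a local drive; the bucket, key, and cache path are made up.

    # Toy illustration: S3 as main storage, a local drive as the cache.
    import os
    import boto3

    CACHE_DIR = "/data/cache"
    s3 = boto3.client("s3")

    def read_with_cache(bucket, key):
        """Return a local path for the object, downloading it only on a cache miss."""
        local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
        if not os.path.exists(local_path):
            os.makedirs(CACHE_DIR, exist_ok=True)
            s3.download_file(bucket, key, local_path)  # cold read from S3
        return local_path  # warm reads come from the local drive

    path = read_with_cache("example-warehouse", "tables/page_views/part-00000.parquet")
    print(path)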

As a Hadoop consultant, what do you think is going to happen with the Hadoop skill gap in the near future?

On one hand, MapReduce and HDFS have been well absorbed by the industry. On the other hand, several newer technologies, like Spark and Tez, are more complicated internally than MapReduce, which makes troubleshooting and fine-tuning a lot more difficult. So I do not expect the skill gap to shrink much.

What problems does the cloud have at the moment?

In terms of Big Data, I see inherent problems in having data stored in one place, like Amazon S3, and processed in another, like EC2. It contradicts one of the main principles of Big Data processing: bring the code to the data, not the data to the code. I hope technologies like HGST Open Ethernet Storage Architecture and ZeroVM will help resolve this: instead of pulling all the data to a computing tier, they run the code inside the object store.

Where do you see the cloud going in 2015?

It's hard to say, but it looks like the cloud is becoming more mature and cheaper, and there are fewer and fewer reasons to have your own data center, rent racks, or buy servers.


With about 20 years of experience in the industry, David Gruzman has been working as a Hadoop and Big Data consultant for the last five years. He was deeply involved in two startups related to Big Data processing, Petascan and LiteStack. He's currently working on his latest venture, ImpalaToGo.