Table of Contents
Here at Integrate.io, we talk to companies every day who are in the early phases of building out their data infrastructure. A lot of times, these conversations include the topic of which technology to pick for which job. For example, we often get the question “Apache Spark vs. Amazon Redshift: Which one is better?”, or “Should we be using Redshift or Spark?”
In fact, Spark and Redshift are two very different technologies. It’s not an either/or proposition—it’s more like a “When do I use what?” question. In this post, I’ll lay out some of the differences and when to use each one.
As of this writing, if you look under the hood of the hottest tech startups in Silicon Valley, you’ll likely find both Spark and Redshift. Spark is getting a little bit more attention these days because it’s a new shiny toy, but the two of them cover different use cases (i.e. the “dishwasher vs. fridge” analogy, per Ricardo Vladimiro).
Below, we’ll go over the question of Spark vs. Redshift performance in terms of architecture and engineering, among others.
Apache Spark vs. Amazon Redshift: What Are They?
Apache Spark is a data processing engine. The Spark project comes with a general execution engine (Spark Core), on top of which all other functionality is built. With Spark you can:
- Process batch and streaming workloads in real-time.
- Write applications in Java, Scala, Python, and R.
- Use pre-built libraries (such as Spark SQL and MLlib) for creating said apps.
People are excited about Spark for three reasons:
- Spark is fast because it distributes data across a cluster and processes that data in parallel. It tries to process data in memory, instead of shuffling things in and out of disk (like e.g. MapReduce does).
- Spark is easy because it has a high level of abstraction, letting you write applications with fewer lines of code. Plus, Scala and R are attractive languages for data manipulation.
- Spark is extensible via the pre-built libraries, e.g. for machine learning, streaming apps, or data ingestion. These libraries are either part of Spark or third-party projects
In short, the promise of Spark is to speed up development, make applications more portable and extensible, and make the actual application run faster.
A few more noteworthy points on Spark:
- Spark is open-source, so you can download it and start running it yourself, e.g. on Amazon EC2. However, companies like Databricks (founded by the same people who created Spark) offer support plans to make your life easier.
- As Kiyoto Tamura mentions, Spark “is NOT a database”. You will need some sort of persistent data storage that Spark can pull data from (i.e. a data source, such as Amazon S3 or – hint, hint – Redshift).
- After pulling data from storage, Spark then reads it into memory to process it. Once that’s done, Spark will require a place to store and/or pass on the results (because, again, it’s not a database). This could be back into S3, and from there into e.g. Redshift (see where this answer is going?).
- You need to know how to write code to use Spark (that’s where the “write applications” part comes in). This means that the people who use Spark are typically developers.
You need to know how to write code to use Spark (the “write applications” part). So the people who use Spark are typically developers.
Amazon Redshift is an analytical database. With Redshift you can:
- Build a central data warehouse that unifies data from many sources.
- Run big, complex analytic queries against said data with SQL.
- Report and pass on the results to dashboards or other apps
Redshift is a managed service provided by Amazon. Raw data flows into Redshift (through a process called “ETL”), where it’s converted and transformed at a regular cadence (“transformation” or “aggregations”), or on an ad hoc basis (“ad hoc queries”). Another term for the loading and transforming process is the “data pipeline”.
People are excited about Redshift for three reasons:
- Redshift is fast because of its massively parallel processing (MPP) architecture that distributes and parallelizes queries. Redshift allows a high query concurrency, and it also processes queries in memory.
- Redshift is easy because it can ingest structured, semi-structured and unstructured datasets (via Amazon S3 or DynamoDB) up to a petabyte or more, and then slice ‘n dice that data any way you can imagine with SQL.
- Redshift is cheap because you can store data at price points that are basically unheard of in the world of data warehousing. For example, if you pay upfront for a three-year term, you can rent a dc2.large node with a capacity of 5 terabytes for $2,465.
In short, the promise of Amazon Redshift is to make data warehousing cheaper, faster, and easier. You can analyze much bigger and complex datasets than ever before, and there’s a rich ecosystem of tools that work with Redshift.
A few more noteworthy points about Redshift:
- Redshift calls itself a “fully managed service”. The “managed service” part is definitely true—Redshift is fully managed from the hardware layer down. However, the “fully” part might be a bit misleading—there are lots of knobs to turn if you want to extract the maximum performance from Redshift.
- Your Redshift comes empty by default. But there are plenty of data integration/ETL tools that allow you to quickly populate your cluster and get started analyzing and reporting data for business intelligence and analytics.
- Redshift is a database, so you can store a history of your raw data AND the results of your transformations. In April 2017, Amazon also introduced Redshift Spectrum, which enables you to run queries against data in Amazon S3 (which is a much cheaper way of storing your data).
- You need to know how to write SQL queries to use Redshift (the “run big, complex queries” part), so the user base is largely made up of data analysts and data scientists.
With Integrate.io we make it very easy to figure out what knobs to turn when using Amazon Redshift. For example, below is a screenshot from our “Cluster Health” dashboard.
The Cluster Health Dashboard helps data teams measure and improve SLAs. It does this by surfacing:
- SLA measures for any connected app, down to the individual user
- Recommendations to improve performance and optimize costs
- Top 5 slowest queries with quick links to Query Optimization Recommendations
SEE PERSONALIZED RECOMMENDATIONS FOR YOUR CLUSTER NOW
In summary, one way to think about Spark and Redshift is to distinguish them by what they are, what you do with them, how you interact with them, and who the typical user is.
Apache Spark vs. Amazon Redshift: Data Architecture
In very simple terms, you can build an application with Spark, and then use Redshift both as a source and a destination for data.
Why would you want to do that? A key reason is how Spark and Redshift each process data, and how much time it takes to produce a result:
- With Spark, you can do real-time stream processing, i.e. you get a real-time response to events in your data streams.
- With Redshift, you can do near real-time batch operations, i.e. you ingest small batches of events from data streams, to then run your analysis to get a response to events.
Fraud detection is just one (highly simplified) example. You could build an app with Spark that detects fraud in real-time from, say, a stream of Bitcoin transactions. Given its near real-time character, Redshift wouldn’t be a great fit in this case.
But let’s say you wanted to have more signals for your fraud detection to improve prediction rates. You could load data from Spark into Redshift, and then join it with historic data on fraud patterns. Of course, you couldn’t do this in real-time—the result would come too late for you to block the transaction. Rather, you would use Spark to immediately block it and then wait for the result from Redshift to decide if you should keep blocking, send it to a human for verification, or approve it.
Related Reading: “Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning” (Amazon Big Data Blog)
You can see how the separation between “apps” and “data warehousing” we created at the start of this post is, in reality, an area that’s shifting or even merging.
To help data engineers stay on top of both the apps and the data warehouse, we’ve built a feature in Integrate.io called “App Tracing”. This feature correlates information about applications (dashboards, orchestration tools) with cluster performance data. App Tracing can answer questions like:
- Which dashboard and user is contributing to this spike in queries?
- What is the average latency of a dashboard? Of all dashboards executed by a particular user?
- What caused the spike in latency for my Airflow or Pinball tasks? Which query slowed down and why?
- Why are my Airflow or Pinball jobs failing? How can I quickly find those queries in Integrate.io?
START FREE AND IDENTIFY DATA APPS THAT ARE SLOWING DOWN YOUR CLUSTER NOW
Apache Spark vs Amazon Redshift: Data Engineering
The traditional border between developers and BI analysts/data scientists is starting to fade, which has given rise to a new occupation: data engineering. I’ll use a definition for data engineering from Maxime Beauchemin:
“In relation to previously existing roles, the data engineering field [is] a superset of business intelligence and data warehousing that brings more elements from software engineering, [and it] integrates the operation of ‘big data’ distributed systems”.
Spark is one such “big data” distributed system, and Redshift is the data warehousing part. Data engineering is the discipline that unites them both.
For example, we’ve seen more and more “code” making its way into data warehousing. Code allows you to author, schedule, and monitor data pipelines that feed into Redshift, including the transformations on the data once it sits inside your cluster. And you’ll very likely have to ingest data from Spark. The trend toward using “code” in data warehousing implies that just knowing SQL isn’t sufficient any longer—you need to know how to write code, hence the rise of the “data engineer.”
Conclusion: Apache Spark vs. Amazon Redshift
In this Spark vs. Redshift comparison, we’ve discussed:
- Use cases: Spark is intended to improve application development speed and performance, while Redshift helps crunch massive datasets more quickly and efficiently.
- Data architecture: Spark is used for real-time stream processing, while Redshift is best suited for batch operations that aren’t quite in real-time.
- Data engineering: Spark and Redshift are united by the field of “data engineering”, which encompasses data warehousing, software engineering, and distributed systems.
For your own big data architecture, you’ll likely end up using both Spark and Redshift, each one to fulfill a specific use case that it’s best suited for. That’s why we’ve created Integrate.io to help you understand exactly what’s going on in your Redshift data warehouse—automatically capturing metadata, tracking dependencies, monitoring trends over time, and much more. Get in touch with us today to start your free trial.