Choosing the right data warehouse is a critical component of your general data and analytic business needs. One of the biggest questions that businesses ask when choosing their data warehouse providers is this: Should you use Snowflake, Amazon RedShift, or Google's BigQuery data warehouse for your business needs?
Both Amazon and Google have impressive data warehouses with RedShift and BigQuery. Each of these solutions can run analytics at-scale rapidly. We're not comparing apples and oranges here. This is apples to apples. But, there are real use cases that each of them excel at, and both solutions can be valuable depending upon your business' situational needs.
At Integrate.io, we support both solutions. So, this post will act as a guide for businesses looking to understand which data warehouse is best suited for their particular workflows and projects.
Enjoying This Article?
Receive great content weekly with the Integrate.io Newsletter!
Table of contents
- Quick Notes
- What is a Data Warehouse
- Understanding Data Processing Systems
- What is RedShift?
- What is BigQuery?
- RedShift vs. BigQuery Comparison
- Final Thoughts
- Integrate.io Can Help Manage Your Data
Both Amazon RedShift and Google BigQuery are effective tools for your business. Choosing between the two depends on your business model and preferences.
You could get better results with RedShift if:
- Your business uses data warehouses for everyday operations
- Your business has consistently predictable workloads and operations
- Your business needs a simpler solution in terms of cost and management
- Your workflow is not based on Other Google technologies
You could see better results with BigQuery if:
- You use data warehouses for data mining operations
- Data warehousing is not a big part of your daily operations
- Your business takes on a lot of projects with wildly different conditions
- Your business changes operations often, making workflows less consistent
- You already rely on other Google platforms for business processes
Every business is different, and you need a more detailed look into your operations and the features of both options. Read more to learn about the details of each platform and how they can work for your business.
3 Key Differences of RedShift vs. BigQuery
- Amazon RedShift is provisioned on clusters and nodes. Google BigQuery is serverless.
- RedShift supports 1,600 columns in a single table, BigQuery supports 10,000 columns.
- RedShift requires periodic management tasks like vacuuming tables, BigQuery has automatic management.
What is a Data Warehouse?
Data warehouses (sometimes called columnar storage solutions) are storage facilities for your company’s business intelligence. Both RedShift and BigQuery are data warehouses. You can put all of your data from your blended tech stack into one of these warehouses and start to run analytics on it to help you make critical business decisions, forecast trends, budget, and other critical business processes.
Trend analysis is a typical data warehouse use case.. Businesses push all of their tech stack data (e.g., customer service, marketing, sales, HR, etc.) into the warehouse to run analytic workloads.
Data Warehouse Use Case Example
Example: A business may want to know more about their sales leads. This will help them better understand their customers and personalize sales pitches and content delivery. To do this, that business can connect their Salesforce data with a data warehouse and run a query to discover which leads are the most valuable and which ones are most likely to churn.
Other Data Warehouse Features
Beyond columnar storage, data warehouses like RedShift and BigQuery have Massively Parallel Processing (or MPP.) This lets them distribute query requests across multiple servers to accelerate processing. So, multiple processors — each with their own memory and operating system — will handle specific segments of the query.
To really understand why data warehouses are valuable for analytic workloads, you need to understand the differences between Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP) data processing systems.
Understanding Data Processing Systems
Let's quickly address the differences between OLTP and OLAP data processing systems.
Online Transaction Processing (OLTP)
OLTP (or Online Transaction Processing) is what most businesses use for processing transactions during day-to-day operations (think ATMs, retail sales systems, text messaging, etc.) And, we've all been using OLTP for over 40 years (it's still mind-boggling that SQL was released in the early 70s). OLTP stores each row in a table as an object.
OLTP's primary goal is data processing. It's great at rapid processing that maintains data integrity over multiple sequences.
OLTP Example Use Case
In this example, two people withdraw money from the same online bank account at precisely the same moment. OLTP will take the first authorized user and process that transaction. And, it will ensure that neither user is able to withdraw more money than is present in the bank account — even if they both start the operation simultaneously.
To do this, OLTP runs checks against every row in the query. This ability to perform ACID (Atomicity, Consistency, Isolation, Durability) transactions means that OLTP is extremely useful for ensuring data validity in the case of errors or outages.
Online Analytic Processing (OLAP)
OLAP (or Online Analytic Processing) is what data warehouses use to run queries. OLAP stores each column as an object. So, it's great at crawling through massive data sets to find trends. And, OLAP can skip over pieces of data to find the exact data you need to aggregate.
OLAP Example Use Case
Let's say you wanted to run a query (say, finding every revision on Wikipedia) on an OLTP database. That OLTP database would have to access every single field in every row to run that process. With OLAP, it can utilize columns to access only the fields you need, saving you an absolutely enormous amount of computing power and time.
In fact, OLAP is so fast at analytic processing, that the majority of businesses utilizing data warehouses look for sub-10 speeds (i.e. under 10 seconds). There are certainly use cases where the processing could take an hour or more, but you would be dealing with massive amounts of data at that point and hyper-complex schemas.
Learn more about OLTP and OLAP
What is RedShift?
RedShift is Amazon's data warehouse, and it's part of Amazon's massive overall cloud architecture, AWS.
Amazon acquired the source code for RedShift from ParAccel, who was developing ParAccel Analytic Database — a PostgreSQL-based database that utilized columnar data organization. So, RedShift is an MPP data warehouse that's built on a PostgreSQL fork.
While RedShift shares many commonalities with PostgreSQL (such as its relational qualities), it also is unique in that it's column structure. It doesn't support indexes, and uses distribution styles and keys for data organization. Amazon also has a unique query execution engine for RedShift that differs from PostgreSQL.
The important thing to note about RedShift being built on top of a PostgreSQL fork is that it maintains some of its transactional qualities — making it a hybrid database of sorts. RedShift can still roll-back on transactions, which is a semi-unique feature in the data warehouse market.
For more information on Integrate.io's native RedShift connector, visit our integration page.
What is BigQuery?
BigQuery is Google's data warehouse, and it's part of Google's massive overall cloud architecture, Google Cloud.
BigQuery was one of the first major data warehouses on the market following C-Store and Monet DB. To function, BigQuery executes Dremel (a query engine developed by Google for read-only nested data that supports an SQL-like syntax) over a REST interface.
Google defines Dremel as:
"Dremel is a query service that allows you to run SQL-like queries against very, very large data sets and get accurate results in mere seconds."
When BigQuery first launched, it had strictly maintained Dremel's strange hybrid SQL language — which was awkward at best. Now, it supports standard SQL language.
Google has some unique technology that powers BigQuery operations. Here's a brief overview of a typical job execution:
- Borg (Google's large-scale cluster management) allocates resources to Dremel jobs (which are typically executed over Google's Web UI or REST.)
- Colossus (Google's planet-scale storage system) provides the data to each Dremel job.
- Capacitor (Google's columnar storage format) organizes and compresses the data being pulled for the Dremel job.
- Juniper (Google's inner data network) translates and helps Dremel jobs read data on the Colossus system.
For more information on Integrate.io's native Google BigQuery connector, visit our Integration page.
RedShift vs. BigQuery Comparison
To determine which platform is better for your business, compare their features based on what is relevant to your business. This comparison includes pricing, performance, and manageability. Your business may prioritize these differently than other businesses, which will change the outcome of the comparison.
Price: RedShift vs. BigQuery
RedShift's pricing model is extremely simple. For the purposes of this comparison, we're not going to dive into RedShift Spectrum pricing, but you can check here for those details.
RedShift Spectrum is not included in this analysis because it is a premium product that has additional functionality. It does not compare well with any of the offerings the BigQuery has and is only for businesses with a specific need for the functionality of RedShift Spectrum.
RedShift Spectrum lets you run RedShift queries directly against Amazon S3 storage, which is useful for tapping into your data lakes if you use Amazon simple storage for your business needs. If you do, then RedShift Spectrum is almost automatically the better option for your business.
With RedShift, you can choose from either Dense Compute or the large Dense Storage. The cheapest node you can spin up will cost you $0.25 per/hour, and it's 160GB with a dc2.large node. Dense Storage runs at $0.425 per TB per hour. This cost covers both storage and processing. RedShift So, the lowest price you can get on RedShift is $306 per TB per month. And, you can pay upfront for massive discounts.
This makes RedShift interesting to work with. If you can calculate your run-times and how often you'll need to spin up each node, you can cut costs dramatically — especially if you pay upfront. Since most businesses aren't going to be constantly running their RedShift nodes, getting granular is usually in your best interest.
For example, you may only run RedShift during the day when people are interacting with your stack or service. If that's the case, you can adjust your upfront buying habits to reflect that behavior.
BigQuery's pricing is much more complicated. On the surface, BigQuery looks cheaper. Storage costs $20 per TB per month, a good $286 cheaper than RedShift. But, BigQuery charges separately for storage and querying. Queries cost $5/TB. So, while storage is cheaper, query costs can add up quickly.
There are some pros and cons to this method. Really, BigQuery is perfect for a certain type of customer. Let's say your business deals with spiky workloads. You run rapid queries a few times a day. BigQuery would be a far better option since you have to pay by the hour for RedShift. BigQuery may also be the best solution for data scientists running ML or data mining operations — since you're dealing with extremely large, spikey workloads.
BigQuery costs $20 per TB per month for the storage line and $5 per TB processed on that storage line.
RedShift costs $306 per TB per month for storage AND unlimited processing on that storage.
There isn't a winner here. RedShift is more economical for everyday data warehouse operations for most businesses. But, BigQuery is better for businesses looking to do data mining or those who deal with extremely variant workloads.
Pricing Choice Examples
It can be helpful to see how these two options compare in real-world scenarios. Here are two examples based on real w-world use cases.
Let's say that you only run queries around 5% of your day. In this case, BigQuery is probably going to be more cost-effective since you're paying for query processing on-demand. Since you're paying $5/TB of data processed, you may only process three 100GB chunks during a day. This would cost you $1.50 plus the $0.70 for storage. You see this all the time with businesses that are mainly using their data warehouse to perform data mining jobs in chunks. And, this also makes BigQuery valuable for data scientists who are running jobs a few times a day.
Your business wants a day-to-day warehouse to help with your sales or marketing stack. In this case, you need constant run-time and the ability to perform queries hundreds or thousands of times per day. RedShift is probably going to be cheaper since you aren't going to be charged for each of those queries. So, let's say each of those hundreds of queries processes 50GB. You would be paying $5/TB with BigQuery, and your costs would add up rapidly. With RedShift, you're simply charged for how long you use your nodes.
Performance: RedShift vs. BigQuery
Performance is tricky when it comes to RedShift vs. BigQuery. Since BigQuery simply abstracts prices based on how much data you process, you're not locked into a specific resource when you run a query. RedShift, on the other hand, is limited by the node you're running. But, that's not the only factor that goes into query performance.
The size of your data table, schema complexity, and the number of concurrent queries (50 is the max for both) that you're running also make a massive difference. There have been plenty of benchmarks comparing the two over the years. But, none of those benchmarks are particularly helpful in a broad sense.
To get a sense of how absurd some of the benchmark wars are between BigQuery and Amazon, here is a list of the drama.
- Google presented a TPC-H benchmark at CloudAir in San Francisco in 2016 that showed BigQuery outperforming Amazon (they decided to only use one of the performance metrics instead of all 26.)
- Amazon (very sarcastically) rebutted claiming that Google was cherry picking the query and ran a similar TPC-H benchmark that showed RedShift outperforming BigQuery on almost all tests (they conveniently decided to use an 8-node DC1.8XL which runs at about $20k a month)
- An independent researcher decided to run a similar test on taxi cab data, which showed that BigQuery was 43x faster than RedShift (the research is flawed due to node selection and query type.)
- Then, more back-and-forth between the two commenced.
- At this point, about 500 private companies publish their own benchmarks to cherry pick the results they need to sling their products.
That's the unfortunate state of industry benchmarks. We could spin up a query and run a benchmark, but it would have its own issues with generalizability. And, there are certainly cases where both solutions outperform each other. The complexity of the schema, joins, resources, tables, etc. are too diverse to give you a grounded answer on benchmark performance.
Performance Benefits of Each Platform
In our experience with clients, RedShift is great at handling everyday business processes. This means spinning a node during work hours for BI tools and interfaces. It's less expensive, has plenty of power to handle semi-complex schemas, and it's easy-to-use.
BigQuery is great at handling niche business workloads that query big chunks in a small timeframe and for data scientists and ML/data mining.
In many cases, the difference between the two is going to depend upon your RedShift resources. So, if you're paying for a single dc2.large node, BigQuery is most likely going to outperform RedShift. But, if you're spinning up an expensive 8-node DC1.8XL, RedShift is probably going to outperform BigQuery.
In short, the details make the difference in which one you should choose based on performance.
Manageability: RedShift vs. BigQuery
When we start to talk about manageability, things, again, get complex. The vast array of features provided by both RedShift and BigQuery make extrapolating ease-of-use incredibly complicated.
We're going to focus on four key layers of manageability. But, there are certainly additional variables (like a million of them) to take into account.
Here's what we'll cover:
- Data types/updates and deletes
Data Types/Updates and Deletes
RedShift supports standard SQL data types, and BigQuery works with some standard SQL data types and a small range of sub-standard SQL. One of the biggest benefits of BigQuery is that it treats nested data classes as first-class citizens due to its Dremel capabilities. With RedShift, you have to flatten out your data before running a query.
Both of them can handle updates and deletes when something goes wrong in the query. Since BigQuery and RedShift are append-only, many assume they can't do updates and deletes. They can. On BigQuery, the update and delete process exists, but it's relatively expensive, and your options are limited. So, it's not a widely used feature. With RedShift, you can reclaim tables with Postgre Vacuuming (which has its own host of complications) so update and delete support is typically better with RedShift.
Also, RedShift gives users the ability to roll back on transactions, which BigQuery doesn't.
Out-of-the-box, BigQuery is much simpler to use than RedShift. You don't have to perform many tweaks, cluster management is a breeze, and the complexities of database config, etc. are handled by BigQuery. That being said, Integrate.io abstracts the complexities of RedShift away from users with easy-to-perform workflows and integrations, so RedShift doesn't have to be difficult to use.
When it comes to security, both systems are comparable. RedShift uses Amazon IAM for identity, and BigQuery uses Google Cloud IAM. Both services work perfectly for almost all business scenarios. Google does have great B2B identity management with OAuth, so you can give identity controls to 3rd parties without introducing them into your entire ecosystem.
Both Google and Amazon (unsurprisingly) have a wealth of integrations available. Almost every major BI and data analysis tool runs perfectly well with both warehouses. We won't go in-depth with this section. Sure, RedShift is built on a PostgreSQL fork, so it originally had more native integrations, but the playing field has leveled simply due to the sheer volume of warehouse transactions that Google processes (BI tools don't want to miss out on the revenue stream.)
Both BigQuery and RedShift are incredible data warehouse systems that can help businesses redefine their daily workflows. There are some differences, but there are far more similarities.
Probably the biggest consideration for most businesses is price. RedShift is a little easier to forecast pricing due to its on-demand, by-the-hour nature. But, in many business scenarios, BigQuery's $5/TB query cost may make more sense.
Although price is important, the other factors should not be ignored. The small details make a big difference in which system can work best for your needs. Deciphering which is best can be complicated. Fortunately, we offer IT services that can help you decide.
Integrate.io Can Help Manage Your Data
Enjoying This Article?
Receive great content weekly with the Integrate.io Newsletter!
Integrate.io can help you manage your data. Our cloud-based ETL solution can integrate with both systems, giving you the advantages that you need regardless of which system you choose. However, we can also look at your current systems and help you identify the key differences between the two that make one a better choice over the other.
Integrate.io integrates with RedShift and BigQuery, making it easy to clean, simplify, and organize your data. Contact Integrate.io to schedule a demo and to learn more about how integration with a data warehouse makes your business more productive.