This article explains what Amazon Redshift is and what it is not. I hope it helps anyone who is curious about Amazon Redshift.
Redshift is a Disruptive Price Alternative in DWH
A data warehouse is a system for managing and analyzing data: a database for big data, where records are mostly appended and rarely updated. Compared with an ordinary database system such as MySQL, a data warehouse can store a huge volume of historical or archived data and is good at online analytical processing (OLAP) such as data aggregation. It is not designed for millisecond online transaction processing (OLTP), nor is it ideal for handling frequent updates. Data warehouses are used with Business Intelligence (BI) tools that support enterprise decision-making, and for optimizing systems based on historical data.
Data warehousing has long been considered a costly venture, ranging anywhere from hundreds of thousands of dollars to a few million per year. The price is often high because a data warehouse serves a variety of purposes and requires specialized hardware. However, thanks to recent advances in low-cost big data processing technology, smaller organizations such as advertising and social gaming companies are using big data analysis to create new business. Big data analysis and data warehousing often go hand in hand but are also seen as two separate services because of major price differences.
Redshift makes it possible for even start-ups and other small-scale businesses to use both without any major increase in price. This is why it is being called disruptive. Although the same kind of processing can be done on Hadoop Hive at a comparatively low cost, there are still substantial costs in hiring a Hadoop specialist or managing a large-scale server setup. With Hadoop alone, it remains very hard for small businesses to enjoy the benefits of big data analysis with any degree of ease or sufficiency.
So what exactly are the differences between Redshift and existing data warehouse tools on Hadoop?
A quick search on data warehouse pricing makes the point. Even for terabyte-scale data warehousing, price tags run into the millions, and multiple DBAs have to be brought on board to run the system. Amazon provides the same level of service for as low as US$1,000 per terabyte per year (three-year reserved instance, minimum 2 TB). Database management takes place in the Amazon cloud and is included in the cost. Plainly put, it is as simple as comparing $1 million to $1,000 (that's three whole zeroes!). In fact, I think it would be hard to find a rationale that explains this gap in price. That is why there has been so much noise about Redshift.
At Integrate.io, we ran benchmarks with the same SQL queries on Hadoop (using Hive), and the results we published show Redshift to be more than 10x better in both speed and cost, even when comparing just normal execution cost. At one point, our benchmark was the Hottest Slide on SlideShare, which shows the high level of interest that Amazon Redshift has sparked among big data specialists.
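The price comparison above can be worked through in a few lines. This is only a sketch using the round figures quoted in this article; the function names are made up for illustration, and treating the reserved-instance price as strictly linear per terabyte is an assumption, not Amazon's actual price list.

```python
# Round figures quoted in this article (not an official price list).
TRADITIONAL_COST_PER_YEAR = 1_000_000  # US$, terabyte-scale traditional warehouse
REDSHIFT_COST_PER_TB_YEAR = 1_000      # US$, three-year reserved instance
REDSHIFT_MINIMUM_TB = 2                # smallest size mentioned in the article


def redshift_yearly_cost(terabytes):
    """Yearly cost in US$, assuming linear pricing and a 2 TB minimum."""
    billable_tb = max(terabytes, REDSHIFT_MINIMUM_TB)
    return billable_tb * REDSHIFT_COST_PER_TB_YEAR


def headline_ratio():
    """The article's face-value comparison: $1 million versus $1,000."""
    return TRADITIONAL_COST_PER_YEAR // REDSHIFT_COST_PER_TB_YEAR


if __name__ == "__main__":
    print("Smallest yearly bill:", redshift_yearly_cost(1))  # billed as 2 TB
    print("Headline ratio:", headline_ratio())
```

Under these assumptions the smallest possible yearly bill is $2,000, and the headline ratio of 1,000 is where the "three whole zeroes" come from.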
Features as a Data Warehouse
First and foremost, the greatest feature of Redshift is its columnar storage technology. As the name suggests, a column-oriented data structure stores data by column, as opposed to traditional row-oriented databases. When indexes are used, traditional databases can be great at retrieving a specific row in an instant, but they are not as good at, for instance, running aggregate functions over all records, because data from columns the query does not need also has to be read. Columnar databases are ideally suited to aggregate processing where only a few specific columns need to be worked on. Plus, compression efficiency is much greater because the same values commonly repeat throughout an individual column. Redshift chooses from among seven different compression algorithms (as of February 28) for each column. It also gives you ways to detect the optimal compression method for data that is about to be loaded or is already stored. Even with big data, results are obtained with remarkable speed because only the compressed columns a query needs are retrieved.
Hadoop has supplemental modules that exploit the merits of columnar orientation (RCFile, HBase, etc.), but they present some technical barriers. In my experience, these modules are not often used and require more in-depth knowledge of Hadoop.
But the truly stand-out feature of Redshift compared with other data warehouses is its scalability, as might be expected from Amazon. With Massively Parallel Processing (MPP), processing and storage nodes can be added as data volume grows in order to preserve or improve processing speed. We verified this at Integrate.io by running benchmarks for data loading and query speed.
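To make the column-oriented layout and per-column compression described above more concrete, here is a toy sketch in Python. It is not how Redshift is implemented: the sample table is made up, and run-length encoding is used only because it is the simplest way to show why repeated values within a column compress well.

```python
from itertools import groupby

# Hypothetical sample table: (user_id, country, amount).
ROWS = [
    (1, "JP", 120),
    (2, "JP", 80),
    (3, "JP", 200),
    (4, "US", 50),
    (5, "US", 75),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column (a column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(ROWS, ["user_id", "country", "amount"])

# Row-oriented aggregate: every row tuple is visited to read one field.
row_total = sum(row[2] for row in ROWS)

# Column-oriented aggregate: only the "amount" list is read.
column_total = sum(columns["amount"])

# The "country" column repeats the same value, so it shrinks when encoded.
encoded_country = run_length_encode(columns["country"])

if __name__ == "__main__":
    print(row_total, column_total)
    print(encoded_country)
```

Both totals are the same, but the column-oriented version never touches `user_id` or `country`, and the five country values collapse into two (value, count) pairs.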
Redshift actually grew out of technology originally developed by ParAccel, a company in which Amazon has invested. MPP is still offered by ParAccel and other vendors, but it is significant that MPP is now available from AWS. The greatest feature unique to Amazon is that scale-out (increasing the number of nodes) can be accomplished in a matter of clicks! I think this is going to be a unique part of this public cloud-based service and a very hard act for any other vendor to follow. In summary, Redshift is a columnar database with good compression options and flexible scalability via MPP, with scale-out in just a few clicks. When I first heard about these features, and again after we successfully verified them, I actually got goosebumps.
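The idea behind MPP scale-out can be sketched in the same toy style: rows are spread across nodes, each node aggregates its own share, and the partial results are combined. The round-robin distribution and the function names below are assumptions chosen for illustration, and the "nodes" are plain Python lists processed one after another rather than real parallel machines.

```python
def partition(values, node_count):
    """Distribute values round-robin across node_count simulated nodes."""
    nodes = [[] for _ in range(node_count)]
    for index, value in enumerate(values):
        nodes[index % node_count].append(value)
    return nodes


def cluster_sum(values, node_count):
    """Return (total, rows handled by the busiest node).

    Each node computes a partial sum over its own share; on a real
    cluster these partial sums would be computed in parallel.
    """
    parts = partition(values, node_count)
    partial_sums = [sum(part) for part in parts]
    busiest = max(len(part) for part in parts)
    return sum(partial_sums), busiest


if __name__ == "__main__":
    data = list(range(1, 1001))
    for nodes in (1, 2, 4):
        total, busiest = cluster_sum(data, nodes)
        print(f"{nodes} node(s): total={total}, rows per node={busiest}")
```

The total never changes, while the number of rows each node has to scan drops as nodes are added. That is the intuition behind preserving query speed as data grows, setting aside the coordination overhead a real cluster would have.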
Redshift is built on PostgreSQL database technology. That means almost all PostgreSQL drivers, JDBC drivers, and compatible tools can be used. In short, once you load data into Redshift, you can easily use it from existing applications and newly created web services! Of course, this doesn't mean all PostgreSQL functions can be used (for example, only primitive data types are supported), but we can assume that the barrier to entry is significantly lower than Hadoop's.
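As an illustration of that PostgreSQL compatibility, the sketch below connects with psycopg2, a widely used third-party PostgreSQL driver for Python. The host, database, user, table, and column names are placeholders rather than real endpoints, and the helper functions are hypothetical. Port 5439 is Redshift's default. Whether a particular driver version works well with a particular cluster is something to verify for yourself.

```python
def build_dsn(host, dbname, user, password, port=5439):
    """Build a libpq-style connection string (5439 is Redshift's default port).

    Values containing spaces or quotes would need libpq quoting; this
    sketch assumes simple values.
    """
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"


def fetch_all(dsn, sql, params=None):
    """Run one query through an ordinary PostgreSQL driver and return all rows."""
    import psycopg2  # third-party driver, imported lazily so build_dsn works without it

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder endpoint and credentials: replace with your own cluster's values.
    dsn = build_dsn("example-cluster.example.com", "dev", "analyst", "secret")
    # "sales" and its columns are a hypothetical table.
    rows = fetch_all(dsn, "SELECT country, SUM(amount) FROM sales GROUP BY country")
    print(rows)
```

Nothing in the query code is specific to Redshift, which is the point: an application that already talks to PostgreSQL needs little more than a different connection string.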
The biggest point of concern is that Redshift has only just been released, so not much information about its performance is available yet. But this technology has been used and tested from the start by the world’s number one e-commerce site, namely Amazon, and was also tested by a variety of enterprises well before wide release. There are bound to be a number of issues after release as it comes into wide use, but these should settle down fairly quickly. In the benchmarks referred to earlier, we found that it took a long time to load big batches of data into Redshift, but we also found that increasing the number of nodes gave a significant increase in performance. Integrate.io is also working to develop a continuous data upload mechanism to avoid delays. Through articles like this one, Integrate.io aims to support anyone who chooses to use Redshift, as its technical details and the ways it can be used are not widely known yet.
Dividing the Work with Hadoop
Naturally, there are many areas in which Hadoop outperforms Redshift. Some advanced processing can only be done on Hadoop, such as processing the whole database record by record or running analytics that use complex machine learning. Hadoop may also be superior in cost performance for data that is not analyzed frequently (annual or monthly petabyte-scale batch processing). On the other hand, for processing data close to real time, with a short turnaround on constantly updated data (minutes old), Redshift may show big advantages over Hadoop and other data warehousing systems. Such applications are common in advertising technology, digital marketing, and social gaming analysis, to name a few.
Redshift is the Killer App for Big Data Analysis
Until now, we at Integrate.io have worked on the belief that the shortest path to bringing the benefits of big data to companies of any size lies in figuring out how to make Hadoop cheaper to use. We believed that Hadoop was the only option for the general use of big data. This viewpoint has been completely overturned by Redshift. This is a most auspicious thing. Why? Because it means that big data analytics is now open to companies of any size, easy to use, and available at a low price point. At Integrate.io, we want to help make Redshift easy to use so everyone can leverage big data in the cloud to open up the path to a better world.