A first look at the new RA3 Amazon Redshift node type
Table of Contents
Today we’re really excited to be writing about the launch of the new Amazon Redshift RA3 instance type. The launch of this new node type is very significant for several reasons:
- With 64Tb of storage per node, this cluster type effectively separates compute from storage. On RA3 clusters, adding and removing nodes will typically be done only when more computing power is needed (CPU/Memory/IO). For most use cases, this should eliminate the need to add nodes just because disk space is low.
- RA3 nodes have been optimized for fast storage I/O in a number of ways, including local caching. The nodes also include a new type block-level caching that prioritizes frequently-accessed data based on query access patterns at the block level.
- RA3 nodes have 5x the network bandwidth compared to previous generation instances. This should significantly improve the performance of COPYs, INSERTs, and queries that require large amounts of data to be redistributed between nodes.
This is the first feature where Amazon Redshift can credibly claim “separation of storage and compute”. And because a ra3.16xlarge cluster must have at least two nodes, the minimum cluster size is a whopping 128TB. This number is so high that it effectively makes storage a non-issue.
In this post, we’re going to explore the performance of the new ra3.16xlarge instance type and compare it to the next largest instance type, the ds2.8xlarge. Since the ra3.16xlarge is significantly larger than the ds2.8xlarge, we’re going to compare a 2-node ra3.16xlarge cluster against a 4-node ds2.8xlarge cluster to see how it stacks up.
On paper, the ra3.16xlarge nodes are around 1.5 times larger than ds2.8xlarge nodes in terms of CPU and Memory, 2.5 times larger in terms of I/O performance, and 4 times larger in terms of storage capacity:
On to the tests!
A reported improvement for the RA3 instance type is a bigger pipe for moving data into and out of Redshift. Since loading data from a storage layer like S3 or DynamoDB to compute is a common workflow, we wanted to test this transfer speed.
We used Redshift’s COPY command to read and load data files from S3, which had been unloaded from a source table with 3.8 billion rows. The target table was dropped and recreated between each copy. We followed best practices for loading data into Redshift, such as using a manifest file to define the data files being loaded and defining a distribution style on the target table.
While the DS2 cluster averaged 2h 9m 47s to COPY data from S3 to Redshift, the RS3 cluster performed the same operation at an average of 1h 8m 21s:
The test demonstrated that improved network I/O on the ra3.16xlarge cluster loaded identical data nearly 2x faster than the ds2.8xlarge cluster.
One of the things we were particularly interested in benchmarking is the advertised benefits of improved I/O, both in terms of network and storage. In our experience, I/O is most often the cause of slow query performance. These benefits should supposedly improve the performance not only of getting data into and out of Redshift from S3, but also the performance of transferring data between nodes (for example, when data needs to be redistributed for queries that join on non-distkey table columns), and of storing intermediate results during query execution.
To compare relative I/O performance, we looked at the execution time of a deep copy of a large table to a destination table that uses a different distkey. This should force Redshift to redistribute the data between the nodes over the network, as well as exercise the disk I/O for reads and writes.
For this test, we used a 244 Gb test table consisting of 3.8 billion rows which was distributed fairly evenly using a DISTKEY. We created an empty target table which differed from the source table in that it used DISTSTYLE of EVEN, and then did an INSERT INTO … SELECT * from the source table into the target table several times. To make it easy to track the performance of the SQL queries, we annotated each query with the task benchmark-deep-copy and then used the Integrate.io dashboard to view the performance on each cluster for all SQL queries in that task.
The test showed that the DS2 cluster performed the deep copy on average in about 1h 58m 36s:
while the RA3 cluster performed almost twice the number of copies in the same amount of time, clocking in at 1h 2m 55s on average per copy:
This indicated an improvement of almost 2x in performance for queries which are heavily in network and disk I/O.
Benchmarks are great to get a rough sense of how a system might perform in the real-world, but all benchmarks have their limitations. So in the end, the best way to evaluate performance is with real-world code running on real-world data.
To compare the 2-node ra3.16xlarge and 4-node ds2.8xlarge clusters, we setup our internal data pipeline for each cluster. We copied a large dataset into the ds2.8xlarge, paused all loads so the cluster data would remain fixed, and then snapshotted that cluster and restored it to a 2-node ra3.16xlarge cluster.
We then started our data product pipeline and fired up our Integrate.io dashboard to quantitatively monitor performance and characteristics of the two clusters.
Data Product Pipeline Characteristics
Our primary Redshift data product pipeline consists of batch ETL jobs that reduce raw data loaded from S3 (aka “ELT”). The ETL transformations start with around 50 primary tables, and go through several transformations to produce around 30 downstream tables. These 30 tables are then combined and loaded into serving databases (such as Elasticsearch) for serving. So this all translates to a heavy read/write set of ETL jobs, combined with regular reads to load the data into external databases.
While our pipeline also includes some external jobs that occur in platforms outside of Redshift, we’ve excluded the performance of those jobs from this post, since it is not relevant to the ra3.16xlarge to ds2.8xlarge comparison.
Overall Query Performance
To start with, we looked at the overall query performance of our pipeline running on the identical data on the ds2.8xlarge cluster and the ra3.16xlarge cluster. Our Integrate.io dashboards reported a P95 latency of 1.1 seconds and a P99 latency of 34.2 seconds for the ds2.8xlarge cluster:
The ra3.16xlarge cluster showed a noticeable improved overall performance:
P95 latency was 36% faster at 0.7s, and P99 latency was 19% faster–a significant improvement.
Digging into the details
Viewing our query pipeline at a high-level told us that throughput had on average improved significantly on the ra3.16xlarge cluster. But the performance of data product pipelines is often limited by the worst-performing queries in the pipeline. So next we looked at the performance of the slowest queries in the clusters.
Since we tag all queries in our data pipeline with SQL query annotations, it is trivial to quickly identify the steps in our pipeline that are slowest by plotting max query execution time in a given time range and grouping by the SQL query annotation:
Each series in this report corresponds to a task (typically one or more SQL queries or transactions) which runs as part of an ETL DAG (in this case, an internal transformation process we refer to as sheperd).
The slowest task on both clusters in this time range was get_samples-query, which is a fairly complex SQL transformation that joins, processes, and aggregates 11 tables. On the 4-node ds2.8xlarge, this task took on average 38 minutes and 51 seconds:
This same task running on the 2-node ra3.16xlarge took on average 32 minutes and 15 seconds, an 18% improvement!
This result is pretty exciting: For roughly the same price as a larger ds2.8xlarge cluster, we can get a significant boost in data product pipeline performance, while getting twice the storage capacity.
Moving on to the next-slowest-query in our pipeline, we saw average query execution improve from 2 minutes on the ds2.8xlarge down to 1 minute and 20 seconds on the ra3.16xlarge–a 33% improvement!
Separation of Storage and Compute
The launch of the new RA3 instances addresses one of the biggest pain-points we’ve seen our customers have with administering an Amazon Redshift cluster: managing storage. While seemingly straightforward, dealing with storage in Redshift causes several headaches:
- Having to add more CPU and Memory (i.e. nodes) just to handle the storage of more data, resulting in wasted resources;
- Having to go through the time-consuming process of determining which large tables aren’t actually being used by your data products so you can remove these “cold” tables;
- Having to run a cluster that is larger than necessary just to handle the temporary intermediate storage required by a few very large SQL queries
We’ve seen variations of these problems over and over with our customers, and expect to see this new RA3 instance type greatly reduce or eliminate the need to scale Redshift clusters just to add storage. In practice, we expect that workloads will likely always become CPU, Memory, or I/O bound before they become storage bound, making the decision to add a node (vs scale back or optimize the data product pipeline) much simpler.
And then there’s also Amazon Redshift Spectrum, to join data in your RA3 instance with data in S3 as part of your data lake architecture, to independently scale storage and compute.
The performance boost of this new node type (a big part of which comes from improvements in network and storage I/O) gives RA3 a significantly better bang-for-the-buck compared to previous generation clusters.
We’ve also received confirmation from AWS that they will be launching another RA3 instance type, ra3.4xlarge, so you’ll be able to get all the benefits of this node type even if your workload doesn’t require quite as much horsepower.
We highly recommend giving this new node type a try–we’re planning on moving our workloads to it!
As always, we’d love your feedback on our results and to hear your experiences with the new RA3 node type. Feel free to get in touch directly, or join our Redshift community on Slack.