Amazon Redshift’s DISTKEY and SORTKEY are powerful tools for optimizing query performance. Because Redshift is a columnar database with compressed storage, it doesn't use indexes the way transactional databases such as MySQL, Microsoft SQL Server, and PostgreSQL do. Instead, it uses DISTKEYs and SORTKEYs.
Here are four things you need to know about Amazon Redshift’s DISTKEY and SORTKEY:
- DISTKEY and SORTKEY are tools for optimizing query performance on Amazon Redshift
- Redshift is a columnar database with compressed storage and doesn’t use indexes in the same way as transactional databases.
- Choosing values for DISTKEY and SORTKEY can be challenging.
- Choosing the wrong values can impact query performance.
Choosing the values to use as your DISTKEY and SORTKEY is not as straightforward as you might think. In fact, setting a DISTKEY/SORTKEY incorrectly can negatively impact query performance. Learn more about Amazon Redshift DISTKEY and SORTKEY in the tutorial below.
Table of Contents
- Know The Data
- DISTKEY Redshift: Investigating The Query
- Adding DISTKEY and SORTKEY
- Solving The Puzzle
- Trying a Different DISTKEY and SORTKEY
Know The Data
This example uses a series of tables called system_errors#, where # is a sequence number. Each record of a table represents an error that happened on a system, with its (1) timestamp and (2) error code. Each table has 282 million rows (lots of errors!). Here is a query that needs optimizing; it gets the number of errors per error type for each time slice:
SELECT err_code, created_at, count(*) FROM system_errors1 GROUP BY created_at, err_code;
You can create a table without the DISTKEY before setting it and see how the query performs:
CREATE TABLE system_errors1 (
  -- column types are inferred from the query above; the original DDL may differ
  err_code   INTEGER,
  created_at TIMESTAMP
);
On this Redshift cluster (2-node dc1.large), the query took 20.52 seconds to execute. This isn’t too bad, considering the number of rows in the table. But if you look at the CPU usage, both compute nodes ran at up to 30% CPU utilization. You can improve this by investigating query performance.
DISTKEY Redshift: Investigating The Query
Now it’s time to check query performance by analyzing the Amazon Redshift Console. Thankfully, it offers useful graphs and metrics to analyze query performance.
Below are the "Query Execution Details" for the query:
Look at the warning sign! Something must have been wrong:
This warning occurred because rows that need to be aggregated (rows sharing the same err_code and created_at values) are spread across multiple compute nodes. Each node must aggregate its own rows first; then, the leader node has to aggregate the results again. That's why you see two "Aggregate" steps in the above screenshot. More importantly, a large amount of data was sent to the leader node across the network, which became the performance bottleneck. You can avoid this by putting all rows sharing the same err_code and created_at values on a single node. This can be done by defining the DISTKEY.
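The two-step aggregation described above can be sketched in Python. This is a simplified model, not Redshift's actual executor: each compute node first counts its own rows, then the leader merges the partial counts it receives over the network.

```python
from collections import Counter

def node_aggregate(rows):
    """Step 1: a compute node counts its own (err_code, created_at) groups."""
    counts = Counter()
    for err_code, created_at in rows:
        counts[(err_code, created_at)] += 1
    return counts

def leader_aggregate(partials):
    """Step 2: the leader merges the partial counts sent over the network."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # this network transfer is the bottleneck at scale
    return merged

# Rows of the same group live on both nodes, so each node emits a partial
# result for (1204, '2021-01-01') and the leader has to aggregate again.
node1 = [(1204, '2021-01-01'), (1204, '2021-01-01'), (1201, '2021-01-01')]
node2 = [(1204, '2021-01-01'), (1201, '2021-01-02')]

result = leader_aggregate([node_aggregate(node1), node_aggregate(node2)])
print(result[(1204, '2021-01-01')])  # 3, assembled from two partial counts
```

If all rows of a group lived on one node, step 2 would have almost nothing to merge, which is exactly what a well-chosen DISTKEY achieves.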
Adding DISTKEY and SORTKEY
To collocate all relevant rows on a single node, you can use either the column err_code or the column created_at as the DISTKEY. Since the query groups by err_code, start with a table that uses err_code as both DISTKEY and SORTKEY:
CREATE TABLE system_errors2 (
  -- same columns as system_errors1; types inferred from the query
  err_code   INTEGER,
  created_at TIMESTAMP
) DISTKEY(err_code) SORTKEY(err_code);
Take a look at how the query performs against the new table:
SELECT err_code, created_at, count(*) FROM system_errors2 GROUP BY created_at, err_code;
Shockingly, this query took 54.9 seconds! That’s 2.5 times slower than the first query against a table with no DISTKEY/SORTKEY. This new table puts all rows of an error code on the same node and stores them next to each other. Why would the query against this table be even slower than the table with no DISTKEY/SORTKEY?
Solving The Puzzle
After creating two tables, one with and one without a DISTKEY, the table with the DISTKEY was much slower than the other:
- no DISTKEY: 20.52 seconds
- err_code as DISTKEY: 54.9 seconds
Why did this happen? The query’s execution details will provide insights:
You’ll notice the long red lines. This means that the slowest node took significantly longer than the average processing time. In this case, it took four times longer than the average. The slowest node must have had more rows than the other nodes. Running the following query will tell you how many rows each err_code has:
SELECT err_code, count(*) FROM system_errors2 GROUP BY err_code;
| err_code | count |
| --- | --- |
| 1210 | 22336 |
| 10010 | 252160 |
| 1205 | 517440 |
| 1207 | 1368576 |
| 1208 | 384 |
| 10009 | 1825856 |
| 1201 | 2162176 |
| 1206 | 1229824 |
| 1213 | 8960 |
| 10006 | 4960640 |
| 10007 | 67840 |
| 1204 | 269341760 |
| 1209 | 95488 |
| 10003 | 281856 |
You can see one of the error codes (1204) has an extremely large number of rows compared to the others; it holds 95.5% of the rows in the table. Because err_code was the DISTKEY, at least 95.5% of the rows were put on one particular node. This is called skew. When such a skew occurs, the total query processing time takes much longer because the performance is capped by the slowest processing node; i.e., the work cannot be spread evenly across multiple nodes. In this (extreme) case, almost all the rows were processed by a single node. That is why the query took longer than the query made against the table without a DISTKEY.
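You can double-check that 95.5% figure directly from the counts above:

```python
# Row counts per err_code, copied from the query result above
counts = {
    1210: 22336, 10010: 252160, 1205: 517440, 1207: 1368576,
    1208: 384, 10009: 1825856, 1201: 2162176, 1206: 1229824,
    1213: 8960, 10006: 4960640, 10007: 67840, 1204: 269341760,
    1209: 95488, 10003: 281856,
}

total = sum(counts.values())             # ~282 million rows, as stated earlier
dominant_share = counts[1204] / total
print(f"err_code 1204 holds {dominant_share:.1%} of all rows")  # ~95.5%
```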
Trying a Different DISTKEY and SORTKEY
Since the values of err_code are too skewed to use as a DISTKEY, try the other column, created_at, instead:
CREATE TABLE system_errors3 (
  -- same columns as before; types inferred from the query
  err_code   INTEGER,
  created_at TIMESTAMP
) DISTKEY(created_at) SORTKEY(created_at);
The same query now takes only 8.32 seconds to return, more than six times faster than the previous query, and more than twice as fast as the very first query.
CPU utilization is also much better: 10% versus the previous 30%.
Query execution details look good as well. Skew is minimal, and there’s no warning sign for a large data transfer across the network.
If you look at the details of Hash Aggregation, you will notice that the steps are much more simplified compared to the very first query. You can no longer see the double "Aggregate."
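Why created_at distributes so much better can be illustrated with a toy hash distribution in Python (hypothetical data and a stand-in hash; Redshift's real distribution function differs): a key with a few heavily skewed values piles rows onto one node, while a high-cardinality timestamp spreads them evenly.

```python
from collections import Counter

def distribute(keys, nodes=2):
    """Assign each row to a node by hashing its DISTKEY value (toy model)."""
    load = Counter()
    for key in keys:
        load[hash(key) % nodes] += 1
    return load

# 100 rows where err_code 1204 dominates, mirroring the skew seen earlier
err_codes = [1204] * 95 + [1201, 1205, 10006, 10009, 10010]
# 100 rows with distinct timestamps (modeled here as epoch seconds)
created_at = list(range(100))

by_err_code = distribute(err_codes)    # one node ends up with 97 of 100 rows
by_timestamp = distribute(created_at)  # each node gets exactly 50 rows
```

The skewed key caps the query at the speed of the overloaded node; the timestamp key lets both nodes work in parallel on equal shares.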
Here are a few takeaways to consider:
- Pick a small number of important queries you want to optimize your databases for. You can’t optimize your table for all queries, unfortunately.
- To avoid a large data transfer over the network, define a DISTKEY.
- From the columns used in your queries, choose a column that causes the least amount of skew as the DISTKEY. A column that has many distinct values, such as timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
- Even though it will almost never be the fastest option, a table with no DISTKEY/SORTKEY is a decent all-around performer. It’s a good option to leave DISTKEY and SORTKEY undefined until you really understand the nature of your data and queries.
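The third takeaway can be turned into a quick, hypothetical helper: given approximate per-value row counts for each candidate column (gathered with one COUNT(*) ... GROUP BY query per column), pick the column whose most frequent value owns the smallest share of rows. The function names and sample counts below are illustrative, not part of any API.

```python
def skew(value_counts):
    """Share of rows owned by the most frequent value (closer to 1.0 = worse)."""
    total = sum(value_counts.values())
    return max(value_counts.values()) / total

def best_distkey(candidates):
    """Pick the candidate column with the lowest skew.

    `candidates` maps column name -> {value: row_count}.
    """
    return min(candidates, key=lambda col: skew(candidates[col]))

# Hypothetical counts for two candidate columns:
candidates = {
    "err_code":   {1204: 9500, 1201: 300, 1205: 200},  # heavily skewed
    "created_at": {t: 10 for t in range(1000)},        # evenly spread
}
print(best_distkey(candidates))  # created_at
```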
How Integrate.io Streamlines Your Data Pipeline
Integrate.io provides continuous, real-time database replication to Amazon Redshift and other data warehouses. It offers a reliable, powerful way to simplify your data analytics pipeline in a single interface without manual scripting. That will save time for your data team and ensure you generate the results your business needs.
In addition to real-time database replication, Integrate.io’s native ETL connectors let you move different data types from disparate sources into this data warehouse. You can also transfer datasets to and from a supported location like Oracle, AWS RDS, and Snowflake without worrying about data integration tasks like primary keys, metadata, schemas, data distribution, and APIs.
Integrate.io is also capable of ELT, Reverse ETL, and CDC and provides data warehouse insights and advanced data observability. This no-code data pipeline platform is the best way to move data from data sources like a transactional or relational database to a target location because it requires a limited skill set and little data engineering knowledge.
Integrate.io’s real-time data replication to Amazon Redshift is just one of the many features of this platform. Schedule a demo now to try Integrate.io for your business use case.