Amazon Redshift is a fully managed, cost-effective petabyte-scale cloud-based data warehouse service – allowing businesses to store and analyze large scale data sets and perform massive database migrations. Redshift was born out of PostgreSQL 8.0.2 and is a column-oriented database management system, enabling fast querying and efficient performance.
Redshift allows one to start small for just $0.25 per hour with no commitments and scale-out to petabytes of data for $1,000 per terabyte per year, which is roughly 10% of a traditional data warehouse TCO.
It is capable of very high performance on datasets ranging from a gigabyte to a petabyte. It uses modern data warehousing architectural approaches like columnar storage, data compression, and zone mapping to reduce the I/O needed to perform queries. It is offering significantly better performance than an OLTP data warehouse optimized for the same use case.
Even though Redshift inherits from PostgreSQL, they have several significant differences that one must be aware of when designing and developing data warehouse applications. The following post will outline some of these and emphasize the best practices in this regard.
Table of Contents
PostgreSQL vs Redshift
PostgreSQL is a traditional row-oriented relational database. Redshift is a columnar database specifically designed for online analytical processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Data is stored on disk per column, resulting in fast operation performance. Additionally, Redshift is compressed, complementing the columnar concept. Compression is a column-level operation that reduces data size when stored – this conserves storage space, reducing the size of data read from storage, reducing the amount of disk I/O – invariably improving query performance. Here’s more information on the PostgreSQL crossovers with Redshift and RDS.
Many Redshift users report poor query performance and high query execution times. However, most of these problems are due to users writing queries geared towards a traditional relational database, which Redshift is not.
A few tips for writing performant queries:
- Avoid using select *. Include only the columns you specifically need.
- Utilize the CASE Expression to perform complex aggregations instead of selecting from the same table repeatedly.
- Use predicates to restrict the dataset.
- Consider using INNER joins as they are more efficient than LEFT joins.
- Stay away from UNION whenever possible.
- Specify multiple levels of conditionals when you can.
- Utilize EXPLAIN to show the query execution plan and cost.
Check out our Query Insights post for more optimization tips.
The Vacuuming process is essential for the health and maintenance of your AWS Redshift cluster – it also affects the performance of queries, among other things. Amazon Redshift does not reclaim free space automatically. When one deletes or updates data from the table, Redshift logically deletes the records by tagging them for deletion. The Vacuum command reclaims disk space occupied by rows tagged for deletion by the previous UPDATE and DELETE operations. This command also sorts the data within the tables when specified.
The Vacuuming process should happen during periods of inactivity or at least minimal activity on your cluster. So query planning is essential. To find out the tables which require vacuuming, run the following query, which returns all the tables with unsorted data of above 10%.
SELECT “schema” + ‘.’ + “table” FROM svv_table_info where unsorted > 10
The following is the vacuum command syntax:
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ]
[ [ table_name ] [ TO threshold PERCENT ] ];
Please see the following for a more in-depth guide for vacuuming and analyzing tables.
Amazon Redshift stores one’s data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer then utilizes the sort order when calculating optimal query plans. Query performance is improved when using sort keys correctly, as it enables the query optimizer to read fewer chunks of data filtering out the majority of it.
A sort key is a field in your Redshift database table that determines the disk’s physical storage order—rows stored by the slice are in sort key order.
Amazon Redshift keeps your information on disk in sorted order according to the sort key one defines while creating the Redshift table.
When one creates a table, it is possible to specify one or more columns as the sort key. Amazon Redshift stores data on disk in sorted order according to the sort key. Following column types are the best candidates for Sort Key:
- Timestamp Columns: for frequent data queries, specify the timestamp column as the leading column for the sort key.
- Range Columns: If performing regular range equality or filtering on one column, select that column as the sort key.
- Join Column: If joining a table regularly, specify the join column as both the sort key and the distribution key.
Compression within Redshift is a column-level operation that reduces the size of data in storage; this conserves storage space – minimizing the size of data reads from storage, optimizing the amount of disk I/O, and improving query performance. Redshift provides a handy tool to determine the best encoding for each column within a table. The output will recommend compression for each column.
First, load the data to a test table test_table (or use an existing table) and then execute the following command:
ANALYZE COMPRESSION test_table;
It’s worth noting that Amazon Redshift introduced Zstandard compression In January 2017. The compression algorithm is the new standard and works across all Amazon Redshift data types.
The following is a cheat sheet to follow for all things compression within Redshift.
- Analyze compression on the selected table by using ANALYZE COMPRESSION.
- If ZSTD emerges as the best candidate for columns, consider benchmarking sample data (e.g., six months) with the original and ZSTD column compression.
- If performance is better, deep copy data from the original atomic events table to a new atomic. It may be necessary to resize Redshift or free up additional disk space before performing this action.
- Verify that the two tables contain identical data by comparing a total row count and a row count per day.
- Drop the original table plus any references.
- ALTER the new table to atomic events. Ensure that the latest table has an identical owner.
As discussed in the post, introducing best practices, technical hygiene, and maintenance is one way of carrying out performance tuning on Redshift. That’s why we’ve built Integrate.io, a powerful Redshift analytics platform that provides a single user-friendly dashboard to monitor what’s going on in your AWS environment smoothly – we have abstracted laborious and time-consuming tasks saving you time and resources. We are the best Redshift pit crew in the game. Want to try it out for yourself? Sign up today for a free trial.