Amazon Redshift is a distributed, shared-nothing database that scales horizontally across multiple nodes. Query execution time is very tightly correlated with:

  • the # of rows and data a query processes.
  • the amount of data moving between nodes.

Below is an example of a poorly written query, and two optimizations to make it run faster.

Optimization #1: Limit Rows Processed by using a WHERE clause

Queries can run faster by minimizing the amount of data moving between nodes. In practice, this means being careful when writing multi-stage queries where the results of one stage feeds into the next.

In the case of our example query, modifying your ‘WHERE’ clauses to only select rows needed will minimize the amount of data that needs to be moved around and speed up the query.

Optimization #2: Limit Columns Scanned

Amazon Redshift is a columnar-oriented database. As a result, scanning a table doesn’t read each row in its entirety. Instead, individual columns can be scanned without needing to read other columns. This means that you should be careful to only select columns that you will use for your query. Try avoiding using a

SELECT *

operation in all cases.

Using these two optimizations when writing queries will have dramatic positive effects on your query speeds.