Maintain Accurate Table Statistics

Amazon Redshift builds a custom query execution plan for every query. For a given query plan, an amount of memory is allocated. The memory allocation is determined by estimating the amount of memory needed to store intermediate query results (as in a JOIN or aggregation).

The query plan allocates a certain amount of memory to each query, by estimating the amount of memory needed to store intermediate results (e.g. for a JOIN or aggregation).

It is important for a query to have sufficient memory to not spill to disk (go “disk-based”). Allocating too much memory is not desirable either, however. Queries do not share memory. Allocating more memory than needed wastes memory since it is unavailable to other queries.

An important point is that the system is not adaptive. If the plan was wrong and the query needs more (or less) memory than was allocated – the execution engine will not go back and adjust the memory allocation after the query has already started executing.

What could cause the plan to be wrong? A major factor is the number of rows in a table.

The ANALYZE command will ensure that the planner has an accurate, up-to-date view of the row counts for tables. Let’s look at an example of what happens if the statistics are wrong.

EXAMPLE 1 – Table has more rows than the planner thinks it has

In this example, the planner will allocate too little memory to the query. Once the query starts running, it will encounter that it requires more memory than it was allocated. The query will go disk-based and thus will run slower than otherwise.

This could have been avoided by running the query in a slot with enough memory. It would not have gone disk-based.

EXAMPLE 2 – Table has fewer rows than the planner thinks it has

In this example, the planner will allocate too much memory to the query. Once the query starts running it will encounter that it requires less memory to store intermediate results than it was allocated. There will be one of two possible results:

The query will not go disk-based, but it used up too much memory. That may cause other queries to go disk-based.
The query was allocated more memory than was available in the slot it ran in, and the query goes disk-based. This could have been avoided with up-to-date statistics.

Running ANALYZE

Amazon Redshift provides a statistics called “stats off” to help determine when to run the ANALYZE command on a table. The “stats off” metric is the positive % difference between the actual number of rows and the number of rows seen by the planner.