Avoid Row Skew

Row skew results when a table uses KEY-based distribution and the values in the DISTKEY column are not evenly distributed. The row skew metric is a positive number ranging from 1 to the number of rows in the table, where 1 means the rows are spread perfectly evenly. Row skew is the ratio of:

the number of rows on the node containing the most rows for the table, to
the number of rows on the node containing the fewest rows for the table
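As a minimal sketch of the ratio defined above, the function below computes row skew from a list of per-node row counts. The name `row_skew` and the sample counts are hypothetical, and returning infinity when a node holds no rows is an assumption of this sketch rather than documented behavior.

```python
def row_skew(rows_per_node):
    """Return the max/min row-count ratio across nodes for one table.

    rows_per_node: iterable of non-negative row counts, one entry per node.
    """
    counts = list(rows_per_node)
    if not counts:
        raise ValueError("need at least one node")
    fewest = min(counts)
    if fewest == 0:
        # Assumption: a node with no rows makes the ratio undefined,
        # so this sketch reports infinity instead of dividing by zero.
        return float("inf")
    return max(counts) / fewest


# Hypothetical 4-node cluster holding 1,000 rows of one table.
balanced = row_skew([250, 250, 250, 250])  # every node holds the same share
skewed = row_skew([700, 100, 100, 100])    # one node holds most of the rows
print(balanced, skewed)  # 1.0 7.0
```

A result of 1.0 means the table is spread evenly; the larger the number, the more one node dominates.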

High row skew results in uneven node disk utilization (cost) and slower queries (performance).

The chart below shows a real-world example. With uneven disk utilization, one or a few nodes end up holding more rows for that table than the others. This can become a major cost problem if you have to add nodes to your cluster just because a single node is skewed.

With high row skew, even a straight SELECT on that table will be slower than it would be with evenly distributed rows. This is because one node holds more data than the others, and query execution must wait for the “slowest” node to send its data up to the leader.
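The "wait for the slowest node" effect can be illustrated with a deliberately simplified model. It assumes every node scans its own rows at the same fixed rate and ignores network and leader overhead. The function name, row counts, and scan rate are all hypothetical.

```python
def scan_time_seconds(rows_per_node, rows_per_second):
    """Estimate the wall-clock time of a parallel full-table scan.

    Simplifying assumption: nodes work in parallel at an identical rate,
    so the scan finishes only when the most heavily loaded node is done.
    """
    return max(rows_per_node) / rows_per_second


# Two hypothetical layouts of the same 10 million rows across 4 nodes.
even_layout = [2_500_000, 2_500_000, 2_500_000, 2_500_000]
skewed_layout = [7_000_000, 1_000_000, 1_000_000, 1_000_000]
rate = 1_000_000  # assumed rows scanned per second per node

even_time = scan_time_seconds(even_layout, rate)
skewed_time = scan_time_seconds(skewed_layout, rate)
print(even_time, skewed_time)  # 2.5 7.0
```

Both layouts hold the same total number of rows, yet in this model the skewed one takes almost three times as long because three nodes sit idle while the fourth finishes.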

There are two options to eliminate row skew:

select a DISTKEY whose values are evenly distributed (effectively random), or
change the distribution style to EVEN or ALL
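The two options above can be compared with a small simulation. This is a sketch under stated assumptions: `zlib.crc32` stands in for whatever hash function the database actually uses, EVEN distribution is modeled as plain round-robin, and the node count, column values, and function names are hypothetical.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def distribute_by_key(values, num_nodes=NUM_NODES):
    """KEY-style placement: hash each row's key value to pick a node."""
    counts = [0] * num_nodes
    for value in values:
        # Assumption: crc32 is only a stand-in for the real hash function.
        node = zlib.crc32(str(value).encode("utf-8")) % num_nodes
        counts[node] += 1
    return counts


def distribute_even(num_rows, num_nodes=NUM_NODES):
    """EVEN-style placement, modeled here as round-robin."""
    base, extra = divmod(num_rows, num_nodes)
    return [base + (1 if node < extra else 0) for node in range(num_nodes)]


def skew(counts):
    """Ratio of the fullest node to the emptiest node (inf if one is empty)."""
    return max(counts) / min(counts) if min(counts) else float("inf")


# 10,000 rows keyed on a column where 90% of the rows share one value.
low_cardinality = ["US"] * 9000 + ["DE"] * 500 + ["FR"] * 300 + ["JP"] * 200
# The same number of rows keyed on a unique, evenly spread identifier.
high_cardinality = range(10_000)

by_country = distribute_by_key(low_cardinality)
by_id = distribute_by_key(high_cardinality)
round_robin = distribute_even(10_000)

for label, counts in [("country key", by_country),
                      ("unique id key", by_id),
                      ("EVEN", round_robin)]:
    print(label, counts, skew(counts))
```

Because every row with the same key value hashes to the same node, the dominant country value pins at least 9,000 of the 10,000 rows to one node, while the unique identifier and the round-robin placement spread rows far more evenly. ALL distribution is not simulated because it copies the whole table to every node, so every node holds the same number of rows.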

Row skew should be tolerated if – and only if – you make a conscious decision to optimize a single query. See the section “Use DISTKEYs Only When Necessary” in this article for more information.
