It is common to connect an application framework like Django to Amazon Redshift and use Redshift data directly in the application, i.e. in an OLTP scenario. Since Amazon Redshift is an OLAP database, it may not handle these queries well.
The challenge of using Redshift as an OLTP database is that queries lack the low latency that a traditional RDBMS provides for transactional workloads. Unlike OLTP databases, OLAP databases such as Redshift do not use conventional indexes. This is a consequence of Redshift's column-oriented storage design, a trade-off that favors big data analytical workloads.
Consider this example from a live production cluster. The user ‘django_redshift’ is querying the table ‘search_word_level_course_vector’, a table with 443,744 rows. The query ran 374,372 times. Each execution scans all 443,744 rows, takes about 0.02 seconds to run, and returns a single row.
The impact on the cluster is quite dramatic:
- 374,371 queries @ 0.02s per query equals 7,487 seconds, or roughly 125 minutes of query time. The commit queue backs up with all these requests, impacting the execution time of all other queries running in the cluster.
- The query volume drives up concurrency and may exceed the number of available WLM slots, which results in queue wait times for other queries running in that queue.
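The query-time arithmetic in the first bullet can be checked directly. This is a quick back-of-the-envelope sketch using the figures from the cluster example above:

```python
# Back-of-the-envelope check of the query-time figures cited above.
queries = 374_371          # single-row queries against the table
seconds_per_query = 0.02   # observed time per full-table scan

total_seconds = queries * seconds_per_query
total_minutes = total_seconds / 60

print(round(total_seconds))  # ~7,487 seconds
print(round(total_minutes))  # ~125 minutes
```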
There are two approaches to resolving the problem:
- Rewrite the queries to select all 443,744 rows of the table once, and then parse each row in application memory. Doing so would remove 374,371 queries from your Redshift database. Such a single query would take a few seconds, instead of 125 minutes.
- Use Amazon RDS and DBLINK to use Redshift as an OLTP database. In the post “Have your Postgres Cake and Eat it Too” we describe that approach in detail.
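The first approach can be sketched in plain Python. This is a minimal illustration, not code from the post: the words and vectors below are assumptions standing in for the result of a single bulk `SELECT` (e.g. one `cursor.fetchall()` call against the table):

```python
# Stand-in for the result of one bulk query such as
# "SELECT word, vector FROM search_word_level_course_vector";
# in production these rows would come from a single cursor.fetchall().
rows = [
    ("python",   [0.1, 0.2]),
    ("django",   [0.3, 0.4]),
    ("redshift", [0.5, 0.6]),
]

# Build an in-memory index once, replacing hundreds of thousands of
# per-word queries with dictionary lookups in application memory.
vectors_by_word = {word: vector for word, vector in rows}

def lookup(word):
    """One dictionary lookup instead of one full-table scan per word."""
    return vectors_by_word.get(word)
```

With this pattern the cluster executes one scan instead of 374,371, and each subsequent lookup is an O(1) operation in the application.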