There are several common reasons why you may see your Redshift queries running slowly. Review this guide for information on how to improve Redshift query performance to speed up your queries. If you are still seeing performance issues after applying the tips in this guide, feel free to reach out to email@example.com.
Check to see if you’re running out of space
The query below will give you the percentage of storage used in your cluster, which you can use to check if this is the reason for slow query performance. If you’re nearing capacity (80% or greater), consider adding some more nodes or truncating some unused historical data.
SELECT sum(pct_used) FROM svv_table_info;
Write specific, well-structured queries
Writing well-structured queries that are as specific as possible will allow you to get results as quickly as possible. As an example, you might run this query to search for a certain set of data for ALL time (3 years).
SELECT * FROM table;
In comparison, if you were to run this query to search for this same data over only the last month, the results will appear faster.
SELECT * FROM table WHERE time > '2020-10-01'
As another example, if you select for every column (ex. 100 columns) of data on the table, the query will run significantly slower than one for just your necessary columns (ex. 2) such as the query below.
SELECT name, email FROM table
Depending on your desired results, a combination of the two would be most effective.
SELECT name, email FROM table WHERE time > '2020-10-01'
When running JOIN clauses, you should always try (if possible) to incorporate the table’s `distkey` (a property specific to Redshift, and always the ‘user_id’ column in Heap schema) in the JOIN clause. For example:
SELECT ... FROM <schema>.<table_name> LEFT JOIN <schema>.<other_table_name> ON <table_name>.session_id = <other_table_name>.session_id
The example above does not use the distkey “user_id” in the JOIN clause. Here’s a faster version that uses the distkey in the JOIN clause.
SELECT ... FROM <schema>.<table_name> LEFT JOIN <schema>.<other_table_name> USING (user_id, session_id)
Stagger ETL processes to avoid overlap
If you have multiple ETL processes loading into your warehouse at the same time, everything will slow down. Try to schedule them at different times when your cluster is least active.
For more tips, see the following Amazon docs on designing Redshift queries:
- Amazon Redshift best practices for designing queries
- Top 10 performance tuning techniques for Amazon Redshift
To learn more about writing SQL, check out these resources: