9/26/2023 0 Comments Redshift sortkey and distkeyBecause err_code was the DISTKEY, at least 95.5% of the rows were put on one particular node. It actually has 95.5% of the rows in the table. You can see one of the error codes (1204) has an extremely large number of rows compared to the others. Running the following query will tell you how many rows each err_code has: SELECT err_code, count(*) FROM system_errors2 GROUP BY err_code The slowest node must have had more rows than the other nodes. In this case, it took four times more than the average. This means that the slowest node took significantly longer than the average processing time. Why did this happen? The query’s execution details will provide insights: Why would the query against this table be even slower than the table with no DISTKEY/SORTKEY?Īfter creating two tables with and without DISTKEY, the table with the DISTKEY was much slower than the other: no DISTKEY -> 20.52 seconds This new table puts all rows of an error code on the same node and stores them next to each other. Shockingly, this query took 54.9 seconds! That’s 2.5 times slower than the first query against a table with no DISTKEY/SORTKEY. Take a look at how the query performs against the new table: SELECT err_code, created_at, count(*) FROM system_errors2 GROUP BY created_at, err_code If you want to run a query grouped by err_code without created_at, choose err_code as DISTKEY and SORTKEY. To collocate all relevant rows in a single node, you can use either the column err_code or created_at as the DISTKEY. This can be done by defining the DISTKEY. You can avoid this by putting all rows sharing the same err_code and created_at values on a single node. More importantly, a large amount of data was sent to the leader node across the network, which became the performance bottleneck. That's why you see two "Aggregate" steps in the above screenshot. Each node must aggregate its own rows first then, the leader node has to aggregate the results again. This warning occurred because rows that need to be aggregated (rows sharing the same err_code and created_at values) are spread across multiple compute nodes. Look at the warning sign! Something must have been wrong: Thankfully, it offers useful graphs and metrics to analyze query performance.īelow are the "Query Execution Details" for the query: Now it’s time to check query performance by analyzing the Amazon Redshift Console. You can improve this by investigating query performance.ĭISTKEY Redshift: Investigating The Query But if you look at the CPU usage, both compute nodes were used up to 30% of CPU. This isn’t too bad, considering the number of rows in the table. On this Redshift cluster (2-node dc1.large), the query took 20.52 seconds to execute. You can create a table without the DISTKEY before setting it and see how the query performs: CREATE TABLE system_errors1 ( The query gets the number of errors per error type for each time slice: SELECT err_code, created_at, count(*) FROM system_errors1 GROUP BY created_at, err_code Here, there is a query that needs optimizing. Each table has 282 million rows (lots of errors!). Each record of the table consists of an error that happened on a system with its (1) timestamp, and (2) error code. This example uses a series of tables called system_errors# where # is a series of numbers.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |