Troubleshoot a slow query on a dedicated SQL Pool
Applies to: Azure Synapse Analytics
This article helps you identify the reasons and apply mitigations for common performance issues with queries on an Azure Synapse Analytics dedicated SQL pool.
Follow the steps to troubleshoot the issue or execute the steps in the notebook via Azure Data Studio. The first three steps walk you through collecting telemetry, which describes the lifecycle of a query. The references at the end of the article help you analyze potential opportunities found in the data collected.
Note
Before attempting to open this notebook, make sure that Azure Data Studio is installed on your local machine. To install it, go to Learn how to install Azure Data Studio.
Important
Most of the reported performance issues are caused by:
- Outdated statistics
- Unhealthy clustered columnstore indexes (CCIs)
To save troubleshooting time, make sure that the statistics are created and up-to-date and CCIs have been rebuilt.
Step 1: Identify the request_id (aka QID)
The `request_id` of the slow query is required to research potential reasons for the slowness. Use the following script as a starting point for identifying the query you want to troubleshoot. Once you've identified the slow query, note down the `request_id` value.
-- Monitor active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed','Failed','Cancelled')
AND session_id <> session_id()
-- AND [label] = '<YourLabel>'
-- AND resource_allocation_percentage is not NULL
ORDER BY submit_time DESC;
-- Find top 10 longest running queries
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;
To better target the slow queries, use the following tips when you run the script:
- Sort by either `submit_time DESC` or `total_elapsed_time DESC` to bring the longest-running queries to the top of the result set.
- Use `OPTION(LABEL='<YourLabel>')` in your queries and then filter the `label` column to identify them.
- Consider filtering out any QIDs that don't have a value for `resource_allocation_percentage` when you know the target statement is contained in a batch. Note: Be cautious with this filter as it may also filter out some queries that are being blocked by other sessions.
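For example, the label technique looks like the following minimal sketch; the table name and label text are illustrative placeholders:

```sql
-- Tag the statement you want to trace with a label
SELECT COUNT(*)
FROM dbo.FactSales
OPTION (LABEL = 'Sales_Count_Debug');

-- Then locate it in the DMV by that label
SELECT request_id, [status], submit_time, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Sales_Count_Debug'
ORDER BY submit_time DESC;
```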
Step 2: Determine where the query is taking time
Run the following script to find the step that may cause the performance issue of the query. Update the variables in the script with the values described in the following table. Change the `@ShowActiveOnly` value to 0 to get the full picture of the distributed plan. Take note of the `StepIndex`, `Phase`, and `Description` values of the slow step identified from the result set.
Parameter | Description |
---|---|
`@QID` | The `request_id` value obtained in Step 1 |
`@ShowActiveOnly` | 0 - Show all steps for the query<br>1 - Show only the currently active step |
DECLARE @QID VARCHAR(16) = '<request_id>', @ShowActiveOnly BIT = 1;
-- Retrieve session_id of QID
DECLARE @session_id VARCHAR(16) = (SELECT session_id FROM sys.dm_pdw_exec_requests WHERE request_id = @QID);
-- Blocked by Compilation or Resource Allocation (Concurrency)
SELECT @session_id AS session_id, @QID AS request_id, -1 AS [StepIndex], 'Compilation' AS [Phase],
'Blocked waiting on '
+ MAX(CASE WHEN waiting.type = 'CompilationConcurrencyResourceType' THEN 'Compilation Concurrency'
WHEN waiting.type LIKE 'Shared-%' THEN ''
ELSE 'Resource Allocation (Concurrency)' END)
+ MAX(CASE WHEN waiting.type LIKE 'Shared-%' THEN ' for ' + REPLACE(waiting.type, 'Shared-', '')
ELSE '' END) AS [Description],
MAX(waiting.request_time) AS [StartTime], GETDATE() AS [EndTime],
DATEDIFF(ms, MAX(waiting.request_time), GETDATE())/1000.0 AS [Duration],
NULL AS [Status], NULL AS [EstimatedRowCount], NULL AS [ActualRowCount], NULL AS [TSQL]
FROM sys.dm_pdw_waits waiting
WHERE waiting.session_id = @session_id
AND ([type] LIKE 'Shared-%' OR
[type] in ('ConcurrencyResourceType', 'UserConcurrencyResourceType', 'CompilationConcurrencyResourceType'))
AND [state] = 'Queued'
GROUP BY session_id
-- Blocked by another query
UNION ALL
SELECT @session_id AS session_id, @QID AS request_id, -1 AS [StepIndex], 'Compilation' AS [Phase],
'Blocked by ' + blocking.session_id + ':' + blocking.request_id + ' when requesting ' + waiting.type + ' on '
+ QUOTENAME(waiting.object_type) + waiting.object_name AS [Description],
waiting.request_time AS [StartTime], GETDATE() AS [EndTime],
DATEDIFF(ms, waiting.request_time, GETDATE())/1000.0 AS [Duration],
NULL AS [Status], NULL AS [EstimatedRowCount], NULL AS [ActualRowCount],
COALESCE(blocking_exec_request.command, blocking_exec_request.command2) AS [TSQL]
FROM sys.dm_pdw_waits waiting
INNER JOIN sys.dm_pdw_waits blocking
ON waiting.object_type = blocking.object_type
AND waiting.object_name = blocking.object_name
INNER JOIN sys.dm_pdw_exec_requests blocking_exec_request
ON blocking.request_id = blocking_exec_request.request_id
WHERE waiting.session_id = @session_id AND waiting.state = 'Queued'
AND blocking.state = 'Granted' AND waiting.type != 'Shared'
-- Request Steps
UNION ALL
SELECT @session_id AS session_id, @QID AS request_id, step_index AS [StepIndex],
'Execution' AS [Phase], operation_type + ' (' + location_type + ')' AS [Description],
start_time AS [StartTime], end_time AS [EndTime],
total_elapsed_time/1000.0 AS [Duration], [status] AS [Status],
CASE WHEN estimated_rows > -1 THEN estimated_rows END AS [EstimatedRowCount],
CASE WHEN row_count > -1 THEN row_count END AS [ActualRowCount],
command AS [TSQL]
FROM sys.dm_pdw_request_steps
WHERE request_id = @QID
AND [status] = CASE @ShowActiveOnly WHEN 1 THEN 'Running' ELSE [status] END
ORDER BY StepIndex;
Step 3: Review step details
Run the following script to review the details of the step identified in the previous step. Update the variables in the script with the values described in the following table. Change the `@ShowActiveOnly` value to 0 to compare all distribution timings. Take note of the `wait_type` value for the distribution that may cause the performance issue.
Parameter | Description |
---|---|
`@QID` | The `request_id` value obtained in Step 1 |
`@StepIndex` | The `StepIndex` value identified in Step 2 |
`@ShowActiveOnly` | 0 - Show all distributions for the given `StepIndex` value<br>1 - Show only the currently active distributions for the given `StepIndex` value |
DECLARE @QID VARCHAR(16) = '<request_id>', @StepIndex INT = <StepIndex>, @ShowActiveOnly BIT = 1;
WITH dists
AS (SELECT request_id, step_index, 'sys.dm_pdw_sql_requests' AS source_dmv,
distribution_id, pdw_node_id, spid, 'NativeSQL' AS [type], [status],
start_time, end_time, total_elapsed_time, row_count
FROM sys.dm_pdw_sql_requests
WHERE request_id = @QID AND step_index = @StepIndex
UNION ALL
SELECT request_id, step_index, 'sys.dm_pdw_dms_workers' AS source_dmv,
distribution_id, pdw_node_id, sql_spid AS spid, [type],
[status], start_time, end_time, total_elapsed_time, rows_processed as row_count
FROM sys.dm_pdw_dms_workers
WHERE request_id = @QID AND step_index = @StepIndex
)
SELECT sr.step_index, sr.distribution_id, sr.pdw_node_id, sr.spid,
sr.type, sr.status, sr.start_time, sr.end_time,
sr.total_elapsed_time, sr.row_count, owt.wait_type, owt.wait_time
FROM dists sr
LEFT JOIN sys.dm_pdw_nodes_exec_requests owt
ON sr.pdw_node_id = owt.pdw_node_id
AND sr.spid = owt.session_id
AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests'
AND sr.status = 'Running') -- sys.dm_pdw_sql_requests status
OR (sr.source_dmv = 'sys.dm_pdw_dms_workers'
AND sr.status not LIKE 'Step[CE]%')) -- sys.dm_pdw_dms_workers final statuses
WHERE sr.request_id = @QID
AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests' AND sr.status =
CASE WHEN @ShowActiveOnly = 1 THEN 'Running' ELSE sr.status END)
OR (sr.source_dmv = 'sys.dm_pdw_dms_workers' AND sr.status NOT LIKE
CASE WHEN @ShowActiveOnly = 1 THEN 'Step[CE]%' ELSE '' END))
AND sr.step_index = @StepIndex
ORDER BY distribution_id;
Step 4: Diagnose and mitigate
Compilation phase issues
According to the `Description` values obtained in Step 2, check the relevant section for more information from the following table.

Description | Common Cause |
---|---|
Compilation Concurrency | Blocked: Compilation Concurrency |
Resource Allocation (Concurrency) | Blocked: resource allocation |

If the query has "Running" status in Step 1 but there's no step information in Step 2, check the cause that best fits your scenario to get more information from the following table.

Scenario | Common Cause |
---|---|
Statement contains complex join-filter logic or performs joins in the `WHERE` clause | Complex query or older JOIN syntax |
Statement is a long-running `DROP TABLE` or `TRUNCATE TABLE` statement | Long-running DROP TABLE or TRUNCATE TABLE |
CCIs have a high percentage of deleted or open rows (see Optimizing clustered columnstore indexes) | Unhealthy CCIs (generally) |

Analyze the result set in Step 1 for one or more `CREATE STATISTICS` statements executed immediately after the slow query submission. Check the cause that best fits your scenario from the following table.

Scenario | Common Cause |
---|---|
Statistics created unexpectedly | Delay from auto-create statistics |
Statistics creation failed after 5 minutes | Auto-create statistics timeouts |
Blocked: Compilation Concurrency
Compilation concurrency blocks rarely occur. However, if you encounter this type of block, it signifies that a large volume of queries were submitted in a short time and have been queued to begin compilation.
Mitigations
Reduce the number of queries submitted concurrently.
Blocked: resource allocation
Being blocked for resource allocation means that your query is waiting to execute based on:
- The amount of memory granted based on the resource class or workload group assignment associated with the user.
- The amount of available memory on the system or workload group.
- (Optional) The workload group/classifier importance.
Mitigations
- Wait for the blocking session to complete.
- Evaluate the resource class choice. For more information, see concurrency limits.
- Evaluate whether it's preferable to kill the blocking session.
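If you decide to terminate the blocking session, you can kill it by its session ID, as in this minimal sketch ('SID1234' is a placeholder):

```sql
-- Terminate the blocking session; replace the placeholder with the real session_id
KILL 'SID1234';
```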
Complex query or older JOIN syntax
You may encounter a situation where the default query optimizer methods prove ineffective and the compilation phase takes a long time. It may occur if the query:
- Involves a high number of joins and/or subqueries (complex query).
- Performs joins in the `WHERE` clause (not ANSI-92 style joins).
Though these scenarios are atypical, you have options to attempt to override the default behavior to reduce the time it takes for the query optimizer to choose a plan.
Mitigations
- Use ANSI-92 style joins.
- Add query hints: `OPTION(FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'))`. For more information, see FORCE ORDER and Cardinality Estimation (SQL Server).
- Break the query into multiple, less complex steps.
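For illustration, a query that joins in the `WHERE` clause can be rewritten with ANSI-92 joins and the hints above; the table and column names in this sketch are made-up placeholders:

```sql
-- Older, non-ANSI syntax: join predicate buried in the WHERE clause
SELECT s.SaleId, c.CustomerName
FROM dbo.FactSales s, dbo.DimCustomer c
WHERE s.CustomerKey = c.CustomerKey;

-- ANSI-92 rewrite, optionally constraining the optimizer's search space
SELECT s.SaleId, c.CustomerName
FROM dbo.FactSales s
INNER JOIN dbo.DimCustomer c
    ON s.CustomerKey = c.CustomerKey
OPTION (FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```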
Long-running DROP TABLE or TRUNCATE TABLE
For execution time efficiencies, the `DROP TABLE` and `TRUNCATE TABLE` statements defer storage cleanup to a background process. However, if your workload performs a high number of `DROP`/`TRUNCATE TABLE` statements in a short time frame, it's possible that metadata becomes crowded and causes subsequent `DROP`/`TRUNCATE TABLE` statements to execute slowly.
Mitigations
Identify a maintenance window, stop all workloads, and run DBCC SHRINKDATABASE to force an immediate cleanup of previously dropped or truncated tables.
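A minimal sketch of the cleanup, assuming a maintenance window with workloads paused; the database name is a placeholder:

```sql
-- Forces immediate cleanup of storage left behind by dropped/truncated tables
DBCC SHRINKDATABASE (N'YourDedicatedSqlPool');
```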
Unhealthy CCIs (generally)
Poor clustered columnstore index (CCI) health requires extra metadata, which can cause the query optimizer to take more time to determine an optimal plan. To avoid this situation, ensure that all of your CCIs are in good health.
Mitigations
Assess and correct clustered columnstore index health in a dedicated SQL pool.
Delay from auto-create statistics
The automatic create statistics option, `AUTO_CREATE_STATISTICS`, is `ON` by default to help ensure the query optimizer can make good distributed plan decisions. However, the auto-creation process itself can make an initial query take longer than subsequent executions of the same query.
Mitigations
If the first execution of a query consistently requires statistics to be created, manually create statistics before running the query.
Auto-create statistics timeouts
The automatic create statistics option, `AUTO_CREATE_STATISTICS`, is `ON` by default to help ensure the query optimizer can make good distributed plan decisions. The auto-creation of statistics occurs in response to a SELECT statement and has a 5-minute threshold to complete. If the size of data and/or the number of statistics to be created require longer than the 5-minute threshold, the auto-creation of statistics will be abandoned so that the query can continue execution. The failure to create the statistics can negatively impact the query optimizer's ability to generate an efficient distributed execution plan, resulting in poor query performance.
Mitigations
Manually create the statistics instead of relying on the auto-create feature for the identified tables/columns.
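A minimal sketch of manual statistics maintenance; the table and column names are illustrative placeholders:

```sql
-- Create single-column statistics on a frequently joined/filtered column
CREATE STATISTICS stats_FactSales_CustomerKey
ON dbo.FactSales (CustomerKey);

-- Refresh statistics after significant data changes
UPDATE STATISTICS dbo.FactSales;
```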
Execution phase issues
Use the following table to analyze the result set in Step 2. Determine your scenario and check the common cause for detailed information and the possible mitigation steps.
Scenario | Common Cause |
---|---|
`EstimatedRowCount`/`ActualRowCount` < 25% | Inaccurate estimates |
The `Description` value indicates `BroadcastMoveOperation` and the query references a replicated table. | Uncached replicated tables |
1. `@ShowActiveOnly` = 0<br>2. A high or unexpected number of steps (`step_index`) is observed.<br>3. Data types of join columns aren't identical between tables. | Mismatched data type/size |
1. The `Description` value indicates `HadoopBroadcastOperation`, `HadoopRoundRobinOperation`, or `HadoopShuffleOperation`.<br>2. The `total_elapsed_time` value of a given `step_index` is inconsistent between executions. | Ad hoc external table queries |

Check the `total_elapsed_time` value obtained in Step 3. If it's significantly higher in a few distributions in a given step, follow these steps:

1. Check the data distribution for every table referenced in the `TSQL` field for the associated `StepIndex` by running the following command against each: `DBCC PDW_SHOWSPACEUSED(<table>);`
2. If <minimum rows value>/<maximum rows value> > 0.1, go to Data skew (stored). Otherwise, go to In-flight data skew.
Inaccurate estimates
Keep your statistics up to date to ensure that the query optimizer generates an optimal plan. When the estimated row count is significantly less than the actual count, the statistics need to be maintained.
Mitigations
Create and update statistics on the columns involved in the problematic query (see the manual statistics sketch earlier in this article).
Uncached replicated tables
If you use replicated tables and fail to warm the replicated table cache properly, you may see unexpectedly poor performance due to extra data movement or the creation of a suboptimal distributed plan.
Mitigations
- Warm the replicated cache after DML operations (see the sketch below).
- If there are frequent DML operations, change the distribution of the table to `ROUND_ROBIN`.
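A minimal sketch of warming the replicated table cache after a DML operation; the table name is an illustrative placeholder:

```sql
-- A trivial read against the replicated table triggers a rebuild of its cache
SELECT TOP 1 * FROM dbo.DimCustomer;
```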
Mismatched data type/size
When joining tables, make sure that the data type and size of the join columns match. Mismatches result in unnecessary data movement that decreases the CPU, IO, and network bandwidth available to the rest of the workload.
Mitigations
Rebuild the tables to correct the related table columns that don't have identical data type and size.
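As a sketch, you can compare the metadata of the join columns to spot type or size mismatches before rebuilding; the table and column names below are illustrative:

```sql
-- Compare data type, length, precision, and scale of the join columns
SELECT t.name AS table_name, c.name AS column_name,
       ty.name AS data_type, c.max_length, c.precision, c.scale
FROM sys.columns c
JOIN sys.tables t  ON c.object_id = t.object_id
JOIN sys.types ty  ON c.user_type_id = ty.user_type_id
WHERE t.name IN ('FactSales', 'DimCustomer')
  AND c.name = 'CustomerKey';
```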
Ad hoc external table queries
Queries against external tables are designed with the intention of bulk loading data into the dedicated SQL pool. Ad hoc queries against external tables may suffer variable durations due to external factors, such as concurrent storage container activities.
Mitigations
Load data into the dedicated SQL pool first and then query the loaded data.
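A minimal sketch of loading the external data once and then querying the internal copy; the schema, table names, and distribution choice are illustrative assumptions:

```sql
-- Load the external table's data into a distributed internal table (CTAS)
CREATE TABLE dbo.StagedSales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM ext.Sales;

-- Subsequent queries read the internal table, avoiding external storage latency
SELECT COUNT(*) FROM dbo.StagedSales;
```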
Data skew (stored)
Data skew means the data isn't distributed evenly across the distributions. Each step of the distributed plan requires all distributions to complete before moving to the next step. When your data is skewed, the full potential of the processing resources, such as CPU and IO, can't be achieved, resulting in slower execution times.
Mitigations
Review our guidance for distributed tables to assist your choice of a more appropriate distribution column.
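As an illustrative sketch, you can confirm skew with `DBCC PDW_SHOWSPACEUSED` and then redistribute the table on a better column with CTAS; the table and column names are placeholders:

```sql
-- Inspect per-distribution row counts to confirm skew
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');

-- Recreate the table hash-distributed on a higher-cardinality column
CREATE TABLE dbo.FactSales_new
WITH (DISTRIBUTION = HASH(OrderId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

-- Swap the tables once the copy completes
RENAME OBJECT dbo.FactSales TO FactSales_old;
RENAME OBJECT dbo.FactSales_new TO FactSales;
```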
In-flight data skew
In-flight data skew is a variant of the data skew (stored) issue, but it's not the distribution of data on disk that is skewed. The nature of the distributed plan for particular filters or grouped data causes a `ShuffleMoveOperation`-type operation, which produces a skewed output to be consumed downstream.
Mitigations
- Make sure that statistics are created and up-to-date.
- Change the order of your `GROUP BY` columns to lead with a higher-cardinality column.
- Create multi-column statistics if joins cover multiple columns.
- Add the query hint `OPTION(FORCE ORDER)` to your query.
- Refactor the query.
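A minimal sketch of multi-column statistics covering a multi-column join; the table and column names are illustrative placeholders:

```sql
-- Multi-column statistics help the optimizer estimate combined selectivity
CREATE STATISTICS stats_FactSales_Cust_Date
ON dbo.FactSales (CustomerKey, OrderDateKey);
```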
Wait type issues
If none of the above common issues apply to your query, the Step 3 data affords the opportunity to determine which wait types (in `wait_type` and `wait_time`) are interfering with query processing for the longest-running step. There are a large number of wait types, and they're grouped into related categories because they have similar mitigations. Follow these steps to locate the wait category of your query step:
- Identify the `wait_type` in Step 3 that is taking the most time.
- Locate the wait type in the wait categories mapping table and identify the wait category it's included in.
- Expand the section related to the wait category from the following list for recommended mitigations.
Compilation
Follow these steps to mitigate wait type issues of the Compilation category:
- Rebuild indexes for all objects involved in the problematic query.
- Update statistics on all objects involved in the problematic query.
- Test the problematic query again to validate whether the issue persists.
If the issue persists, then:
1. Create a .sql file with: `SET QUERY_DIAGNOSTICS ON; <Your_SQL>; SET QUERY_DIAGNOSTICS OFF;`
2. Open a Command Prompt window and run the following command: `sqlcmd -S <servername>.database.windows.net -d <databasename> -U <username> -G -I -i .\<sql_file_name>.sql -y0 -o .\<output_file_name>.txt`
3. Open <output_file_name>.txt in a text editor. Locate and copy the distribution-level execution plans (lines that begin with `<ShowPlanXML>`) from the longest-running step identified in Step 2 into separate text files with a .sqlplan extension. Note: Each step of the distributed plan will typically have 60 distribution-level execution plans recorded. Make sure that you're preparing and comparing execution plans from the same distributed plan step.
The Step 3 query frequently reveals a few distributions that take much longer than others. In SQL Server Management Studio, compare the distribution-level execution plans (from the .sqlplan files created) of a long-running distribution to a fast-running distribution to analyze potential causes for differences.
Lock, Worker Thread
- Consider changing tables that undergo frequent, small changes to utilize a row store index instead of CCI.
- Batch up your changes and update the target with more rows on a less frequent basis.
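As an illustrative sketch, a table that receives frequent small changes can be recreated with a clustered rowstore index via CTAS; the table, column, and distribution choices are placeholders:

```sql
-- Recreate a trickle-updated table with a rowstore (clustered) index instead of a CCI
CREATE TABLE dbo.SmallLookup_new
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED INDEX (LookupKey))
AS
SELECT * FROM dbo.SmallLookup;

-- Swap the tables once the copy completes
RENAME OBJECT dbo.SmallLookup TO SmallLookup_old;
RENAME OBJECT dbo.SmallLookup_new TO SmallLookup;
```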
Buffer IO, Other Disk IO, Tran Log IO
Unhealthy CCIs
Unhealthy CCIs contribute to increased IO, CPU, and memory allocation, which, in turn, negatively impacts the query performance. To mitigate this issue, try one of the following methods:
- Assess and correct clustered columnstore index health in a dedicated SQL pool.
- Run and review the output of the query listed at Optimizing clustered columnstore indexes to get a baseline.
- Follow the steps to rebuild indexes to improve segment quality, targeting the tables involved in the example problem query.
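A minimal sketch of rebuilding the CCI on a table involved in the problem query; the table name is an illustrative placeholder:

```sql
-- Rebuilding compacts open/deleted rowgroups and improves segment quality
ALTER INDEX ALL ON dbo.FactSales REBUILD;
```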
Outdated statistics
Outdated statistics can cause the generation of an unoptimized distributed plan, which involves more data movement than necessary. Unnecessary data movement increases the workload not only on your data at rest but also on `tempdb`. Because IO is a shared resource across all queries, performance impacts can be felt by the entire workload.
To remedy this situation, ensure all statistics are up-to-date, and a maintenance plan is in place to keep them updated for user workloads.
Heavy IO workloads
Your overall workload may be reading large amounts of data. Synapse dedicated SQL pools scale resources in accordance with the DWU. To achieve better performance, consider either or both of the following:
- Utilizing a larger resource class for your queries.
- Increasing compute resources by scaling up the DWU.
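A minimal sketch of assigning a user to a larger static resource class; the user name is an illustrative placeholder:

```sql
-- Queries submitted by this user now get the memory grant of the largerc resource class
EXEC sp_addrolemember 'largerc', 'LoadUser';
```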
CPU, Parallelism
Scenario | Mitigation |
---|---|
Poor CCI Health | Assess and correct clustered columnstore index health in a dedicated SQL pool |
User queries contain transformations | Move all formatting and other transformation logic into ETL processes so the formatted versions are stored |
Workload improperly prioritized | Implement workload isolation |
Insufficient DWU for workload | Consider increasing compute resources |
Network IO
If the issue occurs during a `RETURN` operation in Step 2:
- Reduce the number of concurrent parallel processes.
- Scale out the most impacted process to another client.
For all other data movement operations, the network issues are likely internal to the dedicated SQL pool. To attempt a quick mitigation, follow these steps:
- Scale your dedicated SQL pool to DW100c.
- Scale back to your desired DWU level.
SQL CLR
Avoid frequent use of the `FORMAT()` function by implementing an alternate way of transforming the data (for example, `CONVERT()` with style).
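For illustration, a date formatted with `FORMAT()` can usually be produced with `CONVERT()` and a style code instead; the table and column names are placeholders:

```sql
-- FORMAT() invokes the SQL CLR and is comparatively expensive
SELECT FORMAT(OrderDate, 'yyyy-MM-dd') AS order_day FROM dbo.FactSales;

-- CONVERT() with style 23 (yyyy-mm-dd) avoids the CLR call
SELECT CONVERT(VARCHAR(10), OrderDate, 23) AS order_day FROM dbo.FactSales;
```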