How can I speed up my query on materialized view?

Gavin Cheung 5 Reputation points
2024-01-26T12:00:49.65+00:00

I want to run a simple query on a table (~100M records) with duplicates:

event_table
| where match_id == 12345

In order to deduplicate the table first, I created a materialized view which deduplicates using the unique id column:

.create materialized-view with (backfill=true) event_table_mv on table event_table {
	event_table
	| summarize take_any(*) by id
}

If I compare the speeds of the following queries, I find the is much much slower:

// not using MV
event_table
| where match_id == 12345
| summarize take_any(*) by id

// using MV
event_table_mv
| where match_id == 12345

The summaries for the 2 queries are shown here. User's image User's image

I see that significantly more cold data is being accessed in the MV query, but when I run .show materialized-view event_table_mv details, I see that the HotExtentSize is equal to the TotalExtentSize and the MV is healthy. Is there a way to reduce the query time of the MV so that I can avoid having to deduplicate every time I query the table?

Azure Data Explorer
Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.
501 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Wilko van de Velde 2,226 Reputation points
    2024-01-30T20:56:16.4466667+00:00

    Hi @Gavin Cheung , Because we don't have your dataset, so I have to make a guess. But a Kusto query is treated like a tunnel, we start at the top with lots of data. At each statement, the data if filtered or manipulated.

    Could it be that the filter on match_id removes many records of the original dataset? So that the summarize needs to calculate with a smaller dataset.

    For example:

    event_table						//Contains 100M records
    | where match_id == 12345		//After filtering the dataset contains only 1M records
    | summarize take_any(*) by id	//The summarize only needs to calculate with a set of 1M records
    

    The query with the materialized view is basically:

    event_table						//Contains 100M records
    | summarize take_any(*) by id 	//The summarize only needs to calculate over 100M records
    | where match_id == 12345		//Then the filter is applied
    

    Try executing both queries, I think it the second query will be a slower than the first one.

    I have tried to imitate this situation, but I only see a small difference where the second option is a bit slower. Kind Regards, Wilko

    ---Please do not forget to "Accept the answer” wherever the information provided helps you, this can be beneficial to other community members. If you have extra questions about this answer, please click "Comment".

    1 person found this answer helpful.