Questions about tdigest* KQL functions

Question

In the documentation for the tdigest function, sample outputs are shown in the following format, but there is no explanation of what each of the nested arrays represents.

[[n],[a,b,c],[d,e,f]]

It appears the second array is every unique sample that was in the input and the third array is the number of occurrences for each of them. Is that correct and what does the first array represent?

Also, since the digest produced contains every unique sample and doesn't do any centroid compression or calculate/return trimmed-means, this doesn't appear to be a true t-digest and increases the query time. Are there plans to optimize the implementation?

Accepted Answer

Hi Kyle Burney,

Thanks for reaching out to Microsoft Q&A.

The output format of the tdigest function in Azure Data Explorer is structured as below:

[[n],[a,b,c],[d,e,f]]. Each nested array represents different components of the t-digest aggregation results:

First Array ([n]): This array contains a single value, which represents the total number of samples processed. In the context of the t-digest function, this indicates how many data points were aggregated.
Second Array ([a,b,c]): This array lists every unique sample that was present in the input data. It reflects the distinct values that contributed to the aggregation.
Third Array ([d,e,f]): This array shows the count of occurrences for each of the unique samples listed in the second array. Each element corresponds to the frequency of the respective sample in the input data.
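
To make the layout described above concrete, here is a minimal Python sketch that builds the same three-array shape from a list of samples and reads a percentile back out of it. It only mirrors the interpretation given in this answer (total count, unique values, occurrences); it is not Kusto's internal implementation, and the function names `describe_digest` and `percentile_from_digest` are hypothetical.

```python
import math
from collections import Counter


def describe_digest(samples):
    """Build [[n], [unique values], [counts]] as interpreted in this answer."""
    counts = Counter(samples)
    values = sorted(counts)
    return [[len(samples)], values, [counts[v] for v in values]]


def percentile_from_digest(digest, p):
    """Nearest-rank percentile computed from the value and count arrays."""
    (n,), values, counts = digest
    rank = max(1, math.ceil(p / 100 * n))  # 1-based rank of the wanted sample
    running = 0
    for value, count in zip(values, counts):
        running += count
        if running >= rank:
            return value
    return values[-1]


digest = describe_digest([3, 1, 2, 3, 2, 3])
print(digest)
print(percentile_from_digest(digest, 50))
```

Because every unique value is kept, the percentile here is exact, and the structure grows with the number of distinct inputs.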

Your concern that the tdigest function does not perform centroid compression or return trimmed means, and so is not a true t-digest, is valid. A true t-digest algorithm maintains a set of centroids that approximate the distribution of the data, allowing more efficient quantile estimation with bounded error. The current function in Azure Data Explorer seems to preserve all unique samples and their counts, which could indeed lead to increased query times and larger data structures, especially with large datasets.
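
For comparison, centroid compression can be sketched roughly as follows. This is a simplified greedy merge that uses a size bound of the form 4 * N * q * (1 - q) / delta, so centroids near the median may absorb many samples while centroids near the tails stay small. It is an illustration under those assumptions only: it is neither the reference t-digest implementation nor what Azure Data Explorer does, and `compress_to_centroids` and `delta` are hypothetical names.

```python
def compress_to_centroids(value_counts, delta=10):
    """Greedily merge (value, count) pairs into [mean, weight] centroids.

    A centroid may keep growing while its weight stays within
    4 * N * q * (1 - q) / delta, where q is the approximate quantile
    of the centroid's midpoint and N is the total weight.
    """
    pairs = sorted(value_counts)
    total = sum(count for _, count in pairs)
    centroids = []
    seen = 0  # weight lying entirely to the left of the last centroid
    for value, count in pairs:
        if centroids:
            mean, weight = centroids[-1]
            q = (seen + (weight + count) / 2) / total
            limit = 4 * total * q * (1 - q) / delta
            if weight + count <= limit:
                # Fold the new value into the last centroid (weighted mean).
                new_weight = weight + count
                new_mean = mean + (value - mean) * count / new_weight
                centroids[-1] = [new_mean, new_weight]
                continue
            seen += weight
        centroids.append([float(value), count])
    return centroids


pairs = [(v, 1) for v in range(1000)]
centroids = compress_to_centroids(pairs, delta=10)
print(len(pairs), "unique values ->", len(centroids), "centroids")
```

The total weight is preserved while the number of stored entries drops well below the number of unique values, which is the trade-off the question refers to: bounded size in exchange for approximate quantiles.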

The docs give no specific details about future optimizations for this implementation, but feedback mechanisms are in place for users to express their needs and suggest improvements. I would suggest keeping an eye on the Azure Data Explorer release notes for updates.

https://video2.skills-academy.com/ro-ro/azure/data-explorer/kusto/query/tdigest-aggregation-function

Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
