Some highlights of the March and April 2017 U-SQL refreshs: PIVOT/UNPIVOT, Sharing across ADLA accounts, U-SQL Packages, fast file sets, new default numbers of HASH DISTRIBUTION buckets, easier C# integration with CROSS APPLY
The last two Azure Data Lake refreshs in March and April brought a lot of interesting new capabilities in U-SQL.
You can find the details by following the two links to our Azure Data Lake blog posts above which in turn will take you to the detailed release notes on our Azure Data Lake GitHub site.
In this post I would like to call out some of the highlights.
Based on popular customer demand and frequent questions, we added PIVOT
and UNPIVOT
expressions. Like in SQL Server, they need a static list of values to pivot on. But unlike SQL Server, UNPIVOT
also provides you with the ability to include rows that have null
values! If you need a more dynamic pivot, we still recommend that you use a SqlMap
typed column for the "unknown" columns.
The March refresh also added the ability to access U-SQL Catalog objects such as tables and assemblies across different Azure Data Lake Analytics accounts. The only requirement is that they share the same Azure Active Directory context. This for example allows you to set up a global ADLA and ADLS account that contains files and U-SQL databases in one subscription and then give different teams read and execute access to the objects from their own ADLA accounts in their own subscription. These ADLA accounts in turn have their own default ADLS accounts so team specific objects are written (and billed) into the team specific store.
U-SQL packages are a way to package up more complex assembly references, variable declarations and resource deployments so a script writer can import a single package and get all the references and variable declarations in a single statement. It also allows developers to provide an abstraction to the script writers behind which they can refactor their assembly structures without impact on the scripts.
The easier C# integration with CROSS APPLY should make it easier to use the CROSS APPLY with simple set-based C# expressions without the need to have to use a custom applier or wrap everything into a SqlMap
or SqlArray
data type.
One problem we noticed with U-SQL tables was that a lot of people just used the HASH
distribution scheme without specifying an explicit number of hash buckets. This often lead to over- or under-partitioning of the table which negatively impacted the performance experience when loading or querying such tables. Therefore, we introduced in April a new algorithm to determine the number of hash buckets on the first insertion into a table (or table partition). The goal is to get a more optimal distribution size with better out-of-the-box performance experience.
Speaking of performance experience: We now also explain how to enable the preview of faster, more scalable file sets! If you use file sets in your U-SQL queries, there is no reason not to try it! If you do please report back here in the comments or through other channels (twitter, email etc) how it works for you!
For more details and many more features and updates please review the summaries on our Azure Data Lake blog and finally go into the details on the release notes themselves!April