CREATE TABLE (U-SQL): Creating a Table with Schema
Summary
U-SQL allows a managed table to be created by specifying a schema. The table must have a clustered index specified before it can contain data, and the table will be partitioned.
Note
All managed U-SQL tables are currently clustered tables where the cluster information is specified with a clustered index. In particular, other types of tables such as heaps and column store tables are not supported.
When creating a managed U-SQL table with a schema, a table schema has to be provided that contains at least one table column definition.
Syntax
Create_Managed_Table_With_Schema_Statement := 'CREATE' 'TABLE' ['IF' 'NOT' 'EXISTS'] Identifier Table_With_Schema.
Table_With_Schema := '(' { Column_Definition ',' } [ Table_Index Partition_Specification ] { ',' Column_Definition } ')' | '(' { Column_Definition ',' } [ Table_Index ] { ',' Column_Definition } ')' Partition_Specification.
Remarks
Identifier
Specifies the name of the table to be created. If the Identifier is a three-part identifier, the table will be created in the specified database and schema. If it is a two-part identifier, then the table will be created in the specified schema of the current database context. If the identifier is a simple identifier, then the table will be created in the current database and schema context. If a table or other object of the given name already exists in the specified database and schema context, or if the user has no permission to create a table, an error is raised.
IF NOT EXISTS
If the optional IF NOT EXISTS is specified, then the statement creates the table if it does not already exist, or succeeds without changes if the table already exists and the user has permission to at least enumerate all existing tables.
Table_With_Schema
A table schema contains at least one column definition (note the above syntax is slightly simplified for readability) and can optionally contain a table index and a partition specification. The table index is optional in the definition of the table, and the partition specification can either follow the index specification or follow the table schema. If the partition specification follows the schema, then the schema needs to contain the index specification.
Note
While the index definition is optional as part of the table definition, no data can be inserted into the table until an index has been defined. If the index definition is not part of the table definition, then a CREATE CLUSTERED INDEX statement has to be executed before data can be inserted into the table.
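For illustration (the table and index names here are hypothetical, and the exact CREATE CLUSTERED INDEX syntax is covered in its own article), a table can be created with columns only and made insertable afterwards:

```sql
// Illustrative names. The table is created without an index, so it cannot
// hold data yet.
CREATE TABLE IF NOT EXISTS dbo.Events
(
    EventID int,
    EventName string
);

// Define the clustered index (and distribution) before inserting any data.
CREATE CLUSTERED INDEX clx_EventID
ON dbo.Events(EventID ASC)
DISTRIBUTED BY HASH(EventID);
```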
Syntax
Column_Definition := Quoted_or_Unquoted_Identifier Built_in_Type.
Each column has an identifier, either quoted or unquoted, and is typed with one of the built-in U-SQL types. Note that constraints such as primary key, foreign key, or unique are currently not supported. Unlike in traditional SQL tables, nullability is part of the type and not a column property.
Table_Index
The table index defines the clustered index of the table. It specifies the name of the index, which is local to the table, as a quoted or unquoted identifier, and provides a list of columns that determine the order in which the rows are sorted. It essentially determines how the data is physically stored in the clustered table. The table index syntax looks like:
Syntax
Table_Index := 'INDEX' Quoted_or_Unquoted_Identifier 'CLUSTERED' '(' Sort_Item_List ')'.
Sort_Item_List := Sort_Item {',' Sort_Item}.
Sort_Item := Quoted_or_Unquoted_Identifier [Sort_Direction].
Sort_Direction := 'ASC' | 'DESC'.
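As a sketch of this syntax (the table and column names are illustrative), an index can sort on several columns with mixed directions; an omitted direction defaults to ascending:

```sql
// Illustrative table: rows are stored ordered by Region ascending,
// then SaleDate descending within each region.
CREATE TABLE dbo.Sales
(
    Region string,
    SaleDate DateTime,
    Amount double,
    INDEX clx_Region_SaleDate CLUSTERED(Region ASC, SaleDate DESC)
)
DISTRIBUTED BY HASH(Region);
```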
Partition_Specification
The partition specification provides information about how the data inside the table is partitioned and distributed. If the table index is provided, the partition specification has to be provided as well. The syntax looks like:
Syntax
Partition_Specification := [ 'PARTITIONED' ['BY'] '(' Identifier_List ')' ] Distribution_Specification.
Identifier_List := Quoted_or_Unquoted_Identifier {',' Quoted_or_Unquoted_Identifier}.
Distribution_Specification := 'DISTRIBUTED' ['BY'] Distribution_Scheme ['INTO' integer_or_long_literal].
Distribution_Scheme := 'RANGE' '(' Sort_Item_List ')' | 'HASH' '(' Identifier_List ')' | 'DIRECT' 'HASH' '(' Identifier ')' | 'ROUND' 'ROBIN'.
For detailed semantics about partitioning see the section on partitioning.
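As an illustration of the distribution specification (table and column names are hypothetical), a table can be distributed over ranges of a sortable column, leaving the number of buckets to the system:

```sql
// Illustrative table distributed by RANGE: the system determines the
// distribution bounds over ReadingTime.
CREATE TABLE dbo.Measurements
(
    SensorID int,
    Reading double,
    ReadingTime DateTime,
    INDEX clx_ReadingTime CLUSTERED(ReadingTime ASC)
)
DISTRIBUTED BY RANGE(ReadingTime);
```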
U-SQL Table Partitions and Distributions
U-SQL tables can be partitioned at two levels: the higher-level, coarse-grained partitioning into addressable partitions, and the lower-level, fine-grained distribution within a table or a partition.
U-SQL Table Partition
Many use cases around data life-cycle management, such as loading daily or hourly data, require that one can add individual partitions and manage them separately. This is best done by partitioning a table into individually addressable partitions. Each of these partitions then has to be explicitly added with ALTER TABLE ADD PARTITION and can be removed with ALTER TABLE DROP PARTITION. In addition, the query processor will perform partition elimination on supported predicates.
Currently U-SQL only allows partitioning in conjunction with HASH and ROUND ROBIN distributions.
Note that the types of the columns used to specify partitioning schemes have to be comparable. In particular, this means that the type cannot be a complex type. Additionally, in order to provide precise and deterministic semantics for partitioning on DateTime values, the values used to partition on columns of DateTime types have to have their DateTimeKind set to DateTimeKind.Utc.
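For example (the table name is illustrative, and the full ALTER TABLE syntax is covered in its own article), a partition on a DateTime column is added with a value whose DateTimeKind is Utc:

```sql
// Assumes a table partitioned by a DateTime column.  The partition value
// is a C# expression and must use DateTimeKind.Utc.
ALTER TABLE dbo.Orders
ADD PARTITION (new DateTime(2017, 1, 1, 0, 0, 0, DateTimeKind.Utc));
```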
U-SQL Table Distributions
Every table has to provide a distribution scheme that will partition the data inside the table (or inside each partition) according to the specified scheme.
Currently U-SQL supports four distribution schemes:
RANGE
Based on a set of ordered columns. Tables distributed by RANGE need to provide a list of columns that determine the ranges over which to distribute. The system determines the bounds for the distributions. Each distribution contains all the rows between those bounds, according to the given sort order, with artificial MAX and MIN values closing the first and last distributions. For any distribution P that has a left boundary value L and a right boundary value R, P contains the range from L inclusive to R-1; the value R will be in the next distribution. If the data is ordered in ascending order, then L corresponds to the lower bound and R to the upper bound; if the data is ordered in descending order, then L corresponds to the high value and R to the low value. The column order is significant in determining the distribution, and ASC/DESC optionally specifies the order, defaulting to ascending.
HASH
Based on a set of columns. Tables distributed by HASH require a list of columns that will be used to route each row individually to a distribution based on a hash of the specified columns.
DIRECT HASH
Based on a single column of an integral type. The DIRECT HASH distribution scheme provides direct control of mapping a row into a distribution by using the value in the integer column as the distribution bucket id.
ROUND ROBIN
ROUND ROBIN assigns rows to distributions individually in round-robin fashion, without reference to the values they contain. Each distribution should have approximately the same number of rows.
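As an illustration (the table and column names are hypothetical), a ROUND ROBIN table spreads rows evenly across a fixed number of distributions regardless of the values in the rows:

```sql
// Illustrative table: rows are assigned to the 10 distributions in turn,
// so no column values can cause data skew.
CREATE TABLE dbo.LogLines
(
    LogID long,
    Message string,
    INDEX clx_LogID CLUSTERED(LogID ASC)
)
DISTRIBUTED BY ROUND ROBIN INTO 10;
```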
Note that each distribution scheme has certain advantages depending on the characteristics of the data and the type of queries most frequently run against it. In particular, the scheme should be chosen based on distribution sizing, filter-predicate selectivity, and join comparisons, to reduce data reshuffling in the case of distribution misalignments.
For example, a range distribution is more likely to help if the query is looking for data in a single distribution using either a range or point query, and a hash distribution is beneficial for point queries, but it can introduce data skew if the data is unevenly distributed. Such data skew can hurt query performance if the query needs to read across many distributions. In that case ROUND ROBIN will eliminate most of the data skew and provide better performance.
The INTO clause specifies the number of buckets for the distribution schemes. The value has to be greater than or equal to 2 and less than or equal to 2500. If it is omitted, then U-SQL gives the query processor the flexibility to use as many distribution buckets as appropriate to avoid costly reshuffling of data during the insertion. The query processor chooses the default number of hash distribution buckets depending on the size of the first data to be inserted into the table, in the following way:
U-SQL has a short list of candidate bucket numbers ({ 2, 10, 20, 60, 120, 240, 480 }). It chooses the default from this list based on the estimated data size for the first insert and an upper bound of 2GB on the average distribution size: it uses the smallest number from the list that produces an average distribution size smaller than or equal to the 2GB upper bound. For example, if the first insert is estimated at 100GB, the chosen default is 60, since 100GB / 60 ≈ 1.7GB per distribution, whereas 100GB / 20 = 5GB would exceed the bound.
If the INTO clause is specified, then the data will be distributed into the specified number of buckets.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account are not needed when executing locally.
Basic Syntax
CREATE TABLE dbo.Employees
(
    EmpID int,
    EmpName string,
    DeptID int,
    Salary int?,
    StartDate DateTime,
    INDEX clx_EmpID CLUSTERED(EmpID ASC)
)
DISTRIBUTED BY HASH(EmpID);
Basic Syntax - Alternative Method
CREATE TABLE dbo.Employees
(
    EmpID int,
    EmpName string,
    DeptID int,
    Salary int?,
    StartDate DateTime,
    INDEX clx_EmpID CLUSTERED(EmpID ASC) DISTRIBUTED BY HASH(EmpID)
);
Partitioned Table
This example creates a partitioned table. Data will be distributed by OrderID and CustomerID, and partitioned by OrderDate.
CREATE DATABASE IF NOT EXISTS TestReferenceDB;
USE DATABASE TestReferenceDB;
DROP TABLE IF EXISTS dbo.Orders;
CREATE TABLE dbo.Orders
(
    OrderID int,
    CustomerID int,
    OrderDetailID int,
    OrderTotal double,
    OrderDate DateTime,
    INDEX clx_OrderID_CustomerID CLUSTERED(OrderID, CustomerID ASC)
)
PARTITIONED BY (OrderDate)
DISTRIBUTED BY HASH (OrderID, CustomerID)
INTO 10;