SAMPLE Expression (U-SQL)
Summary
USQL supports samplers natively. The sample expression is needed to access some types of samplers.
Given an input rowset and arguments that are specific to the types of sampler being invoked, the sample expression outputs a rowset that is a statistical sample of the input. Moreover, the sample expression takes as input an optional identifier (column) that records the probability with which the corresponding row has been output.
All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).
Syntax
Sample_Expression_Complex := 'SAMPLE' Rowset_Source Sampler_Details [Weight_Col]
Sampler_Details := 'UNIFORM' '(' row_fraction ')' | 'ON' Identifier_List 'UNIVERSE' '(' row_fraction ')' | 'ON' Identifier_List 'DISTINCT' '(' row_fraction ',' min_row_count ')'
Weight_Col := 'WITH' 'WEIGHT' 'AS' Identifier.
Remarks
UNIFORM
Rows are picked uniformly at random with probability equal torow_fraction
. The weight column, if requested, is set to 1/(row_fraction
) for all rows. The size of the output is governed by a binomial distribution. In expectation, the size of output isrow_fraction
* SIZE(input rowset).row_fraction
A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.Identifier_List
This option specifies the list of columns that define the groups.
Syntax
Identifier_List := Quoted_or_Unquoted_Identifier {',' Quoted_or_Unquoted_Identifier }.
UNIVERSE
All rows are passed where the group value (i.e., the value of the columns inIdentifier_List
) is in some randomly chosenrow_fraction
of the space of group values.UNIVERSE
uses a cryptographically strong hash function to pick a random portion of the values. The weight column, if requested, is set to 1/(row_fraction
) for all rows.UNIVERSE
ensures sample-then-join is equivalent to join-then-sample: UsingUNIVERSE
before an equijoin withIdentifier_List
, on both inputs of the equijoin, containing exactly the equijoin columns is identical to sampling after the join.The size of the output is also in expectation
row_fraction
*SIZE(input rowset). However, especially if there are too few groups, the output size has more variance than withUNIFORM
since all rows from a group are passed by the sampler or not.DISTINCT
Per group (a distinct value of the columns inIdentifier_List
), this sampler passesmin_row_count
rows and the rest of the rows are passed with probabilityrow_fraction
. The weight column, if requested, is set to 1/(row_fraction
) if the row is passed in the probabilistic mode and 1 otherwise.DISTINCT
facilitates sample-before-groupby: UsingDISTINCT
before a groupby, withIdentifier_List
containing at least the group in the group-by, guarantees that the sample will not miss any groups.DISTINCT
may return fewer thanmin_row_count
rows (but never less than 1) for some groups. A simple case is when a group has fewer thanmin_row_count
rows in the input. A more complex case occurs depending on the degree of parallelism of the stage that runs the sampler and how the rows corresponding to the group are distributed among the input partitions.
Tip
Ensure your sample probability, row_fraction
, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable,
@data
.
Dataset
@data =
SELECT * FROM
( VALUES
("02143", 30, 100, "Smith"),
("98052", 25, 50, "Andersen"),
("02139", 30, 75, "Todd"),
("02139", 25, 60, "Roberts"),
("61801", 23, 80, "Sanders")
) AS T(ZipCode, Age, Salary, LastName);
UNIFORM
@result =
SAMPLE @data UNIFORM (0.4) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/uniform.txt"
USING Outputters.Tsv();
Possible output below. Note that the output need not contain every ZipCode
or all rows that have a given Age
value. The likelihood of seeing this output is 0.03456.
ZipCode | Age | Salary | LastName | Weight |
---|---|---|---|---|
02143 | 30 | 100 | Smith | 2.5 |
98052 | 25 | 50 | Andersen | 2.5 |
UNIVERSE
Samples the @data
rowset by picking some randomly chosen fraction of the overall value-space of the ON columns and returning all rows whose value of the ON columns belongs in the chosen space. The query implicitly picks 10% of all the values of the Age
columns and returns all rows whose value of Age
was randomly chosen.
@result =
SAMPLE @data ON Age UNIVERSE (0.1);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/universe.txt"
USING Outputters.Tsv();
Possible output below. Note that all rows with Age
=30 will be picked if any of the Age
=30 rows are picked. Similar all rows with Age
=25 will either be picked or not be picked (as in this example).
The likelihood of seeing this output is 0.081; this is one of the three outcomes that have the same probability after no-output which occurs with a likelihood of 0.729 since there are only three unique values for Age
. Note: An empty result set may be returned in this example due to the low probability and small dataset.
ZipCode | Age | Salary | LastName |
---|---|---|---|
02143 | 30 | 100 | Smith |
02139 | 30 | 75 | Todd |
DISTINCT
Samples the @data
rowset such that at least 3 rows per distinct value of the ZipCode
column are included, and additional rows are added with the indicated probability (10%). The weight of each passing row is added to the resulting rowset in the weight column _Weight
.
@result =
SAMPLE @data ON ZipCode DISTINCT (0.1, 3) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/distinct.txt"
USING Outputters.Tsv();
Possible output below. Note that every ZipCode
is represented in the output by up to 3 rows, if as many rows are available in the input. The likelihood of seeing this output is 1.
ZipCode | Age | Salary | LastName | Weight |
---|---|---|---|---|
02143 | 30 | 100 | Smith | 1 |
98052 | 25 | 50 | Andersen | 1 |
02139 | 30 | 75 | Todd | 1 |
02139 | 25 | 60 | Roberts | 1 |
61801 | 23 | 80 | Sanders | 1 |