SAMPLE (U-SQL)
Summary
U-SQL supports samplers natively. The SAMPLE
clause enables convenient access from within a SELECT
expression. All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).
Syntax
Sample_Expression_Simple := Query_Expression 'SAMPLE' ( 'ANY' '(' row_count ')' | 'UNIFORM' '(' row_fraction ')' ).
Remarks
ANY
Anyrow_count
rows from the input rowset.ANY
does not return a randomly chosen subset of rows; do not use where randomness is needed.UNIFORM
Rows are picked uniformly at random with probability equal torow_fraction
. The size of the output is governed by a binomial distribution. In expectation, the size of output isrow_fraction
* SIZE(input rowset).row_fraction
A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.
Tip
Ensure your sample probability, row_fraction
, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.
Examples
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable,
@data
.
Dataset
@data =
SELECT * FROM
( VALUES
("02143", 30, 100, "Smith"),
("98052", 25, 50, "Andersen"),
("02139", 30, 75, "Todd"),
("02139", 25, 60, "Roberts"),
("61801", 23, 80, "Sanders")
) AS T(ZipCode, Age, Salary, LastName);
ANY
Samples the @data
rowset by selecting the specified number of rows (2) in a completely arbitrary way.
@result =
SELECT * FROM @data SAMPLE ANY (2);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/any.txt"
USING Outputters.Tsv();
Possible output below.
ZipCode | Age | Salary | LastName |
---|---|---|---|
02143 | 30 | 100 | Smith |
98052 | 25 | 50 | Andersen |
UNIFORM
Samples the @data
rowset using a random uniform sampling with the provided probability (10%).
@result =
SELECT * FROM @data SAMPLE UNIFORM (0.1);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/simple_uniform.txt"
USING Outputters.Tsv();
Possible output below. The likelihood of this output is 0.00729 (two rows are picked with a probability of 0.1 each and three rows not picked with a probability of 0.9 each). Note: An empty result set may be returned in this example due to the low probability and small dataset.
ZipCode | Age | Salary | LastName |
---|---|---|---|
02143 | 30 | 100 | Smith |
98052 | 25 | 50 | Andersen |