Extractor Parameters (U-SQL)

Summary

Parameters
 ● delimiter
 ● encoding
 ● escapeCharacter
 ● nullEscape
 ● quoting
 ● rowDelimiter
 ● silent
 ● skipFirstNRows
 ● charFormat

The supported parameters and their defaults are:

Parameter name Parameter type Default value
delimiter char ',' (comma)

This parameter specifies the column separator character that separates columns in the file. The default column separator for is a comma (','). It can be any valid Unicode character including those that are represented with multi-byte encodings in any of the Unicode-Transfer-Formats (i.e. UTF-*). The delimiter parameter is not available for Extractors.Csv() and Extractors.Tsv().

Important

If the quoting parameter is set to false, the delimiter character inside a quoted string is being used as a column separator and may lead to incorrect or failing extractions.


Parameter name Parameter type Default Value
encoding System.Text.Encoding Encoding.UTF8

Per default, files are assumed to be stored in UTF-8 encoding. However, some files may be stored using a different encoding. The encoding parameter provides the option to specify the file’s actual encoding and also translate non-UTF-8 encoded files.

The supported encodings are:

Encoding Property Description
Encoding.[ASCII] Assumes that the file is encoded using the ASCII (7-bit) character set. Since ASCII is an all uppercase keyword, it needs to be quoted to not conflict with U-SQL’s reserved keyword rule.
Encoding.BigEndianUnicode Assumes that the file is encoded using the UTF-16 format that uses the big endian byte order.
Encoding.Unicode Assumes that the file is encoded using the UTF-16 format using the little endian byte order.
Encoding.UTF7 Assumes that the file is encoded using the UTF-7 format.
Encoding.UTF8 Assumes that the file is encoded using the UTF-8 format. This is the default.
Encoding.UTF32 Assumes that the file is encoded using the UTF-32 format using the little endian byte order.

Important

System.Text.Encoding.Default is not supported, since it would use the processing node’s operating system’s current ANSI code page that cannot be controlled by the user and is not the same as the default of the extractor. For more details see System.Text.Encoding.

If the file contains a code point that is invalid for the specified encoding, a runtime error during extraction will occur regardless of the silent parameter setting that will indicate the offending data value.


Parameter name Parameter type Default value
escapeCharacter char? null

The escapeCharacter parameter – if not set to null – specifies the character in the file that is used to escape itself and all delimiter values in the file.

If the escape character is followed by a value other than itself or any of the delimiter values, the escape character is dropped when reading the value.

If it set to null, there is no escape character specified.

The escapeCharacter parameter will be applied regarding of whether quoting is enabled or not. It however will not be used to escape the quoting character. The quoting character will get escaped with double-quotes in alignment with the Excel CSV behaviour.


Parameter name Parameter type Default value
nullEscape string null

The nullEscape parameter specifies the string in a column that is used to represent the null value. It will be applied regardless of whether quoting is enabled or not.

If the textual input of the field is not exactly equal the nullEscape string sequence, then the string sequence will not be interpreted as a null value and it will be treated like any other input string.

If the textual input is the zero-length string and the target type is a nullable non-string type, then it will also be interpreted as NULL.

If nullEscape is set to null (the default), no special null value is being interpreted. A zero-length textual representation (before removing quotes) will be interpreted as a zero-length string for target type string, and null for all non-string nullable types. A zero-length string with quotes will be a zero-length string for both type categories.

If the target type of the extraction is not a nullable type and the value is interpreted as null, an E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_NULL_ERROR error will be raised.

The following table gives some examples on how different inputs are being interpreted based on the target types for the fields (using int? as the representative type for nullable non-string types) and the value for null_escape (ERR indicates a runtime conversion error) using the row value

1,,"",\N,#NULL#,"Some \N fun"

nullEscape String int?
"\N" "1","","",null,"#NULL#","Some \N fun" 1,null,ERR,null,ERR, ERR
"" "1","",null,"\N","#NULL#","Some \N fun" 1,null,null,ERR,ERR,ERR
null "1",null,"","\N","#NULL#","Some \N fun" 1,null,ERR,ERR,ERR,ERR
"#NULL#" "1","","","\N","#NULL#","Some \N fun" 1,null,ERR,ERR,null,ERR

Parameter name Parameter type Default value
quoting bool true

The quoting parameter if set to true indicates that the extractor should consider " (double-quote) as a column field quotation which inhibits the interpretation of the column delimiters inside the quoted field. A double-quote inside a quoted field needs to be escaped by doubling it as "" in the file.

Important

Due to the parallel processing and splitting of input files, quoting cannot inhibit the interpretation of the row delimiters. Thus for example an input Excel CSV file that contains an unescaped carriage return or linefeed will cause the extraction to fail even if these delimiters are quoted. They will have to be escaped. Currently no other quoting characters are supported.

Tip

If the input data does not use quoting, specifying false for the quoting parameter will be resulting in a faster extraction.


Parameter name Parameter type Default values
rowDelimiter string (max length=1) "\r\n" (carriage return, linefeed)
"\r" (carriage return)
"\n" (linefeed)

This parameter specifies the row separator character sequence that separates rows in the file. If the rowDelimiter is set to null, the default values are being used which are carriage return followed by linefeed, or carriage return or linefeed.

If the length of the rowDelimiter is more than one UCS-4 characters, an error is raised**.**

The comparison of the rowDelimiter will be done using a byte-wise comparison after applying the encoding got applied.

Note that the rowDelimiter character inside a quoted string will not be escaped and will be used as a row separator which will lead to incorrect or failing extractions.


Parameter name Parameter type Default Value
silent bool false

This parameter tells the extractor to ignore and skip rows that have a different number of columns than the requested number of columns. It also replaces corrupt column data with null when the datatype is nullable (for example, it replaces "int?" with null when there is a non-digit character in the column). If the column type is not nullable, an error is raised.

Rows are checked for the correct number of columns first. For example, if there is corrupt data in a column with the wrong number of rows, the extractor skips the row. There is no conversion. If the row has the correct number of columns, however, the extractor doesn't skip it. The conversion occurs (assuming that the column's type is nullable).

Important

silent does not ignore encoding errors since such errors often indicate that the file encoding and the specified encoding are not compatible and thus could lead to more severe data corruption.


Parameter name Parameter type Default Value
skipFirstNRows Int 0

The parameter specifies the number of rows to skip. The rows being skipped do not need to conform to the column schema in either type or count of columns. Note, you can only skip rows that are in the first segment of a file; otherwise, an error will be raised.


Parameter name Parameter type Default Value
charFormat string "uint16"

The supported values are:

Value Description
uint16 or null This is the default value. Serializes the char value as an integral number (taking all other serialization options into account) and parses the input as the integral character code number to the corresponding character or errors if the input is not an integral character code (or mapable null if extracting it as char?).
string Serializes the char value in its Unicode string representation (taking all other serialization options including encoding into account) and parses the input as the character codepoint using the specified encoding.

See Also