User-defined functions
From: Developing big data solutions on Microsoft Azure HDInsight
You can create user-defined functions (UDFs), and libraries of UDFs, for use with HDInsight queries and transformations. Typically, UDFs are written in Java, and they can be referenced and used in a Hive or Pig script or (less commonly) in custom map/reduce code. You can also write UDFs in Python for use with Pig, but the techniques are different from those described in this topic.
UDFs can be used not only to centralize code for reuse, but also to perform tasks that are difficult (or even impossible) in the Hive and Pig scripting languages. For example, a UDF could perform complex validation of values, concatenation of column values based on complex conditions or formats, aggregation of rows, replacement of specific values with nulls to prevent errors when processing bad records, and much more.
The topics covered here are:
- Creating and using UDFs with Hive
- Creating and using UDFs with Pig
Creating and using UDFs with Hive
In Hive, UDFs are often referred to as plugins. You can create three different types of Hive plugin. A standard UDF is used to perform operations on a single row of data, such as transforming a single input value into a single output value. A user-defined table-generating function (UDTF) is used to perform operations on a single row, but can produce multiple output rows. A user-defined aggregating function (UDAF) is used to perform operations on multiple rows, and it outputs a single row.
Each type of UDF extends a specific Hive class in Java. For example, a standard UDF must extend the built-in Hive class UDF or GenericUDF and accept text string values (which can be column names or specific text strings). As a simple example, you can create a UDF named YourUdfName as shown in the following code.
package yourpackage.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class YourUdfName extends UDF {
  public Text evaluate(final Text s) {
    // Implementation here.
    // Return the computed result; returning null is
    // a safe choice for null or invalid input.
    return null;
  }
}
The body of the UDF, the evaluate function, accepts one or more string values. Hive passes the values of columns in the dataset to these parameters at runtime, and the UDF generates a result. This might be a text string that is returned within the dataset, or a Boolean value if the UDF performs a comparison test against the values in the parameters. The arguments must be types that Hive can serialize.
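As a concrete illustration of this pattern, the following sketch fills in the evaluate body to reverse the characters of the input string (the class name ReverseString is hypothetical, and the example assumes the Hive and Hadoop libraries are on the classpath):

```java
package yourpackage.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example UDF: reverses the characters of the input string.
public final class ReverseString extends UDF {
  public Text evaluate(final Text s) {
    // Returning null for a null input prevents errors when
    // processing bad records.
    if (s == null) {
      return null;
    }
    String reversed = new StringBuilder(s.toString()).reverse().toString();
    return new Text(reversed);
  }
}
```

Because Hive passes each row's column value to evaluate, applying this UDF in a SELECT produces the reversed text for every row in the dataset.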
After you create and compile the UDF, you upload it to HDInsight at the start of the Hive session using a script with the following command.
add jar /path/your-udf-name.jar;
Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as described in Automating cluster management with PowerShell. Then you must register it using a command such as the following.
CREATE TEMPORARY FUNCTION your_function_name
AS 'yourpackage.hive.udf.YourUdfName';
You can then use the UDF in your Hive query or transformation. For example, if the UDF returns a text string value you can use it as shown in the following code to replace the value in the specified column of the dataset with the value generated by the UDF.
SELECT your_function_name(column_name) FROM your_data;
If the UDF performs a simple task such as reversing the characters in a string, the result would be a dataset where the value in the specified column of every row would have its contents reversed.
However, the registration only makes the UDF available for the current session, and you will need to re-register it each time you connect to Hive.
Note
For more information about creating and using standard UDFs in Hive, see HivePlugins on the Apache website. For more information about creating different types of Hive UDF, see User Defined Functions in Hive, Three Little Hive UDFs: Part 1, Three Little Hive UDFs: Part 2, and Three Little Hive UDFs: Part 3 on the Oracle website.
Creating and using UDFs with Pig
You can create different types of UDF for use with Pig. The most common is an evaluation function that extends the class EvalFunc. The function must accept an input of type Tuple and return the required result (which can be null). The following outline shows a simple example.
package yourpackage;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class YourUdfName extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Implementation here.
    // Return the result (or null for invalid input).
    return null;
  }
}
After you create and compile the UDF, you upload it to HDInsight at the start of the Pig session with the following command.
add jar /path/your-udf-name.jar;
Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as described in Automating cluster management with PowerShell. The REGISTER command at the start of a Pig script then makes the UDF available, and you can use it in your Pig queries and transformations. For example, if the UDF returns the lower-cased equivalent of the input string, you can use it as shown in the following query to generate a list of lower-cased equivalents of the text strings in the first column of the input data.
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE yourpackage.YourUdfName(column1);
DUMP B;
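A minimal sketch of the lower-casing UDF described above might look like the following (the class name ToLowerCase is hypothetical, and the example assumes the Pig libraries are on the classpath):

```java
package yourpackage;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical example UDF: returns the lower-cased equivalent of the
// first field of the input tuple.
public class ToLowerCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Returning null for missing input avoids failing the job
    // on bad records.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toLowerCase();
  }
}
```

Note that Pig wraps each function call's arguments in a single Tuple, which is why the exec method unpacks the string from the tuple's first field.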
A second type of UDF in Pig is a filter function that you can use to filter data. A filter function must extend the class FilterFunc (which itself extends EvalFunc<Boolean>), accept its input values as a Tuple, and return a Boolean value. The UDF can then be used to filter rows based on values in a specified column of the dataset. For example, if a UDF named IsShortString returns true for any input value fewer than five characters in length, you could use the following script to remove any rows where the first column has a value fewer than five characters.
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FILTER A BY NOT yourpackage.IsShortString(column1);
DUMP B;
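A sketch of the IsShortString filter function itself might look like the following (again assuming the hypothetical package name yourpackage and the Pig libraries on the classpath):

```java
package yourpackage;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Hypothetical implementation of the IsShortString filter function:
// returns true when the first field of the tuple is a string fewer
// than five characters long.
public class IsShortString extends FilterFunc {
  @Override
  public Boolean exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      // Treat missing values as "not short" so the FILTER ... BY NOT
      // expression does not discard them.
      return false;
    }
    return ((String) input.get(0)).length() < 5;
  }
}
```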
Note
For more information about creating and using UDFs in Pig, see the Pig UDF Manual on the Apache Pig website.