Configuring and debugging solutions
From: Developing big data solutions on Microsoft Azure HDInsight
HDInsight clusters are automatically configured when they are created, and you should not attempt to change the configuration of a cluster itself by editing the cluster configuration files. You can set a range of properties that you require for the cluster when you create it, but in general you will set specific properties for individual jobs at runtime. You may also need to debug your queries and transformations if they fail or do not produce the results you expect.
Runtime job configuration
You may need to change the cluster configuration properties for a job, such as a query or transformation, before you execute it. HDInsight allows configuration property values to be specified at runtime for individual jobs.
As an example, in a Hive query you can use the SET statement to set the value of a property. The following statements at the start of a query set the compression options for that job.
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET hive.exec.compress.intermediate=true;
Alternatively, when executing a Hadoop command, you can use the -D parameter to set property values, as shown here.
hadoop [COMMAND] -D property=value
The space between the “-D” and the property name can be omitted. For more details of the options you can use in a job command, see the Apache Hadoop Commands Manual.
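If you write your own job driver in Java, properties supplied with -D are applied to the job only when the driver parses the generic Hadoop options. The following is a minimal sketch of such a driver that uses the ToolRunner class for this purpose; the class name, job name, and argument handling are illustrative assumptions rather than part of the guide, and the mapper and reducer classes are left as the identity defaults.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PassThroughDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D property=value settings that
        // ToolRunner parsed from the command line.
        Job job = Job.getInstance(getConf(), "pass-through example");
        job.setJarByClass(PassThroughDriver.class);
        // Set your mapper, reducer, and output key/value classes here.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (such as -D) before passing
        // the remaining arguments to run().
        System.exit(ToolRunner.run(new Configuration(), new PassThroughDriver(), args));
    }
}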
Note
For information about how you can set the configuration of a cluster for all jobs, rather than configuring each job at runtime, see Custom cluster management clients.
Debugging and testing
Debugging and testing a Hadoop-based solution is more difficult than debugging and testing a typical local application. Executable applications that run on the local development machine, and web applications that run on a local development web server such as the one built into Visual Studio, can easily be run in debug mode within the integrated development environment. Developers use this technique to step through the code as it executes, view variable values and call stacks, monitor procedure calls, and much more.
None of these functions are available when running code in a remote cluster. However, there are some debugging techniques you can apply. This section contains information that will help you to understand how to go about debugging and testing your solutions:
- Writing out significant values or messages during execution. You can add extra statements or instructions to your scripts or components to display the values of variables, export datasets, write messages, or increment counters at significant points during the execution.
- Obtaining debugging information from log files. You can monitor log files and standard error files for evidence that will help you locate failures or problems.
- Using a single-node local cluster for testing and debugging. Running the solution in a local or remote single-node cluster can help to isolate issues with parallel execution of mappers and reducers.
Hadoop jobs may fail for reasons other than an error in the scripts or code. The two primary reasons are timeouts and unhandled errors due to bad input data. By default, Hadoop will abandon a task (and ultimately the job) if the task does not report its status or perform I/O activity within ten minutes. Most tasks do this automatically, but some processor-intensive operations may take longer.
If a job fails due to bad input data that the map and reduce components cannot handle, you can instruct Hadoop to skip bad records. While this may affect the validity of the output, skipping small volumes of the input data may be acceptable. For more information, see the section “Skipping Bad Records” in the MapReduce Tutorial on the Apache website.
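As an indication of how record skipping can be enabled with the older mapred API, the following is a minimal sketch that uses the SkipBadRecords helper class. The thresholds and the skip output path shown here are illustrative assumptions; check the "Skipping Bad Records" section of the MapReduce Tutorial for the settings appropriate to your Hadoop version.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingJobSetup {

    public static JobConf configure(JobConf conf) {
        // Enter skipping mode only after the same task attempt has failed twice.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Accept losing at most one record per map task while Hadoop narrows
        // down the bad input; larger values converge faster but skip more data.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        // Write the skipped records to HDFS so that they can be inspected later.
        SkipBadRecords.setSkipOutputPath(conf, new Path("/debug/skipped"));
        return conf;
    }
}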
Writing out significant values or messages during execution
A traditional debugging technique is simply to write out values as a program executes in order to indicate progress. This allows developers to check that the program is executing as expected, and helps to isolate errors in the code. You can use this technique in HDInsight in several ways:
- If you are using Hive you might be able to split a complex script into separate simpler scripts, and display the intermediate datasets to help locate the source of the error.
- If you are using a Pig script you can:
- Dump messages and/or intermediate datasets generated by the script to disk before executing the next command. This can indicate where the error occurs, and provide a sample of the data for you to examine.
- Call methods of the EvalFunc class that is the base class for most evaluation functions in Pig. You can use this approach to generate heartbeat and progress messages that prevent timeouts during execution, and to write to the standard log file (the first sketch following this list illustrates this). See Class EvalFunc<T> on the Pig website for more information.
- If you are using custom map and reduce components you can write debugging messages to the standard output file from the mapper and then aggregate them in the reducer, or generate status messages from within the components (the second sketch following this list shows one approach, together with the input validation described in the next item). See the Reporter class on the Apache website for more information.
- If the problem arises only occasionally, or on only one node, it may be due to bad input data. If skipping bad records is not appropriate, add code to your mapper class that validates the input data and reports any errors encountered when attempting to parse or manipulate it. Write the details and an extract of the data that caused the problem to the standard error file.
- You can send a QUIT signal to a task process while it is running in order to view the call stack and other useful information such as deadlocks. To do this, execute the command kill -QUIT [process_id] against the JVM process for the task attempt; the signal produces a thread dump without terminating the process. You can identify the task attempts for a job in the Hadoop YARN Status portal. The debugging information is written to the standard output (stdout) file.
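The Pig EvalFunc option described above might look something like the following minimal sketch of a hypothetical UDF; the class name, field handling, and transformation are assumptions for illustration only.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that performs a slow transformation on a single field.
public class SlowTransform extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Send a heartbeat so that a long-running call does not cause the
        // task to be timed out, and write a trace message to the task log.
        progress();
        log.info("SlowTransform processing: " + input.get(0));

        return doExpensiveWork(input.get(0).toString());
    }

    // Placeholder for the processor-intensive work this UDF performs.
    private String doExpensiveWork(String value) {
        return value.trim().toLowerCase();
    }
}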
Note
For information about the Hadoop YARN Status portal, see Monitoring and logging.
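The second sketch, below, indicates how a custom mapper might combine the techniques described in the list above: it validates each input record, writes details of bad records to the standard error file, increments counters, and updates the task status so that the results are visible in the Hadoop YARN Status portal. The input format, counter names, and parsing logic are illustrative assumptions only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that expects lines of the form "key<TAB>numeric value".
public class ValidatingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length != 2) {
            // Record the problem in the standard error file with an extract of
            // the offending data, and count it so that the total appears in the
            // job counters shown in the Hadoop YARN Status portal.
            System.err.println("Malformed record at offset " + offset + ": " + line);
            context.getCounter("DebugCounters", "MalformedRecords").increment(1);
            return;
        }
        try {
            int value = Integer.parseInt(fields[1].trim());
            context.write(new Text(fields[0]), new IntWritable(value));
        } catch (NumberFormatException e) {
            System.err.println("Non-numeric value at offset " + offset + ": " + fields[1]);
            context.getCounter("DebugCounters", "NonNumericValues").increment(1);
        }
        // Update the task status message as a simple progress indicator.
        context.setStatus("Processed offset " + offset);
    }
}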
Obtaining debugging information from log files
The core Hadoop engine in HDInsight generates a range of information in log files, counters, and status messages that is useful for debugging and testing the performance of your solutions. Much of this information is accessible through the Hadoop YARN Status portal.
The following list contains some suggestions to help you obtain debugging information from HDInsight:
- Use the Applications section of the Hadoop YARN Status portal to view the status of jobs. Select FAILED in the menu to see failed jobs. Select the History link for a job to see more details. In the details page are menu links to show the job counters, and details of each map and reduce task. The Task Details page shows the errors, and provides links to the log files and the values of custom and built-in counters.
- View the history, job configuration, syslog, and other log files. The Tools section of the Hadoop YARN Status portal contains a link that opens the log files folder where you can view the logs, and also a link to view the current configuration of the cluster. In addition, see Monitoring and logging in the section Building end-to-end solutions using HDInsight.
- Run a debug information script automatically to analyze the contents of the standard error, standard output, job configuration, and syslog files. For more information about running debug scripts, see How to Debug Map/Reduce Programs and Debugging in the MapReduce Tutorial on the Apache website.
Using a single-node local cluster for testing and debugging
Performing runtime debugging and single-stepping through the code in map and reduce components is not possible within the Hadoop environment. If you want to perform this type of debugging, or run unit tests on components, you can create an application that executes the components locally, outside of Hadoop, in a development environment that supports debugging. You can then use a mocking framework and test runner to perform unit tests.
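One possible choice, offered here only as an illustration, is the Apache MRUnit test harness, which runs a mapper or reducer in-process so that an ordinary unit test can drive it. The following minimal sketch assumes a hypothetical WordCountMapper that emits a (word, 1) pair for each word in a line of text.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void emitsOneCountPerWord() throws IOException {
        // WordCountMapper is the hypothetical component under test.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new WordCountMapper());

        // Supply one input record and assert the expected output pairs.
        driver.withInput(new LongWritable(0), new Text("big data"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .runTest();
    }
}
Because the driver supplies the input and asserts the expected output, the same pattern also works for reducers (ReduceDriver) and for complete map and reduce pipelines (MapReduceDriver).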
Sometimes errors occur only because of the parallel execution of multiple map and reduce components on a multi-node cluster. Consider emulating distributed testing by running multiple jobs concurrently on a single-node cluster to detect these errors, and then expanding this approach to run multiple jobs concurrently on clusters containing more than one node in order to help isolate the issue.
You can create a single-node HDInsight cluster in Azure by specifying the advanced option when creating the cluster. Alternatively you can install a single-node development environment on your local computer and execute the solution there. A single-node local development environment for Hadoop-based solutions that is useful for initial development, proof of concept, and testing is available from Hortonworks. For more details, see Hortonworks Sandbox.
By using a single-node local cluster you can rerun failed jobs and adjust the input data, or use smaller datasets, to help you isolate the problem. To rerun a failed task, go to the \taskTracker\task-id\work folder and execute the command hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml. This runs the failed task in a single Java virtual machine over the same input data.
You can also turn on debugging in the Java virtual machine to monitor execution. More details of configuring the parameters of a Java virtual machine can be found on several websites, including Java virtual machine settings on the IBM website.
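As an indication of the kind of setting involved, the following sketch adds remote-debugging arguments to the options used to launch map task JVMs. The property name varies between Hadoop versions (mapred.child.java.opts in earlier releases, mapreduce.map.java.opts and mapreduce.reduce.java.opts in later ones), and both the name and the JDWP argument string shown here are assumptions to verify against your own cluster.
import org.apache.hadoop.conf.Configuration;

public class DebugJvmOptions {

    public static Configuration configure(Configuration conf) {
        // Ask each map task JVM to listen for a remote debugger on port 8000
        // and wait until one attaches before running the task. Use this only
        // on a single-node test cluster, and remember that a suspended task
        // will be abandoned if it exceeds the ten-minute timeout described
        // earlier unless the timeout is increased.
        conf.set("mapreduce.map.java.opts",
                 "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
        return conf;
    }
}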