Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining)
Separating data into training and testing sets is an important part of evaluating data mining models. Typically, when you partition a data set into a training set and testing set, most of the data is used for training, and a smaller portion of the data is used for testing. Analysis Services randomly samples the data to help ensure that the testing and training partitions are similar. By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.
After a model has been processed by using the training set, you test the model by making predictions against the test set. Because the data in the testing set already contains known values for the attribute that you want to predict, it is easy to determine whether the model's guesses are correct.
Typically, the predictive accuracy of a mining model is measured by either lift or classification accuracy. For more information about lift charts and other accuracy charts, see Tools for Charting Model Accuracy (Analysis Services - Data Mining).
Creating Partitions for Data Mining Structures
In SQL Server 2008, you partition data at the level of the mining structure. The information about the partition size and the data in each partition is stored with the structure, and all the models that are based on that structure can use the partitions for training and testing.
You can define a partition on a mining structure in the following ways:
Using the Data Mining Wizard to partition a mining structure when you create the mining structure.
Modifying structure properties in the Mining Structure tab of the Data Mining Designer.
Creating and modifying structures programmatically by using Analysis Management Objects (AMO) or XML Data Definition Language (DDL).
Using the Data Mining Wizard to Partition a Mining Structure
By default, after you have defined the data sources for a mining structure, the Data Mining Wizard will divide the data into partitions of 70 percent for training and 30 percent for testing. This is a ratio often used in data mining, but with Analysis Services you can change this ratio to suit your requirements.
You can also configure the wizard to set a maximum number of training cases, or you can combine the limits to allow a maximum percentage of cases up to a specified maximum number of cases. When you specify both a maximum percentage of cases and a maximum number of cases, Analysis Services uses the smaller of the two limits as the size of the test set. For example, if you specify 30 percent holdout for the testing cases, and the maximum number of test cases as 1000, the size of the test set will never exceed 1000 cases. This can be useful if you want to ensure that the size of your test set stays consistent even if more training data is added to the model.
If you use the same data source view for different mining structures, and want to ensure that the data is partitioned in roughly the same way for all mining structures and their models, you should specify the seed that is used to initialize random sampling. When you specify a value for HoldoutSeed, Analysis Services will use that value to begin sampling. Otherwise, sampling uses a hashing algorithm on the name of the mining structure to create the seed value.
Note
If you create a copy of the mining structure by using the EXPORT and IMPORT statements, the new mining structure will have the same partition definition, because the export process creates a new ID but uses the same name. However, if two mining structures use the same underlying data source but have different names, the partitions that are created for each mining structure will be different.
Modifying Structure Properties
If you create and process a mining structure, and then later decide that you want to add a test partition, you can modify the properties of the mining structure. To change the way that data is partitioned, edit the following properties:
Property |
Description |
---|---|
HoldoutMaxCases |
Specifies the maximum number of cases to include in the testing set. |
HoldoutMaxPercent |
Specifies the number of cases to include in the testing set as a percentage of the complete data set. To have no data set, you would specify 0. |
HoldoutSeed |
Specifies an integer value to use as seed when randomly selecting data for the partitions. This value does not affect the number of cases in the training set; instead, it ensures that the partition can be repeated. |
If you add or change a partition to an existing structure, you must reprocess the structure and all associated models. Also, because adding a partition causes the model to be trained on a different subset of the data, you might see different results from your model.
Specifying HOLDOUT Programmatically
You can create a partitioned data mining structure by using DMX statements, AMO, or XML DDL.
DMX In the Data Mining Extensions (DMX) language, the CREATE MINING STRUCTURE statement has been extended to include a WITH HOLDOUT clause. For syntax and examples of the CREATE STRUCTURE statement, see CREATE MINING STRUCTURE (DMX).
Note
The ALTER MINING STRUCTURE statement does not support the use of holdout parameters.
ASSL You can both create new partitioned mining structures and add partitions to existing data mining structures, by using the Analysis Services Scripting Language (ASSL). For more information, see MiningStructure Element (ASSL).
AMO You can also view and modify partitions by using AMO. For more information, see AMO Concepts and Object Model.
You can view information about the partitions in an existing mining structure by querying the data mining schema rowset. You can do this by making a DISCOVER ROWSET call, or you can use a DMX query. For more information, see Data Mining Schema Rowsets or Querying the Data Mining Schema Rowsets (Analysis Services - Data Mining).
Using Partition Information
By default, all information about the training and test partitions is cached, so that you can use existing partitions to train and then test new models. You can also define filters to apply to the cached holdout partitions so that you can evaluate the model on subsets of the data. For more information, see Creating Filters for Mining Models (Analysis Services - Data Mining).
The way that cases are divided into partition depends on the way that you configure holdout, and the data that you provide. If you want to determine the number of cases in each partition, or find details about the cases included in the training and test sets, you can query the model structure by creating a DMX query. For example, the following query returns the cases that were used in the training set of the model.
SELECT * from <structure>.CASES WHERE IsTrainingCase()
To retrieve only the test cases, and additionally filter the test cases on one of the columns in the mining structure, use the following syntax:
SELECT * from <structure>.CASES WHERE IsTestCase() AND <structure column name> = '<value>'
Limitations on the Use of Holdout
To use holdout, the MiningStructureCacheMode property of the mining structure must be set to the default value, KeepTrainingCases. If you change the CacheMode property to ClearAfterProcessing, and then reprocess the mining structure, the partition will be lost.
You cannot use partitions with time series models. Therefore, partitioning is disabled if you create a partition and specify that the Microsoft Time Series algorithm be used to create the model. Partitioning is also disabled if the mining structure contains a KEY TIME column at either the case or nested table level.
You can inadvertently configure the partitions to use the complete data set for testing, and use no data for training. However, Analysis Services will raise an error so that you can correct the problem. Analysis Services also warns you when the structure is processed if more than 50 percent of the data has been held out for testing.
In most cases, the default holdout value of 30 provides a good balance between training and testing data. There is no simple way to determine how large the data set should be to provide sufficient training, or how small the training set should be to avoid overfitting. However, after you have built a model, you can use cross-validation to assess the data set with respect to a particular model. For more information, see Cross-Validation (Analysis Services - Data Mining).
In addition to the properties listed in the previous table, a read-only property, HoldoutActualSize, is provided in AMO and XML DDL. However, because the actual size of a partition cannot be determined accurately until after the structure has been processed, you should check whether the model has been processed before you retrieve the value of the HoldoutActualSize property.