Querying a Logistic Regression Model (Analysis Services - Data Mining)

When you create a query against a data mining model, you can create a content query, which provides details about the patterns discovered in analysis, or you can create a prediction query, which uses the patterns in the model to make predictions using new data.

This section explains how to create queries for models that are based on the Microsoft Logistic Regression algorithm.

Content Queries

Retrieving Model Parameters by Using the Data Mining Schema Rowset

Finding Additional Detail about the Model by Using DMX

Prediction Queries

Making Predictions for a Continuous Value

Making Predictions for a Discrete Value

Finding Information about a Logistic Regression Model

Logistic regression models are created by using the Microsoft Neural Network algorithm with a special set of parameters; therefore, a logistic regression model has some of the same information as a neural networks model, but is less complex. To understand the structure of the model content, and which node types store what kind of information, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).

To follow along in the query scenarios, you can create a logistic regression model as described in the following section of the Intermediate Data Mining Tutorial:.Lesson 5: Building Neural Network and Logistic Regression Models (Intermediate Data Mining Tutorial).

You can also use the mining structure, Targeted Mailing, from the Basic Data Mining Tutorial. Then, add a logistic regression model by running the following DMX script:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM_Logistic Regression]
([Customer Key],
[Age],
[Bike Buyer] PREDICT,
[Yearly Income] PREDICT,
[Commute Distance],
[English Education],
Gender,
[House Owner Flag],
[Marital Status],
[Number Cars Owned],
[Number Children At Home],
[Region],
[Total Children]
)
USING Microsoft_Logistic_Regression

Sample Query 1: Retrieving Model Parameters by Using the Data Mining Schema Rowset

By querying the data mining schema rowset, you can find metadata about the model, such as when it was created, when the model was last processed, the name of the mining structure that the model is based on, and the name of the column used as the predictable attribute. The following example returns the parameters that were used when the model was first created, together with the name and type of the model, and the date that it was created.

SELECT MODEL_NAME, SERVICE_NAME, DATE_CREATED, MINING_PARAMETERS 
FROM $system.DMSCHEMA_MINING_MODELS
WHERE MODEL_NAME = 'Call Center_LR'

Sample results:

MODEL_NAME

SERVICE_NAME

DATE_CREATED

MINING_PARAMETERS

Call Center_LR

Microsoft_Logistic_Regression

04/07/2009 20:38:33

HOLDOUT_PERCENTAGE=30, HOLDOUT_SEED=1, MAXIMUM_INPUT_ATTRIBUTES=255, MAXIMUM_OUTPUT_ATTRIBUTES=255, MAXIMUM_STATES=100, SAMPLE_SIZE=10000

Back to top

Sample Query 2: Finding Additional Detail about the Model by Using DMX

The following query returns some basic information about the logistic regression model. A logistic regression model is similar to a neural network model in many ways, including the presence of a marginal statistic node (NODE_TYPE = 24) that describes the values used as inputs. This example query uses the Targeted Mailing model, and gets the values of all the possible inputs by retrieving them from the nested table, NODE_DISTRIBUTION.

SELECT FLATTENED NODE_DISTRIBUTION AS t
FROM [TM_Logistic Regression].CONTENT 

Partial results:

t.ATTRIBUTE_NAME

t.ATTRIBUTE_VALUE

t.SUPPORT

t.PROBABILITY

t.VARIANCE

t.VALUETYPE

Age

Missing

0

0

0

1

Age

45.43491192

17484

1

126.9544114

3

Bike Buyer

Missing

0

0

0

1

Bike Buyer

0

8869

0.507263784

0

4

Bike Buyer

1

8615

0.492736216

0

4

Commute Distance

Missing

0

0

0

1

Commute Distance

5-10 Miles

3033

0.173472889

0

4

The actual query returns many more rows; however, this sample illustrates the type of information that is provided about the inputs. For example, each possible value for a discrete value is listed in the table, whereas continuous inputs are discretized. For more information about how to use the information in the marginal statistics node, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).

Note

The results have been flattened for easier viewing, but you can return the nested table in a single column if your provider supports hierarchical rowsets. For more information, see Hierarchical Rowsets in the OLE DB Programmer's Guide.

Back to top

Making Predictions from a Logistic Regression Model

You can use the Predict (DMX) function with every kind of mining model to provide new data and make predictions. You can also use functions to return additional information about the prediction, such as the probability that a prediction is correct. This section provides some examples of prediction queries on a logistic regression model.

Sample Query 3: Making Predictions for a Continuous Value

Because logistic regression supports the use of continuous attributes for both input and prediction, it is easy to create models that correlate various factors in your data. You can use prediction queries to explore the relationship among these factors.

The following query sample is based on the Call Center model, and creates a singleton query that predicts service grade for the Friday AM shift. The PredictHistogram (DMX) function returns a nested table that provides statistics relevant to understanding the validity of the predicted value.

SELECT
  Predict([Call Center_LR].[Service Grade]) as Predicted ServiceGrade,
  PredictHistogram([Call Center_LR].[Service Grade]) as [Results],
FROM
  [Call Center_LR]
NATURAL PREDICTION JOIN
(SELECT 'Friday' AS [Day Of Week],
  'AM' AS [Shift]) AS t

Sample results:

Predicted Service Grade

Results

0.102601830123659

Service Grade$SUPPORT$PROBABILITY$ADJUSTEDPROBABILITY$VARIANCE$STDEV
0.10260183012365983.02325581395350.98837209302325600.001205526606000870.034720694203902
0.9767441860465120.01162790697674420.011627906976744200

For more information about the probability, support, and standard deviation values in the nested NODE_DISTRIBUTION table, see Mining Model Content for Logistic Regression Models (Analysis Services - Data Mining).

Back to top

Sample Query 4: Making Predictions for a Discrete Value

Logistic regression is typically used in scenarios where you want to analyze the factors that contribute to a binary outcome. Although the original model used in the Intermediate Tutorial predicts a continuous value, Service Grade, in a real-life scenario you might want to predict whether service grade meet some discrete target value. Alternatively, you could output the predictions using continuous values then group the predicted outcomes into Good, Fair, or Poor.

The following sample illustrates how to change the way that the predictable attribute is grouped. To do this, you make a copy of the mining structure and then change the discretization method of the target column. The following procedure describes how to change the grouping of Service Grade values in the Call Center data.

To create a discretized version of the Call Center mining structure and models

  1. In Business Intelligence Development Studio, in Solution Explorer, expand Mining Structures.

  2. Right-click Call Center.dmm and select Copy.

  3. Right click Mining Structures and select Paste. A new mining structure iss added, named Call Center 1.

  4. Right-click the new mining structure and select Rename. Type the new name, Call Center Discretized.

  5. Double-click the new mining structure to open it in the designer. Notice that the mining models have all been copied as well, and all have the extension 1. Leave the names as is for now.

  6. In the Mining Structure tab, right-click the column for Service Grade, and select Properties.

  7. Change the Content property from Continuous to Discretized. Change the DiscretizationMethod property to Clusters. For Discretization BucketCount, type 3.

    Note

    These parameters are just used for illustrating the process, and do not necessarily produce a valid model,

  8. From the Mining Model menu, select Process structure and all models.

The following sample query is based on this discretized model, and predicts the service grade for the specified day of the week, together with the probabilities for each predicted outcome.

SELECT
  (PredictHistogram([Call Center_LR 1].[Service Grade])) as [Predictions]
FROM
  [Call Center_LR 1]
NATURAL PREDICTION JOIN
(SELECT 'Saturday' AS [Day Of Week]) AS t  

Expected results:

Predictions

Service Grade$SUPPORT$PROBABILITY$ADJUSTEDPROBABILITY$VARIANCE$STDEV
0.1087271838312535.72465047706410.4252934580602870.017016836003029300
0.0585576923062531.70988808007030.3774986676198850.02088202006045400
0.17016949152515.61091598832020.1858442379561920.066138657138604900
0.9545454545454550.01136363636363640.011363636363636400

Note that the predicted outcomes have been grouped into three categories as specified; however, these groupings are based on the clustering of actual values in the data, not arbitrary values that you might set as business goals.

Back to top

List of Prediction Functions

All Microsoft algorithms support a common set of functions. However, the Microsoft Logistic Regression algorithm supports the additional functions listed in the following table.

For a list of the functions that are common to all Microsoft algorithms, see Mapping Functions to Query Types (DMX). For the syntax of specific functions, see Data Mining Extensions (DMX) Function Reference.

Note

For neural network and logistic regression models, the PredictSupport (DMX) function returns a single value that represents the size of the training set for the entire model.