Azure ML Studio: Synthetic Minority Oversampling Technique (SMOTE)

Introduction

In the previous article, we examined the SMOTE (Synthetic Minority Oversampling Technique) algorithm with C#. The algorithm helps balance the classes by generating synthetic samples for the minority class until its size approaches that of the majority class.

That article covered the fine details of the algorithm. As a short recap: each sample in the minority class is visited in turn, and one of its n nearest samples is selected at random. New points are then generated at random positions between these two points, as many as requested.
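The recap above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the previous article and not what the Azure ML Studio module runs internally. The function name `smote_sketch` and its parameters are hypothetical, and it assumes that neighbors are taken from the minority class only and that there are at least two minority samples.

```python
import math
import random


def smote_sketch(minority, n_neighbors=5, n_new_per_sample=1, seed=None):
    """Generate synthetic points by interpolating toward random near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, point in enumerate(minority):
        # Rank all other minority samples by Euclidean distance to this one.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(point, p))
        neighbors = others[:n_neighbors]
        for _ in range(n_new_per_sample):
            # Pick one of the nearest neighbors at random.
            neighbor = rng.choice(neighbors)
            # Random position on the segment between the two points, in [0, 1).
            gap = rng.random()
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(point, neighbor))
            )
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.5)]
new_points = smote_sketch(minority, n_neighbors=2, n_new_per_sample=2, seed=42)
```

Each synthetic point lies on the segment between a real sample and one of its neighbors, so the new points stay inside the region already covered by the minority class.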

Let's see how to do this in Azure ML Studio. To test it, we will start with the "Adult Census Income Binary Classification dataset" from the sample datasets. In this data, one of the two classes has clearly fewer samples than the other.

Description

Label column is where we tell the module which column holds the label. For this dataset, this will be "income". The module finds the minority class automatically. The important point here is that the label must be binary, that is, it must have exactly two classes.

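SMOTE percentage sets how many synthetic samples to generate, expressed as a percentage of the existing minority-class samples. For example, a value of 100 doubles the size of the minority class.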

Number of nearest neighbors sets how many of the closest samples are considered when choosing the neighbor toward which a synthetic point is generated. If this sentence seems unclear, take a look at the previous article.

Random seed controls reproducibility. Because the algorithm includes randomness, it generates different values each time we run it, which makes the results hard to test. A fixed seed value entered here ensures that the same values are produced on every run.
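The effect of a fixed seed can be shown with Python's standard `random` module. This is only an analogy for the module's Random seed setting, not its actual implementation, and the helper `random_point_between` is a hypothetical name used for illustration.

```python
import random


def random_point_between(a, b, rng):
    """Return a random point on the segment from a to b."""
    gap = rng.random()
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


a, b = (0.0, 0.0), (4.0, 2.0)

# Same seed: both runs yield exactly the same "synthetic" point.
first = random_point_between(a, b, random.Random(7))
second = random_point_between(a, b, random.Random(7))

# No seed: the generator is seeded from the operating system,
# so the result normally changes from run to run.
unseeded = random_point_between(a, b, random.Random())
```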

In the label column section, I mentioned that the module only works for binary classes. What if there are more than 2 classes in our data? How can we balance the classes using SMOTE? Many different methods can be applied. For example, we can split the data into groups of 2 classes, pairing each minority class with the most crowded class, and run SMOTE for each group. For this we will use the "Restaurant ratings" data from the datasets that come with Azure ML Studio.
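The grouping idea can be sketched in plain Python before wiring it up in the Studio. This is a minimal illustration on made-up rows, and `pair_with_majority` is a hypothetical helper name. It assumes each row is a dictionary and that the most frequent label is the majority class.

```python
from collections import Counter


def pair_with_majority(rows, label_key):
    """Split rows into binary subsets: each minority class plus the majority class."""
    counts = Counter(row[label_key] for row in rows)
    majority = counts.most_common(1)[0][0]
    return {
        label: [r for r in rows if r[label_key] in (label, majority)]
        for label in counts
        if label != majority
    }


# Toy rows (made-up data): class 2 is the most crowded one.
rows = [{"rating": r} for r in [2, 2, 2, 2, 1, 1, 0, 0, 0]]
groups = pair_with_majority(rows, "rating")
# groups[1] holds classes 1 and 2; groups[0] holds classes 0 and 2.
```

Each subset now has exactly two classes, so a binary-only oversampler can be applied to it separately.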

The scripts, in the order we run them starting from the top left, are as follows. For the upper left corner, keep the rows whose class is not 1.

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Keep only the rows whose rating is not 1.
    mask = (dataframe1["rating"] != 1)
    return dataframe1[mask],

For the upper right corner, keep the rows whose class is not 0.

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Keep only the rows whose rating is not 0.
    mask = (dataframe1["rating"] != 0)
    return dataframe1[mask],

For the bottom left, keep the rows whose class is not 2.

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Keep only the rows whose rating is not 2.
    mask = (dataframe1["rating"] != 2)
    return dataframe1[mask],

 

 Note

 We enter 91 and 15 as the SMOTE percentage in the two SMOTE modules, respectively. The most crowded class, class 2, has 486 samples, and class 1 has 254. The value we need to enter for class 1 is therefore $\left(\frac{486}{254} - 1\right) \times 100 \approx 91$. We apply the same logic to the other class, which contains 421 samples, and find $\left(\frac{486}{421} - 1\right) \times 100 \approx 15$. Finally, we combine the results with the "Add Rows" module and inspect the outcome in a scatter plot.
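The two percentages in the note can be reproduced with a small helper. The function name `smote_percentage` is hypothetical, and rounding to the nearest whole number is an assumption that happens to match the values 91 and 15 used in the article.

```python
def smote_percentage(majority_count, minority_count):
    """Percentage of extra minority samples needed to match the majority class."""
    return round((majority_count / minority_count - 1) * 100)


# Class sizes taken from the note: 486 (class 2), 254 (class 1), 421 (the other class).
pct_for_254 = smote_percentage(486, 254)
pct_for_421 = smote_percentage(486, 421)
```

Because the percentage is rounded, the class sizes after oversampling will be close to 486 but not necessarily identical to it.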