Basket analysis with SQL Data Warehouse
Stefan Cronjaeger, Technical Solution Professional
Michael Hlobil, Architect, Data Insights Global Practice
Shopping basket analysis typically asks questions like: "If a customer bought product A what might she also be interested in?". This is typically used for recommendation engines or for arrangement of products in the shelves of the shop.
In order to answer these questions, typical shop transactions have to be analyzed, products and product pairs have to be counted, how often they are bought together versus how often they are not found in the same shopping basket. This analysis has to be performed on large data sets for many products. This typically leads to memory issues and runtime problems.
Programming environment like R have packages (e.g., arules with the apriori functionality) for this type of analysis, but this often leads to long runtimes and is restricted by the available memory. This may lead to aborted jobs when running such an analysis on large data sets in R.
Here we show such an analysis in a Microsoft's Azure-based SQL Datawarehouse for data from Adventureworks. The SQL Datawarehouse is optimized for scanning large data sets and is based on a MPP architecture which can perform this kind of analysis in parallel.
Some concepts
Before starting the analysis, we will introduce some concepts for basket analysis: Support, Confidence and Lift.
- Support: The support of a product or a product pair is the fraction of transactions which contains the product. As an example: If milk is bought in 10% of the transactions (contained in 10% of the shopping baskets), the support of milk is 0.1
- Lift: Let us take the example of milk and bread. If milk is bought in 10% of the transactions and bread is bought in 10% of the transactions (both have the support of 0.1), the probability that they are both in the same shopping basket is the product of the probabilities for the single products. So the expected support would be 0.1 * 0.1 = 0.01, i.e., it could be expected that they are found together in 1 percent of the shopping baskets. This is only true if the probabilities are independent. So if the customers typically buy them together, they will have a larger support, if the customer avoid them together, the support will be smaller than 1%. The lift is the the real support divided by the product of the single supports. So if milk AND bread are found in 2% of the shopping baskets, the lift would be 2 (0.02/0.01). So a lift > 1 is an indication that products are bought together.
- Confidence: The confidence is conditional probability. An example: If milk is bought in 10% of the transactions and cream in 2% of the transactions, but cream always bought together with milk, we have the following confidences: If the customer buys milk, there is a 20% confidence that she also buys cream (in 20% of the shopping baskets with milk she will put cream). In the other direction: If the customer buys cream, there is a 100% confidence that she also buys milk (milk is always found in the shopping baskets with cream).
The Analysis
In dbo.FactInternet sales there are 60398 transactions which contain 158 products. In large Data Warehouses you may find billions of transactions for thousands of products. Therefore, we should create a smaller table for the 100 most important products. Here we define them as the products most sold. They are created in a new flat de-normalized table which also contains the product subcategory and product category:
CREATE TABLE dbo.importantProducts
WITH(DISTRIBUTION = Round_Robin)
AS
select Top 100 count(*) as nTrans, S.ProductKey, PS.ProductSubcategoryKey, PC.ProductCategoryKey, P.EnglishDescription, Ps.EnglishProductSubcategoryName,PC.EnglishProductCategoryName
from dbo.FactInternetSales S
INNER JOIN dbo.dimProduct P on S.ProductKey=P.ProductKey
INNER JOIN dbo.dimProductSubcategory PS on P.ProductSubcategoryKey = PS.ProductSubcategoryKey
INNER JOIN dbo.dimProductCategory PC on PS.ProductCategoryKey = PC.ProductCategoryKey
group by S.ProductKey, PS.ProductSubcategoryKey, PC.ProductCategoryKey, P.EnglishDescription, Ps.EnglishProductSubcategoryName,PC.EnglishProductCategoryName
order by count(*) DESC
The created table is shown below:
Later on we have to look into the shopping baskets which contain these important products. They are only of interest if they contain at least 2 products in the basket. So we create an intermediate table with large transactions. In dbo.FactInternetSales the shopping baskets are identified by the SalesOrderNumber:
CREATE TABLE dbo.largeTransactions
WITH(DISTRIBUTION = Round_Robin)
AS
select SalesOrderNumber, count(*) as nProducts
from dbo.FactInternetSales
group by SalesOrderNumber
having count(*) > 1
And use this intermediate table to reduce the fact table to transactions with more than 1 product in the basket and having important products:
-- keep only the transactions with important products and nProducts > 1
CREATE TABLE dbo.FactSalesKept
WITH(DISTRIBUTION = hash(SalesOrderNumber))
AS
SELECT s.ProductKey, p.nTrans, s.SalesOrderNumber
FROM [dbo].FactInternetSales s
INNER JOIN [dbo].[importantProducts] p
ON s.ProductKey = p.ProductKey
INNER JOIN [dbo].largeTransactions t
ON s.SalesOrderNumber = t.SalesOrderNumber
GO
Now we have created a table which only contains transactions with 2 or more products in the shopping basket and one of the products is an important product.
Now we have the question how to define the support: The number of shopping baskets containing the product divided by all shopping baskets or divided by the interesting shopping baskets. We take the second alternative. Since we have to divide, we declare the variable as decimal
declare @count decimal(12,5) -- the number of shopping baskets
set @count = (select count (distinct SalesOrderNumber) from dbo.FactSalesKept)
select @count -- 16518.00000
CREATE TABLE dbo.importantProductsNewCount
WITH(DISTRIBUTION = Round_Robin)
AS
select Top 100 count(*) as nTrans, S.ProductKey, PS.ProductSubcategoryKey, PC.ProductCategoryKey, P.EnglishDescription, Ps.EnglishProductSubcategoryName,PC.EnglishProductCategoryName
from dbo.FactInternetSales S
INNER JOIN dbo.dimProduct P on S.ProductKey=P.ProductKey
INNER JOIN dbo.dimProductSubcategory PS on P.ProductSubcategoryKey = PS.ProductSubcategoryKey
INNER JOIN dbo.dimProductCategory PC on PS.ProductCategoryKey = PC.ProductCategoryKey
group by S.ProductKey, PS.ProductSubcategoryKey, PC.ProductCategoryKey, P.EnglishDescription, Ps.EnglishProductSubcategoryName,PC.EnglishProductCategoryName
order by count(*) DESC
The result:
So the products with the biggest support are water bottles, patches, tubes.
We can create a table with product pairs and their expected support:
CREATE TABLE dbo.Pairs
WITH(DISTRIBUTION = Round_Robin)
AS
select P1.ProductKey as PK1, P1.ProductSubcategoryKey as PSK1, P1.EnglishDescription as Desc1, P1.EnglishProductSubcategoryName as PSDesc1, P1.Support as Supp1,
P2.ProductKey as PK2, P2.ProductSubcategoryKey as PSK2, P2.EnglishDescription as Desc2, P2.EnglishProductSubcategoryName as PSDesc2, P2.Support as Supp2,
P1.Support*P2.support as expectedSupport
from dbo.ProductsSupport P1
INNER join dbo.ProductsSupport P2
ON P1.ProductKey < P2.ProductKey
Until this point, the queries basically counted and joined one large fact table with smaller dimensions. The following query is the critical one, and the reason why we reduced the product table and the fact table. It joins the fact table with itself and creates a table with product pairs:
CREATE TABLE dbo.PairSupport
WITH(DISTRIBUTION = Round_Robin)
AS
select P.PK1, P.PK2, P.expectedSupport, count(*) as Support, count(*)/@count/P.expectedSupport as Lift, P.Desc1, P.Desc2, P.PSK1, P.PSK2, count(*)/Supp1/@count as Conf2, count(*)/Supp2/@count as Conf1, Supp1, Supp2
from dbo.Pairs P
INNER JOIN dbo.FactSalesKept F1 on F1.ProductKey = P.PK1
INNER JOIN dbo.FactSalesKept F2 on F2.ProductKey = P.PK2 AND F1.SalesOrderNumber = F2.SalesOrderNumber
group by PK1, PK2, expectedSupport, PSK1, PSK2, Supp1, Supp2, P.Desc1, P.Desc2
This table contains 117 rows (out of 4950 product pairs) because not all pairs are in some shopping basket. In real cases it will contain many more rows.
It may be queried:
Which products are bought together:
select * from dbo.PairSupport
order by Lift desc
So it is tubes and rubber and other pairs. They have a lift of 10. Since products of the same category are quite often located near to each other in a store (of course, we here look on Internet Sales), it may be interesting to look for pairs in different product categories or sub-categories:
select * from dbo.PairSupport
where PSK1 <> PSK2
order by Lift desc
Here it is patches and some washing, or bottle cage and clip-on fenders.