Association Rule Learning in SQL

Nancy Dong

Feb. 21, 2019


1. No. of Possible Two-Way Combinations

Steps:

  1. Calculates the number of distinct products in each order
  2. Calculates the number of possible two-way product combinations in each order
  3. Sum up the number of possible combinations in each order across all orders

SELECT SUM(numprods * (numprods - 1) / 2) AS num_pair_combo
FROM
  (SELECT orderid,
          COUNT(DISTINCT productid) AS numprods
   FROM orderline
   GROUP BY orderid) numprods_table


num_pair_combo
185791.0000

2. Generating All Two-Way Combinations

2.1 Step 1

Select only unique combinations of OrderID and ProductID from the orderline table.


SELECT DISTINCT OrderId,
                ProductId
FROM orderline

LIMIT 5;
OrderId ProductId
1006414 10834
1006541 11052
1006542 11070
1010154 11196
1009110 11048

2.2 Step 2

  • Two identical OrderID-ProductID combination tables (with the alias op) are joined together to generate all pairs of products within each order
  • The requirement op1.ProductId < op2.ProductId ensure that:
    • No pair contains two identical products
    • No two pairs have the same products

WITH op as (SELECT DISTINCT OrderId, ProductId FROM OrderLine)
SELECT op1.OrderId,
       op1.ProductId AS product_ID_1,
       op2.ProductId AS product_ID_2
   FROM op op1
   JOIN op op2
   ON op1.OrderId = op2.OrderId AND op1.ProductId < op2.ProductId 
LIMIT 5;
OrderId product_ID_1 product_ID_2
1006414 10834 10842
1009110 11048 11196
1010190 11048 11196
1008577 10842 11197
1008588 10837 10842

2.3 Step 3

Count the frequency of the appearance of each pair of products.


SELECT product_ID_1,
       product_ID_2,
       COUNT(*) AS numorders
FROM
  ( 

    WITH op as (SELECT DISTINCT OrderId, ProductId FROM OrderLine)

    SELECT op1.OrderId,
       op1.ProductId AS product_ID_1,
       op2.ProductId AS product_ID_2
    FROM op op1 
    JOIN op op2
    ON op1.OrderId = op2.OrderId AND op1.ProductId < op2.ProductId 
    ORDER BY orderid 

   ) combinations

GROUP BY product_ID_1,
         product_ID_2

LIMIT 5;
product_ID_1 product_ID_2 numorders
10834 10842 63
11048 11196 1824
10842 11197 15
10837 10842 18
10837 10930 16

2.4 Variation: Restrict Order Size

Sometimes, unusually small or large orders should be excluded from the analysis. Here, the order size is restricted to 2-10 products.


WITH op AS (SELECT DISTINCT orderid, productid
            FROM orderline)
SELECT product_id_1,
       product_id_2,
       COUNT(*) AS numorders
FROM (SELECT op1.orderid,
             op1.productid AS product_id_1,
             op2.productid AS product_id_2
      FROM
        (SELECT orderid
         FROM orderline
         GROUP BY orderid
         HAVING COUNT(DISTINCT orderlineid) BETWEEN 2 AND 10) filter
      JOIN op op1
          ON filter.orderid = op1.orderid
      JOIN op op2
          ON op1.orderid = op2.orderid AND op1.productid < op2.productid) combinations
GROUP BY product_id_1,
         product_id_2
ORDER BY numorders DESC

LIMIT 5;
product_id_1 product_id_2 numorders
12820 13190 2580
12819 12820 1839
11048 11196 1822
10956 12139 1481
12139 12820 1239

3. Examples of Item Sets

This is the same table as in section 2.3, joined with the product table to obtain product group name information for each product.

SELECT product_id_1,
       product_id_2,
       productgroup_1,
       productgroup_2,
       numorders
FROM (SELECT product_id_1,
             product_id_2,
             COUNT(*) AS numorders
      FROM (SELECT op1.orderid,
                   op1.productid AS product_id_1,
                   op2.productid AS product_id_2
            FROM  
              (SELECT orderid
               FROM orderline
               GROUP BY orderid
               HAVING COUNT(DISTINCT orderlineid) BETWEEN 2 AND 10) filter               
                JOIN  
                (SELECT DISTINCT orderid,
                                 productid
                 FROM orderline) op1
                     ON filter.orderid = op1.orderid
                JOIN
              (SELECT DISTINCT orderid,
                               productid
               FROM orderline) op2
              ON op1.orderid = op2.orderid
                AND op1.productid < op2.productid) combinations
      GROUP BY product_id_1,
               product_id_2) aaa
       JOIN
     (SELECT productid,
             productgroupname AS productgroup_1
      FROM product) bbb
     ON aaa.product_id_1 = bbb.productid
       JOIN
     (SELECT productid,
             productgroupname AS productgroup_2
      FROM product) ccc
     ON aaa.product_id_2 = ccc.productid
ORDER BY numorders DESC
LIMIT 5;
product_id_1 product_id_2 productgroup_1 productgroup_2 numorders
12820 13190 FREEBIE FREEBIE 2580
12819 12820 FREEBIE FREEBIE 1839
11048 11196 ARTWORK BOOK 1822
10956 12139 FREEBIE OCCASION 1481
12139 12820 OCCASION FREEBIE 1239

Note: The identified product bundles commonly reflect products that are marketing together, rather than chosen by the customers themselves.