Imagine you have an online shop and you would like to know which products are often bought together. This task is known as market basket analysis, in which retailers seek to understand the purchase behavior of their customers. This information can then be used for purposes of cross-selling and up-selling (Wikipedia).

Let us assume we have a data set which contains a list of customers of an online shop and the products they have bought (or viewed) in the past. As we can see below, a single customer can have bought multiple products.

UserID | ProductID |
10039052252084471969 | Product_587 |
10039052252084471969 | Product_40 |
10039052252084471969 | Product_154 |
10046183258816255929 | Product_256 |
10046183258816255929 | Product_44 |
10047293680636077566 | Product_1184 |
10055849645924040293 | Product_334 |
10060944748730254910 | Product_306 |
10060944748730254910 | Product_154 |
10060944748730254910 | Product_78 |
… | … |

We will use a rule-based machine learning algorithm called Apriori to perform our market basket analysis. It is intended to identify strong rules/relations discovered in a data set. The easiest way to understand association rule mining is to look at the results of such an analysis. To do that, we first want to read in our data set from above as transactions in "single" format. I saved my data as a csv file called mydata.csv with two columns. After this we will use the apriori() function from the arules package to identify strong rules in the data set.

# load required packages
library(arules)
library(dplyr)

# read in data as transactions in "single" format
trans = read.transactions("mydata.csv", format = "single", sep = ";",
                          cols = c("transactionID", "productID"),
                          encoding = "UTF-8")

# run apriori algorithm with minimum support, confidence and rule length
rules = apriori(trans, parameter = list(supp = 0.005, conf = 0.001, minlen = 2))

# sort rules by lift
rules = sort(rules, by = "lift", decreasing = TRUE)

# print out rules to console
inspect(rules)

# remove redundant rules: a rule is redundant if a higher-ranked rule
# is a subset of it
subset.matrix = is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] = NA
redundant = colSums(subset.matrix, na.rm = TRUE) >= 1
rules = rules[!redundant]

# write rules to a data frame
ruledf = data.frame(lhs = labels(lhs(rules)),
                    rhs = labels(rhs(rules)),
                    rules@quality)

# filter rules on a specific product
ruledf = filter(ruledf, lhs == "{Product_183}")

Running the apriori algorithm with the code above gives us a list of association rules based on our input data. Let us have a look at the output of the model to see what these rules look like. You can use the inspect() function from the arules package to print the rules to the console.

lhs | rhs | support | confidence | lift |
{Product_125} | {Product_306} | 0.006 | 0.387 | 4.040 |
{Product_306} | {Product_125} | 0.006 | 0.072 | 4.040 |
{Product_63} | {Product_385} | 0.005 | 0.400 | 19.472 |
{Product_385} | {Product_63} | 0.005 | 0.285 | 19.472 |
{Product_264} | {Product_92} | 0.005 | 0.378 | 27.143 |
{Product_92} | {Product_264} | 0.005 | 0.360 | 27.143 |
{Product_523} | {Product_306} | 0.005 | 0.369 | 3.859 |
{Product_306} | {Product_523} | 0.005 | 0.061 | 3.858 |
{Product_102} | {Product_120} | 0.005 | 0.506 | 8.460 |
… | … | … | … | … |

An example rule for our data set could be {Product_125} ⇒ {Product_306}, meaning that if Product_125 is bought, customers also buy Product_306. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support is defined as the proportion of transactions in the data set which contain the specific product(s). In the table above, the rule {Product_125} ⇒ {Product_306} has a support of 0.006, meaning that the two products have been bought together in 0.6% of all transactions.

The confidence is another important measure of interest. The rule {Product_125} ⇒ {Product_306} has a confidence of 0.387, which means that the probability of finding the RHS of the rule in transactions, under the condition that these transactions also contain the LHS, is 38.7%. If you want to execute the Apriori algorithm you need to define both a minimum support and a minimum confidence constraint at the same time. This helps you filter out interesting rules. We also defined a minimum length of two because we want a rule to cover at least two products.

Another popular measure of interest is the lift of an association rule. The lift is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) · supp(Y)) and can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations.

There is a lot more to discover about association rule mining with the arules package if you look at its reference manual.
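These definitions are easy to check by hand. As a quick sketch (in Python rather than R, on a small made-up transaction list, not the data set above), support, confidence and lift can be computed directly from their definitions:

```python
# Toy transaction list (hypothetical data, for illustration only).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "C"},
    {"B"},
]

def support(itemset):
    # Proportion of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # P(transaction contains rhs | transaction contains lhs).
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # Ratio of the observed joint support to the support
    # expected if lhs and rhs were independent.
    return support(lhs | rhs) / (support(lhs) * support(rhs))

print(support({"A", "B"}))       # 2/5 = 0.4
print(confidence({"A"}, {"B"}))  # 0.4 / 0.6 ≈ 0.667
print(lift({"A"}, {"B"}))        # 0.4 / (0.6 * 0.8) ≈ 0.833
```

Note that the lift here is below 1, meaning A and B co-occur slightly less often than expected under independence; the rules in the tables above all have lifts well above 1.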

lhs | rhs | support | confidence | lift |
{Product_92} | {Product_264} | 0.005 | 0.360 | 27.143 |
{Product_374} | {Product_378} | 0.006 | 0.398 | 21.923 |
{Product_98} | {Product_929} | 0.012 | 0.556 | 20.165 |
{Product_375} | {Product_376} | 0.007 | 0.365 | 20.139 |
{Product_257} | {Product_880} | 0.006 | 0.378 | 19.847 |
{Product_63} | {Product_385} | 0.005 | 0.400 | 19.472 |
{Product_908} | {Product_98} | 0.007 | 0.412 | 18.702 |
{Product_376} | {Product_378} | 0.006 | 0.331 | 18.338 |
{Product_378} | {Product_375} | 0.006 | 0.384 | 17.824 |
{Product_54} | {Product_719} | 0.005 | 0.256 | 17.415 |
… | … | … | … | … |

In the table above we sorted our rules by lift, so we can see the top 10 most interesting associations in our data set. This information can now be used for purposes of cross-selling and up-selling. We also removed the redundant rules from this table. As you can see from the code above, you can also easily use filters to find rules for a specific product. Find the whole code along with other projects on my Github.
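As a closing aside, the apriori() call above hides how the algorithm actually works: it performs a level-wise search that relies on the downward-closure property, i.e. an itemset can only be frequent if all of its subsets are frequent. The following sketch (in Python, on hypothetical toy baskets, not the arules implementation) illustrates that idea for the frequent-itemset mining step:

```python
from itertools import combinations

# Hypothetical toy baskets; the real analysis runs on the transactions above.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "C"},
]
min_support = 0.4  # minimum fraction of baskets, analogous to supp in apriori()

def frequent_itemsets(baskets, min_support):
    n = len(baskets)
    # Level 1: frequent single items.
    items = {i for b in baskets for i in b}
    current = [frozenset([i]) for i in items
               if sum(i in b for b in baskets) / n >= min_support]
    result = list(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Downward closure: keep only candidates whose (k-1)-subsets
        # are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(current)
                             for s in combinations(c, k - 1))}
        # Support counting against the baskets.
        current = [c for c in candidates
                   if sum(c <= b for b in baskets) / n >= min_support]
        result.extend(current)
        k += 1
    return result

print(frequent_itemsets(baskets, min_support))
```

The rule-generation step, splitting each frequent itemset into a LHS and RHS and applying the confidence threshold, follows the same pattern and is what apriori() does after mining the frequent itemsets.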