Using association rules to perform a market basket analysis

Imagine you run an online shop and would like to know which products are often bought together. This task is known as market basket analysis, in which retailers seek to understand the purchase behavior of their customers. This information can then be used for cross-selling and up-selling (Wikipedia).


Let us assume we have a data set that contains a list of customers of an online shop and the products they have bought (or viewed) in the past. Note that one customer can appear with multiple products.

UserID ProductID
10039052252084471969 Product_587
10039052252084471969 Product_40
10039052252084471969 Product_154
10046183258816255929 Product_256
10046183258816255929 Product_44
10047293680636077566 Product_1184
10055849645924040293 Product_334
10060944748730254910 Product_306
10060944748730254910 Product_154
10060944748730254910 Product_78
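
For reference, data like the listing above can be stored as a two-column csv file in the "single" format used below. Here is a minimal sketch, assuming a hypothetical data frame purchases containing a few of the rows shown above:

# hypothetical excerpt of the data above (illustrative values),
# written out as a semicolon-separated csv in "single" format
# (one product per row, with a header line)
purchases = data.frame(
  transactionID = c("10039052252084471969", "10039052252084471969",
                    "10039052252084471969", "10046183258816255929"),
  productID = c("Product_587", "Product_40", "Product_154", "Product_256")
)
write.table(purchases, "mydata.csv", sep = ";", row.names = FALSE, quote = FALSE)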

We will use a rule-based machine learning algorithm called Apriori to perform our market basket analysis. It is intended to identify strong rules/relations in a data set. The easiest way to understand association rule mining is to look at the results of such an analysis. To do that, we first read in our data set from above as transactions in single format. I saved my data as a csv file with two columns, called mydata.csv. After this we use the apriori algorithm from the arules package to identify strong rules in the data set.

# load required packages
library(arules)
library(dplyr)

# read in data (header = TRUE because the columns are referenced by name)
trans = read.transactions("mydata.csv", format = "single", sep = ";", header = TRUE, cols = c("transactionID", "productID"), encoding = "UTF-8")

# run apriori algorithm with minimum support, minimum confidence
# and a minimum rule length of two
rules = apriori(trans, parameter = list(supp = 0.005, conf = 0.001, minlen = 2))

# sort rules by lift
rules = sort(rules, by = "lift", decreasing = TRUE)

# print out rules to console
inspect(rules)

# remove redundant rules: drop a rule if a more general rule with a
# higher lift already exists (the rules are sorted by lift); coerce to
# a dense matrix so lower.tri() also works with newer arules versions,
# which also offer is.redundant() for this purpose
subset.matrix = as(is.subset(rules, rules), "matrix")
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] = NA
redundant = colSums(subset.matrix, na.rm = TRUE) >= 1
rules = rules[!redundant]

# write rules to a data frame
ruledf = data.frame(
  lhs = labels(lhs(rules)),
  rhs = labels(rhs(rules)),
  quality(rules))

# filter rules on a specific product
ruledf = filter(ruledf, lhs == "{Product_183}")

Running the Apriori algorithm with the code above gives us a list of association rules based on our input data. Let us have a look at the model's output to see what these rules look like. You can use the inspect command from the arules package to print rules to the console.
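
If the rule set is large, it can be useful to print only the top rules; head() works directly on a sorted rules object:

# print only the ten highest-lift rules (the set is already sorted by lift)
inspect(head(rules, n = 10))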

lhs              rhs             support confidence   lift
{Product_125} => {Product_306}     0.006      0.387   4.040
{Product_306} => {Product_125}     0.006      0.072   4.040
{Product_63}  => {Product_385}     0.005      0.400  19.472
{Product_385} => {Product_63}      0.005      0.285  19.472
{Product_264} => {Product_92}      0.005      0.378  27.143
{Product_92}  => {Product_264}     0.005      0.360  27.143
{Product_523} => {Product_306}     0.005      0.369   3.859
{Product_306} => {Product_523}     0.005      0.061   3.858
{Product_102} => {Product_120}     0.005      0.506   8.460

An example rule from our data set is {Product_125} ⇒ {Product_306}, meaning that customers who buy Product_125 also tend to buy Product_306. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support is defined as the proportion of transactions in the data set which contain the specific product(s). In the table above, the rule {Product_125} ⇒ {Product_306} has a support of 0.006, meaning that the two products were bought together in 0.6% of all transactions. The confidence is another important measure of interest. The rule {Product_125} ⇒ {Product_306} has a confidence of 0.387, which means that the probability of finding the RHS of the rule in transactions, under the condition that these transactions also contain the LHS, is 38.7%. When executing the Apriori algorithm you define both a minimum support and a minimum confidence constraint at the same time, which filters out uninteresting rules. We also defined a minimum length of two because we want a rule to cover at least two products.

Another popular measure of interest is the lift of an association rule. The lift is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) · supp(Y)) and can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations: a lift above 1 means the products appear together more often than expected if they were bought independently. There is a lot more to discover about association rule mining with the arules package if you look at its reference manual.
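
To make these measures concrete, here is a small self-contained sketch with five made-up transactions (the item names are hypothetical and not from the data set above); the values returned by apriori can be verified by hand:

library(arules)

# five made-up transactions (hypothetical items, for illustration only)
toy = list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "milk"),
  c("bread", "butter")
)
toy_trans = as(toy, "transactions")

# mine rules covering at least two items
toy_rules = apriori(toy_trans, parameter = list(supp = 0.4, conf = 0.5, minlen = 2))
inspect(toy_rules)

# e.g. {butter} => {bread}: supp = 3/5 = 0.6 (both items in 3 of 5 transactions),
# conf = 0.6 / supp({butter}) = 0.6 / 0.6 = 1, and
# lift = 0.6 / (supp({butter}) * supp({bread})) = 0.6 / (0.6 * 0.8) = 1.25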

lhs              rhs             support confidence   lift
{Product_92}  => {Product_264}     0.005      0.360  27.143
{Product_374} => {Product_378}     0.006      0.398  21.923
{Product_98}  => {Product_929}     0.012      0.556  20.165
{Product_375} => {Product_376}     0.007      0.365  20.139
{Product_257} => {Product_880}     0.006      0.378  19.847
{Product_63}  => {Product_385}     0.005      0.400  19.472
{Product_908} => {Product_98}      0.007      0.412  18.702
{Product_376} => {Product_378}     0.006      0.331  18.338
{Product_378} => {Product_375}     0.006      0.384  17.824
{Product_54}  => {Product_719}     0.005      0.256  17.415

In the table above we sorted our rules by lift, which gives us the top 10 most interesting associations in our data set. This information can now be used for cross-selling and up-selling. We also removed the redundant rules from this table. As you can see from the code above, you can also easily use filters to find rules for a specific product. Find the whole code along with other projects on my Github.
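
As an alternative to filtering the flattened data frame with dplyr, you can also subset the rules object directly in arules; a minimal sketch using the same example product as in the code above:

# select all rules whose left-hand side contains the example product
rules_183 = subset(rules, lhs %in% "Product_183")
inspect(rules_183)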
