MACHINE LEARNING

# Looking for Similarities With Association Rule Mining

by **Igor Kleiman**

23 May 2019 • 13 min read

**We like having our ten stripes in a row and we ask ourselves who else does?**

Life is unfair! Some geniuses are acknowledged and honored during their lifetime, some even long after it, but there are a lot of fundamental research pioneers who have unfortunately fallen into obscurity in society’s collective memory. Petr Hájek (6 Feb. 1940 – 26 Dec. 2016), Professor and Director of Research at the Institute of Mathematics of the Czech Academy of Sciences, was one of those pioneers who built the foundation of modern machine learning theory, and it is to him that I devote this article.

More than 50 years ago, in 1966 – the year of the mini-skirt, Vietnam war protests and the space race between the United States and the Soviet Union – Petr Hájek, together with his colleagues Tomáš Havránek, Ivan Havel and Metoděj Chytil, formulated the GUHA (General Unary Hypotheses Automaton) principle. The GUHA principle describes the idea of using computers to generate series of hypotheses that describe relations between the properties of objects. Ultimately, the GUHA principle forms the foundation for Knowledge Discovery in Databases (KDD), better known as data mining or knowledge extraction, and for unsupervised machine learning techniques, research areas that emerged several decades later.

In this article, I would like to talk about association analysis, a simple but extremely useful unsupervised learning technique that traces back to Hájek’s GUHA principle. Generally speaking, association analysis addresses the problem of finding subsets of data that occur with high probability, as well as combinations of such subsets.

As we know, many businesses hold enormous amounts of customer and purchasing data. For example, a grocery store definitely owns data containing the names of the products purchased in each transaction. Other retailers and the majority of online dealers also have similar data on purchases at their stores. Using these data, we might want to find subgroups of products that tend to co-occur, either in purchasing or in viewing behavior.

Retailers might be interested in cross-promoting product combinations with different deals if they know that these products are so highly correlated that people purchase them together. Online platforms can use this information to generate content recommendations. Another huge topic is strategic product placement within grocery stores, because certain product arrangements encourage consumers to purchase both items at the same time. And of course, the variety of use cases goes far beyond the grocery store example! Think, for instance, of Netflix-like applications that recommend content based on viewing behavior, even without any explicit feedback from the customer. Further embodiments of these ideas could lead to use cases covering the analysis of social trends and even the promotion of political campaigns.

Let’s look at the simplest version of association analysis based on grocery basket analysis – the canonical way of thinking about finding associations among data. We’re assuming that six people have a basket at a grocery store, and they have different objects in their basket at checkout time. Imagine now that we have millions of these checkout transactions and they’re across thousands of products; we now might want to use this type of data to analyze **patterns of co-occurrence**!

Which products tend to appear in baskets, and at what rate, and what are the association rules? **Knowing those association rules, given that a customer has one object in their basket, we can predict that they are more likely to have a second object in their basket as well.**

For example, consider the following “grocery baskets” of six customers:

| Basket ID | Items |
| --- | --- |
| 1 | Cola, Butter, Bread, Cheese |
| 2 | Bread, Cheese, Milk |
| 3 | Bread, Fanta, Beer, Eggs, Butter |
| 4 | Eggs, Salami, Butter, Beer, Cola, Toast |
| 5 | Toast, Eggs, Butter, Fanta |
| 6 | Bread, Milk, Butter, Fanta, Cheese |

Based on these data, we want to analyze patterns of co-occurrence:

\(Bread \longrightarrow Butter\)

Let’s look at this problem from a more abstract perspective.

First, imagine that we have \(I\) different objects, for example \(I\) different items for sale in a grocery store, each of which has a unique index \(i\) from one to \(I\):

\(i \in \lbrace1, …,I\rbrace\)

Then let us assume we have a collection of subsets of these items, i.e. \(D_n\) for the \(n^{th}\) subset would be a subset of items from one to \(I\). In other words, we should think of \(D_n\) as giving the list of things purchased by customer \(n = 1, …,\,N\)

\(D_n \subseteq \lbrace1, …,\,I\rbrace\)

So now there are **two objectives** we want to accomplish here. The first is to perform an association analysis. This is simply about locating subsets of products that have a high probability of occurring together. For instance, let’s agree on

\(K = \lbrace bread,\ milk,\ eggs\rbrace\)

and then we’ll count how many of the grocery baskets contain those three products together. By dividing this figure by the total number of baskets, we get the fraction of customers that purchased bread, milk and eggs together:

\(P(K) = \frac{\#\lbrace n\ such\ that\ K\,\subseteq\,D_n\rbrace}{N}\)

So, our goal is to find these subsets \(K\), where \(P(K)\) is a large number.
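To make this concrete, the support \(P(K)\) from the formula above can be computed over the six toy baskets with a few lines of Python. This is a minimal sketch; the basket contents are copied from the table above:

```python
# The six toy baskets from the table above
baskets = [
    {"Cola", "Butter", "Bread", "Cheese"},
    {"Bread", "Cheese", "Milk"},
    {"Bread", "Fanta", "Beer", "Eggs", "Butter"},
    {"Eggs", "Salami", "Butter", "Beer", "Cola", "Toast"},
    {"Toast", "Eggs", "Butter", "Fanta"},
    {"Bread", "Milk", "Butter", "Fanta", "Cheese"},
]

def support(K, baskets):
    # P(K): fraction of baskets n such that K is a subset of D_n
    K = set(K)
    return sum(K <= basket for basket in baskets) / len(baskets)

print(support({"Bread", "Butter"}, baskets))        # baskets 1, 3 and 6 -> 0.5
print(support({"Bread", "Milk", "Eggs"}, baskets))  # never together -> 0.0
```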

Another objective is to discover association rules. This is the problem of finding objects that are highly correlated. Let \(A\) and \(B\) be two disjoint subsets \(A\,\cap\,B\,=\,\emptyset\) of the products \(\lbrace 1…I\rbrace\). Then \(A \Longrightarrow B\) can be interpreted to mean that purchasing \(A\) increases the likelihood of purchasing \(B\) as well.

In order to learn all those figures, we need to represent our basket in a different way:

| Basket ID | Beer | Bread | Butter | Cheese | Cola | Eggs | Fanta | Milk | Salami | Toast |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 6 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |

Following our first objective, we’re looking for a subset of those 10 items that co-occur with high probability. This probability could be defined by some threshold, let’s say 49%. In this simple case, we can easily discover that eggs and butter appear together in 50% of all baskets, which is above the threshold, and therefore we state that these two items co-occur with high probability in this dataset.
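A one-hot table like the one above can be derived directly from the basket lists. The following sketch uses pandas (`pd.crosstab` is one of several ways to do this) to rebuild the table and verify the eggs-and-butter figure:

```python
import pandas as pd

# The six toy baskets, keyed by basket ID
baskets = {
    1: ["Cola", "Butter", "Bread", "Cheese"],
    2: ["Bread", "Cheese", "Milk"],
    3: ["Bread", "Fanta", "Beer", "Eggs", "Butter"],
    4: ["Eggs", "Salami", "Butter", "Beer", "Cola", "Toast"],
    5: ["Toast", "Eggs", "Butter", "Fanta"],
    6: ["Bread", "Milk", "Butter", "Fanta", "Cheese"],
}

# Long format: one (basket, item) row per purchased item
long = pd.Series(baskets).explode().rename_axis("basket_id").rename("item").reset_index()

# Cross-tabulate into the 0/1 basket-by-item matrix shown above
onehot = pd.crosstab(long["basket_id"], long["item"])
print(onehot)

# Fraction of baskets containing both eggs and butter
print((onehot["Eggs"] & onehot["Butter"]).mean())  # 0.5
```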

Unfortunately, or luckily 😉, real-life situations are much more sophisticated. There is a good probability that we have more than 6 customers and way more items for sale. So how about having \(N \approx 10^8\) and \(I \approx 10^4\)?

Using straightforward combinatorics would lead us to the conclusion that a simple brute-force search approach is not going to be possible in this case.

The first question we could ask ourselves is how many product subsets \(K \subseteq \lbrace 1, …, I\rbrace\) are there at all? Each subset can be represented by a binary indicator vector of length \(I\) and the total number of possible vectors is \(2^I\).

And what if we didn’t check all those possible product combinations, as only very few people (and I think I know them all personally 😊) will have a basket with all of the grocery store items in it? How about if we only check up to \(b\) items? Hmm, the number of sets of size \(b\) picked from \(I\) items is

\(\binom{I}{b} = \frac{I!}{b!\,(I - b)!}\)

which might become a lifetime job with a lot of follow-up tasks for your descendants. If we had only \(I = 10^4\) and \(b = 5\) the total number of possible product combinations would be

\(\binom{10^4}{5} \approx 10^{18}\)

As you can imagine, the solution to this tricky problem needs an algorithm that reduces the calculation effort and somehow reduces the count of subsets \(K \subseteq \lbrace 1, …, I\rbrace\) that need to be taken into consideration.
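The standard library can confirm this order of magnitude (`math.comb` is available from Python 3.8 on):

```python
import math

# Number of candidate itemsets of size b = 5 drawn from I = 10**4 products
n = math.comb(10**4, 5)
print(f"{n:.3e}")  # roughly 8.3e+17, i.e. on the order of 10^18
```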

But first, let us decide which terms need to be calculated. As already mentioned, we’re still looking at all relevant product subsets \(K \subseteq \lbrace 1, …, I\rbrace\). And of course, we’re still bearing in mind those disjoint product subsets \(A\) and \(B\) so that \(A,B \subset K\), \(A \cup B = K\), \(A \cap B = \emptyset\).

So the first term of interest is the probability of \(K\), which can also be written as the joint probability of subsets \(A\) and \(B\).

\(P(K) = P(A, B)\)

We call this term the prevalence or **support** of the items in \(K\). In essence, we’re looking for all subsets \(A\) and \(B\) whose combination \(K\) co-occurs often.

E.g. if \(K = \lbrace Eggs, Butter, Milk\rbrace\), \(A = \lbrace Eggs\rbrace\), \(B = \lbrace Butter, Milk\rbrace\) and \(P(K) = 0.2\), then eggs, butter and milk will be purchased together in 20% of all cases (baskets).

And then we’d like to learn the conditional probability of set \(B\) given set \(A\). In other words, if we know that \(A\) is in the grocery basket, what is the probability that \(B\) will also appear in the basket? We call this **confidence**, and it will be used as the basis for the rule \(A \Longrightarrow B\) (purchasing \(A\) increases the likelihood of also purchasing \(B\)).

\(P(B \vert A) = \frac{P(A, B)}{P(A)} = \frac{P(K)}{P(A)}\)

E.g. if \(K = \lbrace Eggs, Butter, Milk\rbrace\), \(A = \lbrace Eggs\rbrace\), \(B = \lbrace Butter, Milk\rbrace\) and \(P(B \vert A) = 0.65\), then butter and milk were purchased in 65% of all checkouts in which eggs were purchased.

Finally, we want to know how much more certain we are of getting \(B\) in the basket if \(A\) is already there, and that compared to the simple prevalence of \(B\).

\(L(A \vert B) = \frac{P(A, B)}{P(A)P(B)} = \frac{P(K)}{P(A)P(B)} = \frac{P(B \vert A)}{P(B)}\)

This is called the **lift** of rule \(A \Longrightarrow B\) (purchasing \(A\) increases the likelihood of also purchasing \(B\)).

E.g. if \(K = \lbrace Eggs, Butter, Milk\rbrace\), \(A = \lbrace Eggs\rbrace\), \(B = \lbrace Butter, Milk\rbrace\) and \(L(A \vert B) = 1.6\), then it is 1.6 times more likely that butter and milk will be purchased given that eggs are in the grocery basket than it would be without that knowledge.
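All three measures can be checked against the six toy baskets from the beginning of the article. Here is a small sketch for the rule \(Bread \Longrightarrow Butter\); the names and numbers come from the toy example, not from any real dataset:

```python
# The six toy baskets from the table above
baskets = [
    {"Cola", "Butter", "Bread", "Cheese"},
    {"Bread", "Cheese", "Milk"},
    {"Bread", "Fanta", "Beer", "Eggs", "Butter"},
    {"Eggs", "Salami", "Butter", "Beer", "Cola", "Toast"},
    {"Toast", "Eggs", "Butter", "Fanta"},
    {"Bread", "Milk", "Butter", "Fanta", "Cheese"},
]

def p(items):
    # Empirical probability: fraction of baskets containing all given items
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

A, B = {"Bread"}, {"Butter"}
support = p(A | B)              # P(K) = P(A, B): bread and butter in 3 of 6 baskets
confidence = support / p(A)     # P(B | A) = 0.5 / (4/6) = 0.75
lift = support / (p(A) * p(B))  # P(B | A) / P(B) = 0.5 / ((4/6) * (5/6)) = 0.9

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift below 1, as here, would actually mean that bread in the basket makes butter slightly *less* likely than its base rate.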

Again, remember that we agreed on setting a threshold \(\tau\) and that the empirical probability of \(K\) needs to exceed this threshold. The **Apriori** algorithm, which builds on the GUHA principle, provides us with a way to find all such subsets \(K\) without having to search all \(2^I\) product combinations.

First, we need to set \(0 < \tau < 1\) such that only a relatively small fraction of all subsets \(K\) satisfies \(P(K) > \tau\). The Apriori algorithm restricts the number of subsets \(K\) that need to be checked by comparing their probabilities with the above-mentioned threshold. And of course, if \(K\) satisfies \(P(K) > \tau\), it appears in at least \(N \cdot \tau\) of the \(N\) baskets.

Before we come to the definition of the Apriori algorithm, let me remind you of some basic probability rules that form its foundation:

If \(K’ = K \cup A\) and \(K \subset \lbrace 1, …, I\rbrace\), \(A \subset \lbrace 1, …, I\rbrace\)

then \(P(K) < \tau \Longrightarrow P(K’) < \tau\)

because \(P(K’) = P(K, A) = P(A \vert K) P(K) ≤ P(K) < \tau\)

And conversely, if \(P(K) > \tau\) and \(A \subset K\) then \(P(A) \geq P(K) > \tau\)

The problem can be represented as a lattice diagram of itemsets ordered by inclusion.

And here is the simplified version of the Apriori algorithm:

- Set the count threshold \(N \cdot \tau\), where \(0 < \tau < 1\) is chosen reasonably small
- \(|K| = 1\): check each single item; it needs \(P(K) ≥ \tau\), i.e. it needs to appear in \(≥ N \cdot \tau\) baskets. Items with \(P(K) < \tau\) are dropped
- \(|K| = 2\): build all pairs from the items that survived the previous step and recheck \(P(K) ≥ \tau\). Those with \(P(K) < \tau\) are dropped
- …
- \(|K| = \gamma\): build all sets of size \(\gamma\) from the survivors of the previous step and recheck \(P(K) ≥ \tau\). Those with \(P(K) < \tau\) are dropped and the rest is kept!

As \(\gamma\) increases, the number of sets that survive will decrease! At a certain \(\gamma\), no sets will survive, and we’re done!
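The simplified procedure above can be sketched in a few lines of plain Python. This is an illustration of the level-wise pruning idea, not an optimized implementation; candidates of the next size are generated only from items that survived the previous pass:

```python
from itertools import combinations

# The six toy baskets from the beginning of the article
baskets = [
    {"Cola", "Butter", "Bread", "Cheese"},
    {"Bread", "Cheese", "Milk"},
    {"Bread", "Fanta", "Beer", "Eggs", "Butter"},
    {"Eggs", "Salami", "Butter", "Beer", "Cola", "Toast"},
    {"Toast", "Eggs", "Butter", "Fanta"},
    {"Bread", "Milk", "Butter", "Fanta", "Cheese"},
]

def support_count(itemset, baskets):
    # Number of baskets containing every item of the candidate set
    return sum(set(itemset) <= b for b in baskets)

def apriori(baskets, min_count):
    items = sorted({i for b in baskets for i in b})
    # Pass 1: single items that clear the threshold
    current = [(i,) for i in items if support_count((i,), baskets) >= min_count]
    frequent, size = {}, 1
    while current:
        frequent.update({k: support_count(k, baskets) for k in current})
        size += 1
        # Build candidates of the next size only from surviving items
        survivors = sorted({i for k in current for i in k})
        current = [c for c in combinations(survivors, size)
                   if support_count(c, baskets) >= min_count]
    return frequent

# tau = 0.5 -> an itemset must appear in at least 3 of the 6 baskets
freq = apriori(baskets, min_count=3)
for itemset, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, count)
```

On this toy data the search stops by itself at size 3: no triple survives, so only single items and pairs are reported.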

Coming back to the above-mentioned measures – support, confidence and lift – we immediately recognize that once we have all \(K\) such that \(P(K) ≥ \tau\), with \(A\) and \(B\) partitioning \(K\), both \(P(A)\) and \(P(B)\) are already among the stored supports. This means \(P(B \vert A)\) and \(L(A \vert B)\) can easily be calculated without any additional effort.
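In other words, once the supports of all frequent itemsets are stored, confidence and lift of any rule fall out of simple lookups, with no further pass over the data. A minimal sketch, with support values taken from the toy six-basket example:

```python
# Supports as they would be stored by the Apriori pass
# (values from the six-basket toy example)
support = {
    frozenset({"Bread"}): 4 / 6,
    frozenset({"Butter"}): 5 / 6,
    frozenset({"Bread", "Butter"}): 3 / 6,
}

K = frozenset({"Bread", "Butter"})
A = frozenset({"Bread"})
B = K - A  # the consequent is the rest of K

confidence = support[K] / support[A]            # P(B | A) = 0.75
lift = support[K] / (support[A] * support[B])   # 0.9, just lookups and divisions
print(confidence, lift)
```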

There are a lot of brilliant implementations of the Apriori algorithm and association rules. In the following coding example, we’d like to provide you with a Python-native implementation of the Apriori algorithm, based on successive reduction of the initial dataset. For the association rules we use the MLxtend framework. For this example we use a subset of data from the Instacart Market Basket Analysis dataset on Kaggle, namely the purchase order data specifying which products were purchased in which order.

`order_products__train.csv` contains the extracted contents of previous orders for all customers:

| order_id | product_id | add_to_cart_order | reordered |
| --- | --- | --- | --- |
| 1 | 49302 | 1 | 1 |
| 1 | 11109 | 2 | 1 |
| 1 | 10246 | 3 | 0 |
| … | … | … | … |

The initial dataset contains 131 209 unique orders.

There are two questions we are going to answer in this example:

**What are the sets of the three items found together more than 393 times in the above dataset?**

**What are the association rules between the articles within those subsets?**

Code section:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools as it
from mlxtend.frequent_patterns import association_rules


# The function below finds all product groupings of a specified size
# and counts how many times each of them appears
def find_groups_of_size_n(data, size):
    group_by = data.groupby("order_id")['product_id'].unique()
    group_by = group_by.apply(lambda x: sorted(x))
    group_by = pd.DataFrame(group_by)

    def groupings(x):
        return list(it.combinations(x, size))

    group_by['groups'] = group_by['product_id'].apply(groupings)
    counts = pd.Series(list(it.chain.from_iterable(group_by['groups'].values))).value_counts()
    return counts


# This function returns the product names for all product ids in the list
def product_lookup(product_ids):
    try:
        len(product_ids)
        names = [products[products.product_id == pid].iloc[0, 1] for pid in product_ids]
    except TypeError:
        names = products[products.product_id == product_ids].iloc[0, 1]
    return names


# This is a service function for the Apriori algorithm removing duplicates
# from the subset of products
def getListOfProducts(ser):
    products = []
    for element in ser.index:
        products += list(element)
    uniquelist = set(products)
    return list(uniquelist)


# This is a simple native implementation of the Apriori algorithm
# -------------------------------------------------------
# Find all combinations of three items appearing in at least
# (number of orders * threshold) grocery baskets
# Parameters:
#   data      : pandas.DataFrame with columns order_id, product_id
#   threshold : 0 < real < 1
# Returns:
#   pandas.Series: INDEX (tuple of frequent items), VALUE (count of baskets)
def findItemSets(data, threshold):
    subset = data.groupby("order_id")['product_id'].count()
    orders = subset[subset >= 3].count()
    subset_reduced = data[data.order_id.isin(subset[subset >= 3].index)]
    print(subset_reduced.head())
    for group_size in range(1, 4):
        result = find_groups_of_size_n(subset_reduced, group_size)
        print("run ", group_size, "count of elements ", len(result[result >= orders * threshold]))
        subset_reduced = subset_reduced[subset_reduced.product_id.isin(
            getListOfProducts(result[result >= orders * threshold]))]
    return result[result >= orders * threshold]


# This function returns the product names for all product ids in the 2D array
def product_lookup_2D(product_ids):
    names = []
    for element in product_ids:
        name = [products[products.product_id == pid].iloc[0, 1] for pid in element]
        names.append(name)
    return names


# Generates a DataFrame of association rules including the metrics
# 'support', 'confidence' and 'lift'
# For usage examples, please see
# http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/
# Returns:
#   pandas DataFrame with columns "antecedents" and "consequents" that store
#   itemsets, plus the scoring metric columns
def association(resultFrame, threshold):
    rules = association_rules(resultFrame, metric="confidence", min_threshold=threshold, support_only=True)
    print(rules[['antecedents', 'consequents']])


# ---
# --- ENTRY POINT MAIN
# ---
if __name__ == '__main__':
    file_path = './resources/grocery/order_products__train.csv'
    data = pd.read_csv(file_path)
    prod_filepath = "./resources/grocery/products.csv"
    products = pd.read_csv(prod_filepath)
    unique_orders = len(data.order_id.unique())
    print("Overall count of unique orders ", unique_orders)
    ser = findItemSets(data, 0.003)
    for index, values in ser.items():
        print(product_lookup(index), " occured in ", values, " baskets")
    resultFrame = ser.to_frame(name='support')
    resultFrame['itemsets'] = product_lookup_2D(resultFrame.index.tolist())
    resultFrame['support'] = (resultFrame['support'] / unique_orders) * 100
    association(resultFrame, 0.5)
```

And the console output looks as follows:

```
Overall count of unique orders  131209
   order_id  product_id  add_to_cart_order  reordered
0         1       49302                  1          1
1         1       11109                  2          1
2         1       10246                  3          0
3         1       49683                  4          0
4         1       43633                  5          1
run  1 count of elements  589
run  2 count of elements  406
run  3 count of elements  18
['Bag of Organic Bananas', 'Organic Strawberries', 'Organic Hass Avocado']  occured in  710  baskets
['Bag of Organic Bananas', 'Organic Strawberries', 'Organic Raspberries']  occured in  649  baskets
['Bag of Organic Bananas', 'Organic Strawberries', 'Organic Baby Spinach']  occured in  587  baskets
['Bag of Organic Bananas', 'Organic Raspberries', 'Organic Hass Avocado']  occured in  531  baskets
['Bag of Organic Bananas', 'Organic Baby Spinach', 'Organic Hass Avocado']  occured in  497  baskets
['Organic Baby Spinach', 'Banana', 'Organic Avocado']  occured in  484  baskets
['Banana', 'Large Lemon', 'Organic Avocado']  occured in  477  baskets
['Banana', 'Limes', 'Large Lemon']  occured in  452  baskets
['Bag of Organic Bananas', 'Organic Strawberries', 'Organic Cucumber']  occured in  424  baskets
['Limes', 'Large Lemon', 'Organic Avocado']  occured in  389  baskets
['Organic Strawberries', 'Organic Raspberries', 'Organic Hass Avocado']  occured in  381  baskets
['Organic Strawberries', 'Banana', 'Organic Avocado']  occured in  379  baskets
['Organic Strawberries', 'Organic Baby Spinach', 'Banana']  occured in  376  baskets
['Bag of Organic Bananas', 'Organic Strawberries', 'Organic Blueberries']  occured in  374  baskets
['Organic Baby Spinach', 'Banana', 'Large Lemon']  occured in  371  baskets
['Bag of Organic Bananas', 'Organic Cucumber', 'Organic Hass Avocado']  occured in  366  baskets
['Organic Lemon', 'Bag of Organic Bananas', 'Organic Hass Avocado']  occured in  353  baskets
['Banana', 'Limes', 'Organic Avocado']  occured in  352  baskets
                                      antecedents                                     consequents
0  (Bag of Organic Bananas, Organic Hass Avocado)                          (Organic Strawberries)
1  (Bag of Organic Bananas, Organic Strawberries)                          (Organic Hass Avocado)
2    (Organic Hass Avocado, Organic Strawberries)                        (Bag of Organic Bananas)
3                        (Bag of Organic Bananas)    (Organic Hass Avocado, Organic Strawberries)
4                          (Organic Hass Avocado)  (Bag of Organic Bananas, Organic Strawberries)
5                          (Organic Strawberries)  (Bag of Organic Bananas, Organic Hass Avocado)
```

As you can see, this handful of lines of code delivers answers to the question about the most frequent product subsets, as well as about association rules between groups of products that frequently occur together. I would encourage you to think about other interesting use cases where association analysis could provide you with incredible and profound insights.

---

**Reference for GUHA method:**

Petr Hájek, Tomáš Havránek: *Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory*. Springer-Verlag, 1978. ISBN 3-540-08738-9, 0-387-08738-9.