The Apriori algorithm is a method used in data mining to find patterns or associations in large datasets. Imagine you have a list of items that people buy in a store, and you want to find out which items are often bought together. This is called association rule mining, and the Apriori algorithm helps you discover these associations.
Steps of the Apriori Algorithm
- Initialize:
  - Start with the list of all individual items in the dataset.
  - Set the minimum support threshold (min_sup) and minimum confidence threshold (min_conf).
- Generate Frequent Itemsets:
  - Step 1: Find Frequent 1-itemsets:
    - Calculate the support for each individual item.
    - Keep only those items whose support is greater than or equal to min_sup.
  - Step 2: Generate Candidate k-itemsets (k ≥ 2):
    - Combine the frequent (k-1)-itemsets to generate candidate k-itemsets.
    - Calculate the support for each candidate k-itemset.
    - Keep only those candidate k-itemsets whose support is greater than or equal to min_sup.
- Repeat:
  - Continue generating candidate k-itemsets and pruning infrequent ones until no more frequent itemsets can be found.
- Generate Association Rules:
  - For each frequent itemset, generate all possible non-empty subsets.
  - For each subset (antecedent), calculate the confidence of the rule (antecedent ⇒ consequent).
  - If the confidence of the rule is greater than or equal to min_conf, keep the rule.
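Before walking through an example, here is a minimal Python sketch of the loop described above. It counts support by brute force over every candidate, so it illustrates the steps rather than serving as an optimized implementation; the function name and structure are just one way to write it.

```python
def apriori(transactions, min_sup):
    """Return every frequent itemset (as a frozenset) with its support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-itemsets
    items = {frozenset([item]) for t in transactions for item in t}
    frequent = {s: support(s) for s in items}
    frequent = {s: v for s, v in frequent.items() if v >= min_sup}
    result = dict(frequent)

    # Step 2 onwards: join frequent (k-1)-itemsets into candidate
    # k-itemsets, then prune candidates below min_sup
    k = 2
    while frequent:
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        frequent = {c: support(c) for c in candidates}
        frequent = {c: v for c, v in frequent.items() if v >= min_sup}
        result.update(frequent)
        k += 1
    return result
```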
Let’s understand this as if we were explaining it to a 15-year-old...
Let’s say we have a store and we want to analyze customer transactions to find patterns in what people buy together. The Apriori algorithm is the tool that uncovers these patterns.
Here are some example transactions:
- Transaction 1: [Milk, Bread, Butter]
- Transaction 2: [Bread, Butter]
- Transaction 3: [Milk, Bread]
- Transaction 4: [Milk, Butter]
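If you want to follow along in code, these four transactions can be written as a plain Python list:

```python
transactions = [
    ["Milk", "Bread", "Butter"],  # Transaction 1
    ["Bread", "Butter"],          # Transaction 2
    ["Milk", "Bread"],            # Transaction 3
    ["Milk", "Butter"],           # Transaction 4
]
```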
Step 1: Find Frequent 1-itemsets
- Look at individual items first:
  - Milk: bought 3 times
  - Bread: bought 3 times
  - Butter: bought 3 times
- Calculate support for each item:
  - Support(Milk) = 3/4 = 0.75
  - Support(Bread) = 3/4 = 0.75
  - Support(Butter) = 3/4 = 0.75
- Assume min_sup is 0.5. All three items meet this threshold, so we keep every 1-itemset.
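Continuing with the `transactions` list above, the same counts and supports can be checked with a few lines of Python (a sketch, not part of any library):

```python
from collections import Counter

# Count how many transactions contain each individual item
counts = Counter(item for t in transactions for item in set(t))
supports = {item: c / len(transactions) for item, c in counts.items()}
# supports == {'Milk': 0.75, 'Bread': 0.75, 'Butter': 0.75}

min_sup = 0.5
frequent_1 = {item for item, s in supports.items() if s >= min_sup}
```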
Step 2: Generate Candidate 2-itemsets
- Now, create pairs of items:
  - [Milk, Bread]: bought 2 times
  - [Milk, Butter]: bought 2 times
  - [Bread, Butter]: bought 2 times
- Calculate support for each 2-itemset:
  - Support({Milk, Bread}) = 2/4 = 0.5
  - Support({Milk, Butter}) = 2/4 = 0.5
  - Support({Bread, Butter}) = 2/4 = 0.5
All 2-itemsets meet the min_sup threshold, so we keep all three pairs.
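The same check for pairs, continuing from the previous snippet:

```python
from itertools import combinations

# Form candidate pairs from the frequent 1-itemsets and count their support
pairs = [frozenset(p) for p in combinations(sorted(frequent_1), 2)]
pair_support = {
    p: sum(1 for t in transactions if p <= set(t)) / len(transactions)
    for p in pairs
}
frequent_2 = {p for p, s in pair_support.items() if s >= min_sup}
# Every pair appears in 2 of 4 transactions (support 0.5),
# so all three pairs survive
```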
Step 3: Generate Candidate 3-itemsets
- Create triplets of items:
  - [Milk, Bread, Butter]: bought 1 time
- Calculate support for the 3-itemset:
  - Support({Milk, Bread, Butter}) = 1/4 = 0.25
Since this support is below the min_sup threshold, we prune {Milk, Bread, Butter}.
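And the triplet, which fails the threshold:

```python
triple = frozenset(["Milk", "Bread", "Butter"])
s = sum(1 for t in transactions if triple <= set(t)) / len(transactions)
# s == 0.25, which is below min_sup == 0.5, so the triplet is pruned
```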
Step 4: Generate Association Rules
For each frequent itemset, generate rules and calculate confidence:
- For {Milk, Bread}:
  - Rule: Milk ⇒ Bread
    - Confidence = Support({Milk, Bread}) / Support(Milk) = 0.5 / 0.75 ≈ 0.67
  - Rule: Bread ⇒ Milk
    - Confidence = Support({Milk, Bread}) / Support(Bread) = 0.5 / 0.75 ≈ 0.67
- For {Milk, Butter}:
  - Rule: Milk ⇒ Butter
    - Confidence = Support({Milk, Butter}) / Support(Milk) = 0.5 / 0.75 ≈ 0.67
  - Rule: Butter ⇒ Milk
    - Confidence = Support({Milk, Butter}) / Support(Butter) = 0.5 / 0.75 ≈ 0.67
- For {Bread, Butter}:
  - Rule: Bread ⇒ Butter
    - Confidence = Support({Bread, Butter}) / Support(Bread) = 0.5 / 0.75 ≈ 0.67
  - Rule: Butter ⇒ Bread
    - Confidence = Support({Bread, Butter}) / Support(Butter) = 0.5 / 0.75 ≈ 0.67
Assume min_conf is 0.6. All these rules meet the confidence threshold, so we keep them.
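In code, continuing from the snippets above, the six rules and their confidences could be enumerated like this (again a sketch, not a library call):

```python
min_conf = 0.6
rules = []
for pair in frequent_2:
    for antecedent in pair:
        # The consequent is whatever remains of the pair
        (consequent,) = pair - {antecedent}
        conf = pair_support[pair] / supports[antecedent]
        if conf >= min_conf:
            rules.append((antecedent, consequent, round(conf, 2)))
# Each rule has confidence 0.5 / 0.75 ≈ 0.67, so all six rules are kept
```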
What is Support?
Support is a measure of how frequently an item or a set of items appears in the dataset. It’s an important concept because it helps us identify which items or itemsets are common and worth looking at.
How to Calculate Support
The support of an item or itemset is calculated by dividing the number of transactions that include the item or itemset by the total number of transactions.
Support Formula:
Support = (Number of transactions containing the item or itemset) / (Total number of transactions)
Why is Support Important?
Support helps us identify which items or itemsets are frequently bought together. If the support of an itemset is high, it means that the itemset is common and might be important for making business decisions, such as promotions or product placement.
In the Apriori algorithm, we often set a minimum support threshold to focus only on itemsets that are sufficiently frequent. For example, if we set the minimum support threshold to 0.5 (or 50%), we will only consider itemsets that appear in at least 50% of the transactions.
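As a standalone helper, support can be written directly from the formula (a sketch; the function name is just illustrative):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

# With the four example transactions:
# support({"Milk", "Bread"}, transactions) == 0.5
```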
What is Confidence?
Confidence is a measure used to evaluate the strength of an association rule. It tells us how often items in the rule appear together, given the occurrence of the antecedent (the “if” part of the rule).
How to Calculate Confidence
Confidence is calculated by dividing the support of the combined itemset (both antecedent and consequent) by the support of the antecedent itemset.
Confidence Formula: Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
Where:
- A is the antecedent (the “if” part).
- B is the consequent (the “then” part).
- A ∪ B is the combined itemset of both A and B.
Why is Confidence Important?
Confidence helps us understand the reliability of an association rule. A high confidence value means that the rule is often true, indicating a strong relationship between the items.
In the Apriori algorithm, after generating candidate rules, we calculate the confidence for each rule and compare it to a predefined minimum confidence threshold. Rules that meet or exceed this threshold are considered strong and useful for decision-making.
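Confidence follows directly from the formula, reusing the `support` helper sketched above:

```python
def confidence(antecedent, consequent, transactions):
    """Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)

# With the example data:
# confidence({"Milk"}, {"Bread"}, transactions) ≈ 0.67
```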
Why is it Called “Apriori”?
The algorithm is called “Apriori” because it uses prior knowledge of frequent itemset properties: every subset of a frequent itemset must also be frequent. For example, if {Milk, Bread, Butter} is frequent, then {Milk, Bread}, {Milk, Butter}, and {Bread, Butter} must also be frequent. Conversely, if any subset of an itemset is infrequent, the larger itemset cannot be frequent, which is what lets the algorithm prune candidates without even counting them.
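This property is what makes pruning cheap: before counting a candidate’s support, we can discard it if any of its (k-1)-subsets is missing from the frequent set. A small sketch of that check:

```python
from itertools import combinations

def all_subsets_frequent(candidate, frequent_prev):
    """Apriori property: a k-itemset can only be frequent if every
    one of its (k-1)-subsets is already known to be frequent."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, k - 1))

# Example: {Milk, Bread, Butter} passes this check (all its pairs are
# frequent), but its actual support (0.25) still falls below min_sup,
# so it is pruned once counted.
```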
Summary
The Apriori algorithm is a powerful tool to find associations in large datasets by repeatedly generating and pruning itemsets based on their frequency. This helps businesses understand customer behavior and make better decisions.
Article written by Harshil Bansal, Team edSlash.