# CS412: Introduction to Data Mining Assignment 4

\$30.00

5/5 - (1 vote)

## 1 Constraint pattern mining (5 points)

For the following constraints, write down your answer if they are (strongly convertible,
convertible) Anti-monotone, Monotone or Succinct.

Purpose
• Have a better understanding of constraint pattern mining.

Requirements
• A short explanation to justify if the constraint belongs to EACH type is required. To
show a constraint is not of one type, use a simple example.
a. (1’, L1) for a set of values S, and a value v, constraint v 2 S
1
2
b. (1’, L1) for a set of values S, and a value v, constraint max(S) v
c. (1’, L1) for a set of values S, and a value v, constraint max(S)  v
d. (2’, L2) for a set of values S, and a value v, constraints avg(S) v and avg(S)  v

## 2 Advanced pattern mining (10 points)

For each question below, indicate if the statement is true or false.
Purpose
• Have a better understanding of advanced pattern mining.

Requirements
• For each sub-question, select true (T) or false (F) and also write down a brief
explanation. You will NOT get credit without an explanation.
a. (2’, L2) T/F. Convertible constraint porperty cannot be exploited in an Apriori-based
mining algorithm.

b. (2’, L2) T/F. In PrefixSpan, physical project is not used because of its slow performance.
c. (2’, L1) T/F. Mining closed frequent graphs only is lossless compression of the graph
database.

d. (2’, L3) T/F. In sequential pattern mining, the number of length-2 candidates generated from x frequent length-1 patterns is 3
2×2 1
2x
e. (2’, L3) T/F. For nontrivial constraints (trivial constraints are those that are satisfied
by every possible pattern), it cannot be monotone and anti-monotone at the same time.

## 3 Sequential pattern mining (20 points)

Use a toy dataset to perform sequential pattern mining algorithms.
Purpose
• Work on sequential pattern mining using GSP and PrefixSpan

Requirements
• Write down each step as detail as possible.
Suppose a toy sequence database D contains three sequences as follows. Let the minimum
support be 3.
customer id shopping sequence
1 (bc)(de)f
2 bcdef
3 (bc)dbegf

The following questions require you to perform GSP algorithm.

(1) (2’, L2) Scan database once, list length-1 sequential pattern candidates C1 and the
result L1 after pruning.
(2) (3’, L2) Following, generate C2, and L2. Hint: do not miss any candidates in C2
(3) (3’, L3) Now, generate C3, L3, C4, L4 and longer candidates/results until the algorithm
terminates.

Now, Using the same DB, apply PrefixSpan to get the same results:
(4) (2’, L2) Scan database once, and get the length-1 prefix list P1.
(5) (3’, L2) For each prefix in P1, generate its projected database.
(6) (3’, L3) complete the PrefixSpan algorithm from the result above. For each step, list
the prefix and its corresponding projected DB.

Comparing the two algorithms, answer the following questions:
(7) (2’, L3) What is the major di↵erence between the two algorithms?
(8) (2’, L3) Based on the above analysis, can you tell which one will outperform the other?
And in what contidition/situation particularly?

## 4 Decision Trees (18 points)

ID3 is a simple algorithm for decision tree construction using information gain. The steps
of the ID3 algorithm are similar to those introduced in the lecture. In particular, ID3 uses
information gain to select decision attributes, and each attribute is used at most once in any
root-to-leaf path in the decision tree.

You will use ID3 to build a decision tree that predicts
whether a candidate will be accepted to the PhD program of some University X, given the
student’s information about GPA, university, publications, and recommedation.

Purpose
• Understand and practice basic decision tree construction, calculation of information
gain measures, and classifier evaluation.
Requirements
• Show the calculations for selecting the decision tree attributes and the labels for each
leaf.

a. (6’, L2) Using the ID3 algorithm to construct a decision tree using the training data
in Table 1. When multiple attributes has best information gain, choose the one whose
name appears earliest in alphabetical order. When there is a tie for the majority labels,
choose no. Show the final decision tree, and the calculations to derive that tree.

id GPA univ published recommendation accepted
1 4.0 top-10 yes good yes
2 4.0 top-10 no good yes
3 4.0 top-20 no normal yes
4 3.7 top-10 yes good yes
5 3.7 top-20 no good yes
6 3.7 top-30 yes good yes
7 3.7 top-30 no good no
8 3.7 top-10 no good no
9 3.5 top-20 yes normal no
10 3.5 top-10 no normal no
11 3.5 top-30 yes normal no
12 3.5 top-30 no good no

Table 1: Training Data for the decision tree problem
id GPA univ published recommendation accepted
1 4.0 top-10 yes good yes
2 3.7 top-30 yes good yes
3 3.5 top-30 yes good yes
4 3.7 top-10 no good no
5 3.5 top-30 no good no
Table 2: Testing Data for the decision tree problem

b. (4’, L2) Evaluate your constructed decision tree using the testing data in Table 2 in
terms of the precision and recall for the class yes. Show your calculations.
c. (4’, L2) What is the worst case time complexity of training a decision tree using ID3
on a dataset with n data records and m attributes each having p possible values? Show

d. (4’, L3) Each root-to-leaf path in any decision tree can be converted into a rule, such
as a path A1
=T rue
! A2
=F alse
! class = +1 can be converted to the rule “If attribute
Ai is true and attribute A2 is false, then the instance has class +1”.

the following:
1. Generate the rules for each leaf of your constructed decision tree.
2. Is it possible to construct a decision tree from a set of rules? Explain your answer.

You will be guided through the steps of building an ensemble classifier using AdaBoost. The
data points to be classified are given in Table 3. Each classifier in the ensemble will have
one of the following forms:

• If x>a, label +1, else label -1
• If x a, label +1, else label -1
• If y<b, label +1, else label -1
• If y  b, label +1, else label -1

where a and b are constants for you to figure out. That is, the hypothesis of the classifier can
be represented by a line parallel to the y-axis or the x-axis. While the original AdaBoost
algorithm trains each base classifier on sampled data points, you will simulate AdaBoost
(deterministically) by picking each base classifier given all the data points such that the base
classifier minimizes the weighted error rate.

id x y label
1 1.0 0.5 +1
2 2.2 1.0 +1
3 2.7 2.0 +1
4 0.5 1.5 -1
5 1.2 2.3 -1
6 1.5 2.7 -1

Table 3: Data points for the AdaBoost problem
Purpose
• Understand and practice AdaBoost algorithm by walking through the steps.

Requirements
• Show all the steps and calculations needed to derive each classifier.
• In case of ties when selecting classifiers, pick one that corresponds to a line parallel to
the y-axis; if the tie still exists, pick one with minimum a or b.

a. (2’, L2) Assume that data weight distribution D1 in Round 1 is uniform. Find classifier
h1 that has minimum weighted error with data weight distribution D1. (Note: see
requirements for breaking ties when choosing from equally good classifiers).

b. (2’, L2) What is the weighted error rate of classifier h1 with data weights D1?
c. (2’, L2) After re-weighting the data according to the results from Round 1, what is the
updated data weight distribution D2 for Round 2? Normalize the weights so that they
sum to 1.

d. (2’, L2) Find classifier h2 for Round 2 that have the minimum weighted error rate for
the data weight distribution D2.

e. (4’, L2) Similar to (c) and (d), compute the weight distribution D3 and find a classifier
h3 for Round 3.

f. (4’, L3) What is the ensemble classifier h0 that combines h1, h2, and h3? Show h0 by
plotting the given data points in a 2-D plane and highlighting the regions where points
would be classified as +1 by h0
.

## 6 Bayes Classifier (16 points)

Using the same training data as in Table 1, your will train a classifier using the Naive Bayes
method.

Purpose
• Understand and practice the principles of Naive Bayes classifier and its training algorithm; compare the trained classification models to see their pros/cons.

Requirements
• Show the steps and calculations to derive the classifier.
• Show the formulas you used to calculate the results.
a. (1’, L2) What is the prior probability of accepted being yes/no estimated from the
data?

b. (3’, L2) What is the conditional probability of attribute in GPA taking each of the values
in {4.0, 3.7, 3.5}, given accepted=yes? Also calculate the conditional probabilities for
each of attributes {university, published, recommendation} taking each of its
possible values given that accepted=yes.

c. (3’, L2) Calculate similiar conditional probabilities asked in (b) with the conditions
replaced by accepted=no.
d. (3’, L2) Based the results you got from (a)-(c), given a student with attributes (GPA=3.7,
university=top-20, published=yes, recommedation=good), calculate the probability of the student being accepted. What will the probability become if the student
has (GPA=3.7, university=top-30, publication=no, recommendation=normal)?

e. (2’, L2) Consider a traning dataset with n tuples and m attributes, and assume each
attributes can take k possible values. What is the time complexity of training a Naive

f. (4’, L3) Discuss the pros and cons of the classification models have trained using
decision trees and Naive Bayes (Name one pro and one con for each model).

## 7 Frequent Subgraph Pattern Mining (15 points)

gSpan is an algorithm for mining frequent graph patterns. On the course page, download
the gSpan package and complete the following questions.
Purpose
• Work on graph mining using gSpan
• Learn how to utilize gSpan package by converting graph and pattern into files of correct
format.

Requirements
• For each question, draw all the pattern graphs as given from the gSpan algorithm. Like
what we did in class, we start with a tree of empty root node and generate patterns
by adding new nodes as children of the pattern tree.

Here is three molecular formulas of acids that we are interested in. Use gSpan to do the
following:
(1) (5’, L2) Practice the algorithm by hand to find ALL patterns that appear in all three
molecular structures. Draw the pattern grow tree and list the pattern and its frequency
of each node. For this question, it is fine to draw the tree by hand and take
a picture of it and paste in the answer. (Hint: the frequency is counted by the
number of molecular structures it appears, not by its number of occurrences. E.g., the
frequency of O-H is 3, though it appears four time.)

(2) (5’, L2) Now, use the gSpan package to find ALL patterns that appear in at least
two molecules. Note that the executable binary we have is compiled on Linux. You
will need to use a Linux machine (such as EWS) to run the algorithm (though you can
hack a little bit to run Linux binary on Mac by emulating — do a Google search for
related results.)

(3) (5’, L2) Find the closed patterns from the result of (2). Considering that sulfuric acid
is a typical inorganic acid while acetic acid is organic, what can you say that determins
the acidity of organic/inorganic compound in general?