## Description

Association Analysis, Data Preprocessing, and Clustering

1 Association Rules (30 points)

The Data

For this portion of the assignment you will be using the Extended Bakery dataset, which describes

transactions from a chain of bakery shops that sell a variety of drinks and baked goods. More

information about the original unmodified dataset can be found here.

You may download the dataset here.

This particular dataset is made up of smaller subsets that contain 1,000, 5,000, 20,000 and

75,000 transactions each. Furthermore, each of these subsets is given in two different formats.

You are free to use either one when writing your solutions. Below is a description of each:

1. Sparse Vector Format: XXX-out1.csv files. Each line of the file has the following format:

• First column: transaction ID

• Subsequent columns: list of purchased goods (represented by their ID code)

Example:

3,0,2,4,6

Transaction 3 contained items 0, 2, 4, 6

2. Full Binary Vector Format: XXX-out2.csv files. Each line has the following format:

• First column: transaction ID

• 50 subsequent columns: A sequence of binary values that indicate if item i (represented in column i) has been purchased in that transaction.

Example:

3,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Transaction 3 contained items 0, 2, 4, 6

The sparse vector format can be easily read into a list of transactions in Python by using the

following code snippet:

T = []  # list of transactions
infile = open('/path/to/file/file.csv', 'r')
for line in infile:
    # Append new transaction (skipping \n and transaction number)
    T.append(line.strip().split(',')[1:])
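The full binary vector format can be read in a similar way. Below is a minimal sketch (not part of the provided code) that converts each row back into a list of purchased item IDs, assuming the layout described above:

```python
# Sketch for the full binary vector format (XXX-out2.csv): keep the index of
# every column whose value is 1 (the column after the transaction ID is item 0,
# the next is item 1, and so on). Note that this yields integer item IDs,
# rather than the strings produced by the snippet above.
T = []
with open('/path/to/file/file-out2.csv', 'r') as infile:
    for line in infile:
        values = line.strip().split(',')[1:]  # drop the transaction ID
        T.append([i for i, v in enumerate(values) if v == '1'])
```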

The Idea: Association Rule Learning

Given the database of transactions, where each transaction is a list of items, find rules that

associate the presence of one set of items with that of another set of items. Ideally, we only

want to find rules that are substantiated by the data; we want to avoid spurious associations.
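As a reminder, the support of a rule X → Y is the fraction of all transactions that contain every item in X ∪ Y, and its confidence is the fraction of transactions containing X that also contain Y. For example (with made-up numbers), if 40 of 1,000 transactions contain both Coffee Eclair and Hot Coffee while 80 contain Coffee Eclair, then the rule {Coffee Eclair} → {Hot Coffee} has support 40/1000 = 0.04 and confidence 40/80 = 0.5.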

What to Do

Table 1.1 provides a map between the product IDs and the actual items. Your task is to find

the association rules that exceed specific values of minimum support and minimum confidence.

You are free to experiment with different values until you find something that produces meaningful/interesting results.

Recall from our class discussions that this process requires two steps, namely finding the frequent

itemsets and discovering the strong association rules within these. To do so, you may re-use


some of the code that we provided for Apriori [1]

, write your own code, or use any Python modules

that offer that functionality. Whichever option you choose, be sure that your output

is formatted in a clear and legible manner.
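For instance, one possible (hypothetical) sketch using the third-party mlxtend package, which provides an Apriori implementation, might look like the following; the support and confidence thresholds here are placeholders for you to experiment with:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encode the list of transactions T (as read above) into a DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(T).transform(T), columns=te.columns_)

# Mine frequent itemsets, then derive strong rules from them
frequent = apriori(onehot, min_support=0.01, use_colnames=True)  # placeholder threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```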

What to Provide

Your output should contain the following:

• For each frequent itemset:

1. All items in it, described by the product names (the products.csv file may be

helpful for mapping IDs → names; see the sketch after this list).

2. The support of the itemset.

• For each rule:

1. All items on the left side of the rule.

2. All items on the right side of the rule.

3. The support of the rule.

4. The confidence of the rule.
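For the ID → name mapping, a minimal sketch (assuming products.csv has two columns, an ID followed by a product name, with no header row; the exact layout is an assumption) could be:

```python
import csv

# Build an ID -> product-name lookup from products.csv
# (column layout assumed: "<id>,<product name>", no header row)
with open('products.csv') as f:
    id_to_name = {int(row[0]): row[1] for row in csv.reader(f) if row}

# Example: translate an itemset of ID codes into product names
itemset = ['0', '2', '4']
print([id_to_name[int(i)] for i in itemset])
```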

Given this output, respond to the following question:

1. Compare the rules that you obtained for each different subset (1,000–75,000 transactions).

How does the number of transactions affect the results you observed?

| ID | Product | ID | Product | ID | Product | ID | Product | ID | Product |
|----|---------|----|---------|----|---------|----|---------|----|---------|
| 0 | Chocolate Cake | 10 | Almond Tart | 20 | Pecan Tart | 30 | Almond Croissant | 40 | Lemonade |
| 1 | Lemon Cake | 11 | Apple Pie | 21 | Ganache Cookie | 31 | Apple Croissant | 41 | Raspberry Lemonade |
| 2 | Casino Cake | 12 | Apple Tart | 22 | Gongolais Cookie | 32 | Apricot Croissant | 42 | Orange Juice |
| 3 | Opera Cake | 13 | Apricot Tart | 23 | Raspberry Cookie | 33 | Cheese Croissant | 43 | Green Tea |
| 4 | Strawberry Cake | 14 | Berry Tart | 24 | Lemon Cookie | 34 | Chocolate Croissant | 44 | Bottled Water |
| 5 | Truffle Cake | 15 | Blackberry Tart | 25 | Chocolate Meringue | 35 | Apricot Danish | 45 | Hot Coffee |
| 6 | Chocolate Eclair | 16 | Blueberry Tart | 26 | Vanilla Meringue | 36 | Apple Danish | 46 | Chocolate Coffee |
| 7 | Coffee Eclair | 17 | Chocolate Tart | 27 | Marzipan Cookie | 37 | Almond Twist | 47 | Vanilla Frappuccino |
| 8 | Vanilla Eclair | 18 | Cherry Tart | 28 | Tuile Cookie | 38 | Almond Bear Claw | 48 | Cherry Soda |
| 9 | Napoleon Cake | 19 | Lemon Tart | 29 | Walnut Cookie | 39 | Blueberry Danish | 49 | Single Espresso |

Table 1.1: Products in the dataset and their corresponding codes.

[1] http://nbviewer.ipython.org/github/cse40647/cse40647/blob/sp.14/10%20-%20Apriori.ipynb


2 Data Preprocessing (30 points)

The Data

For this portion of the assignment you will be using a set of images of the University of Notre

Dame campus. The dataset consists of 30 color images capturing the campus during the spring,

fall, and winter (10 images for each season). Each image consists of 62,500 pixels (250×250

pixels). Each pixel is defined by three color values varying between 0 and 255 that specify the

degree of red, green, and blue present at that point.

You may download the dataset here.

The Idea: PCA with Image Histograms

Your objective here will be to perform dimensionality reduction on the dataset to view the

similarity of the images within some lower-dimensional (feature) space. Specifically, we would

like to perform a transformation on the images so that they can be viewed in a 2-dimensional

space, which we can then visualize via a plot. This can be accomplished by performing principal

component analysis (PCA) on some property of the images and retaining only the top 2 principal

components. Ideally, we want to apply PCA to a property of the images that can be used to

capture some notion of the “similarity” between images.

Intuitively, since we typically view an image as a collection of pixels, we might consider performing PCA on the set of pixels, reducing an image from a collection of 62,500 pixel values

to a new 2-dimensional space. In other words, we would consider each pixel as a feature of an

image, and try to transform an image into a new set of 2 features. Ideally, images that have a

similar set of pixels should also be similar in this 2-dimensional space (i.e., they should share

similar feature values). However, due to the magnitude of this reduction, a significant amount

of information would likely be lost in the transformation, meaning that pictures with similar

pixel values may not appear similar in the new 2-dimensional space.

A better approach might be to reduce the histogram of color values. Each color has 256 possible

values, resulting in a histogram of 768 (256*3) color values distributed over the entire range

of 62,500 pixels. Each value of the histogram is the number of pixels in the image with the

corresponding color value. Here, we would consider each histogram value as a feature of an

image. By performing PCA on an image histogram, we are only reducing an image from a

set of 768 histogram values (rather than the 62,500 pixel values) to a new 2-

dimensional space. As a result of this transformation, images that have a similar histogram

should also be similar in this 2-dimensional space.
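To make the histogram concrete, here is a small illustrative sketch (using NumPy and made-up pixel data, not the provided load_images output) of how 62,500 RGB pixels collapse into a 768-value histogram:

```python
import numpy as np

# Hypothetical 250x250 RGB image with 8-bit channel values
pixels = np.random.randint(0, 256, size=(250, 250, 3), dtype=np.uint8)

# For each channel, count how many pixels take each of the 256 possible
# values, then concatenate the three counts into one 768-value histogram
hist = np.concatenate(
    [np.bincount(pixels[:, :, c].ravel(), minlength=256) for c in range(3)]
)
assert hist.shape == (768,)
assert hist.sum() == 3 * 250 * 250  # every pixel is counted once per channel
```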

What to Do

The IPython Notebook provided with this assignment includes functions to compute the histograms and plot the images within the transformed (2-dimensional) space (load_images and

plot_image_space, respectively). There are also functions to generate and plot the color

palettes associated with each image (cluster_image_colors and plot_color_palette,

respectively); the palettes are generated via (k-means) clustering of the pixel color values, and

may be investigated at your own leisure—they are not needed to complete the assignment.

The images can be loaded and the histograms generated by running the following code snippet

(which imports pandas as pd). Please ensure that the directory provided to the load_images

function is correct. For example, if you have placed all the images in your base IPython

Notebook directory in a folder labeled images, with the campus images in a subfolder labeled

campus, then the (relative) path to the campus images would be images/campus.


import pandas as pd

# Load images and generate color histograms
images = load_images('/path/to/campus_images')
X = pd.DataFrame([im.histogram() for im in images])

Once you have a DataFrame of image histograms, the PCA transformation of this data can

be computed and plotted in Python by running the following code snippet (which imports

decomposition from scikit-learn):

from sklearn import decomposition

# Generate PCA transformation and plot the results
estimator = decomposition.PCA(n_components=2)
X_proj = estimator.fit_transform(X)
plot_image_space(images, X_proj, 'title of plot')

The plot generated by this code snippet will display the images according to the top 2 principal

components (or eigenvectors) found by the PCA transformation of the image histogram.
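As an optional sanity check (a small sketch, not required output), the fitted scikit-learn estimator exposes how much of the histogram variance those two components actually explain:

```python
# Fraction of the total variance captured by each retained component;
# the two fractions sum to at most 1.0
print(estimator.explained_variance_ratio_)
print(estimator.explained_variance_ratio_.sum())
```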

What to Provide

Your output should contain the following:

• The PCA projection of the image color histograms in 2 dimensions. Using the provided

plot_image_space function, this should be displayed as thumbnail images distributed

within a 2-dimensional plot.

Given this output, respond to the following questions:

1. What does it mean for two images to be close together in this plot? What does it mean

for two images to be far apart?

2. Do images corresponding to one of the seasons tend to group together more closely than

others? Why might this be the case?


3 Clustering (40 points)

The Data

For this portion of the assignment you will be using the Iris flower dataset. The dataset consists

of 50 samples from each of three species of Iris, with four features measured from each sample.

scikit-learn provides a function to load the dataset (no download required).

The Idea: Choosing k for k-means

Your objective here will be to assess the performance of k-means clustering on the Iris dataset.

Recall that the number of clusters, k, is an input parameter to the k-means algorithm. A variety

of measurements are available for estimating the optimal value of k. For this assignment, you

will look at the sum of squared deviation (SSQ) and the gap statistic. Both of these criteria

make use of the intuition that k-means tries to minimize variance (the distance or deviation of

each point from its assigned cluster centroid) by iteratively assigning points to their nearest clusters.

Choosing k with SSQ

The SSQ criterion is a direct application of the intuition that k-means tries to minimize variance.

Recall that the SSQ criterion sweeps over a range of possible k values, with each value of k

associated with a degree of deviation (the distance of each point from its nearest of the k cluster centroids).

These deviations can be squared and summed to arrive at the “sum of squared deviation” (SSQ)

for each value of k. Larger values of k are expected to continue to reduce the SSQ (because there

are more clusters for points to be near, reducing their deviation). However, one could expect a

leveling-off in the SSQ once the value of k exceeds the true number of clusters, as this would

result in true clusters (that is, clusters actually present in the data) being separated. If, then,

one plots the SSQ over a range of k values, this leveling-off point may produce a noticeable

“elbow” in the plot. By this criterion, the estimated optimal value of k is that which occurs at

this elbow point. While simple, the difficulty with this criterion is that often the elbow point is

not distinctive or well-defined.
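For intuition, a minimal sketch of this criterion using scikit-learn's k-means inertia (the sum of squared distances of points to their nearest centroid) as the SSQ might look like the following; this is only an illustration, not the provided ssq_statistics function:

```python
from sklearn import cluster, datasets

# Sweep k and record the SSQ (inertia) for each clustering of the Iris data
data = datasets.load_iris().data
for k in range(1, 11):
    km = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, km.inertia_)  # look for the k where the decrease levels off
```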

Choosing k with the Gap Statistic

The gap statistic provides a criterion that produces a quantifiable estimate of the optimal value

of k over a range of possible k values. The intuition here is that there is an expected degree of

deviation associated with clustering any given dataset. We want the number of clusters, k, that

displays the largest “gap” between the deviation we expect, given the dataset and the number

of clusters, and the deviation we estimate or observe. Thus, rather than simply considering

estimated deviation by itself, we can standardize the estimated deviation for a possible value of

k by comparing it with the expected deviation under an appropriate null reference distribution of

the data (e.g., a uniform distribution). This difference or gap between the expected deviation

and the estimated deviation is termed the gap statistic. Maximizing this gap statistic then

corresponds to minimizing the estimated deviation relative to what would be expected. To

ensure that we do not needlessly posit additional clusters (i.e., larger values of k), we only

consider the value k+1 if its gap statistic (minus any measurement error) is higher than that

for k. By this criterion, the lowest value of k whose gap statistic is higher than or

equal to the gap statistic of k+1 (less its measurement error) is the estimated optimal value of k.
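The following is a rough sketch of that idea (again only an illustration, not the provided gap_statistics function), using k-means inertia as the dispersion measure and uniform reference data drawn from the bounding box of the dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_for_k(X, k, n_refs=10, seed=0):
    """Gap statistic for a single k: expected log-dispersion under a uniform
    reference distribution minus the observed log-dispersion on the data."""
    rng = np.random.default_rng(seed)
    observed = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                      .fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    reference = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return np.mean(reference) - observed  # larger gap = better-than-expected fit
```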

What to Do

The provided assignment IPython Notebook includes functions to compute and plot the gap

statistic (gap_statistics and plot_gap_statistics) and the sum of squared deviation

(ssq_statistics and plot_ssq_statistics).

The Iris flower dataset can be loaded in Python by using the following code snippet (with


datasets imported from scikit-learn):

from sklearn import datasets

# Load the Iris flower dataset
iris = datasets.load_iris()
data = iris.data

Then the sum of squared deviations (SSQ) can be easily run and the results plotted by using

the following code snippet:

# Generate and plot the SSQ statistics
ssqs = ssq_statistics(data, ks=range(2, 11+1))
plot_ssq_statistics(ssqs)

Similarly, the gap statistic can be easily run and the results plotted by using the following code

snippet:

# Generate and plot the gap statistics
gaps, errs, difs = gap_statistics(data, nrefs=20, ks=range(2, 11+1))
plot_gap_statistics(gaps, errs, difs)

Both the SSQ and gap statistic code snippets require a variable ks, which defines the range of

k values (the “ks”) to evaluate. For example, if you would like to evaluate k values between 2

and 11 (inclusive), you could set ks as range(2,11+1).

The function plot_gap_statistics generates two plots. The first plot simply displays the

gap statistic for each k value evaluated. The second plot displays the difference between the

gap statistic for value k and that computed for value k+1. On the second plot, the k at which the first non-negative difference (gap difference) occurs is the optimal number of clusters estimated by the gap statistic

criterion.
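A hedged sketch of reading that estimate off programmatically (assuming difs, returned by gap_statistics above, holds one gap difference per evaluated k, aligned with ks = range(2, 11+1); the exact shape of that output is an assumption) might be:

```python
import numpy as np

# Pick the smallest evaluated k whose gap difference is non-negative
ks = list(range(2, 11 + 1))
difs = np.asarray(difs)
optimal_k = next((k for k, d in zip(ks, difs) if d >= 0), None)
print("Estimated optimal k:", optimal_k)
```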

What to Provide

Your output should contain the following:

• The SSQs computed for k values between 1 and 10 (inclusive). There should be one plot

corresponding to the SSQs.

• The gap statistics computed for k values between 1 and 10 (inclusive). There should be

two plots corresponding to the gap statistics.

Given this output, respond to the following questions:

1. Where did you estimate the elbow point to be (between what values of k)? What value

of k was typically estimated as optimal by the gap statistic? To adequately answer this

question, consider generating both measures several times, as there may be some amount

of variation in the value of k that they each estimate as optimal.

2. How close are the estimates generated by the elbow point and gap statistic to the number

of species of Iris represented in the dataset?

3. Assuming we are trying to generate one cluster for each Iris species represented in the

dataset, does one measure seem to be a consistently better criterion for choosing the value

of k than the other? Why or why not?


4 Graduate Student Portion

(20 points / +10 points for undergraduates)

The Data

We know what can improve this assignment: cute kittens. So for this portion of the assignment

you will be using a set of (cute) kitten images. The dataset consists of 25 color images, each

preprocessed so that the kittens’ faces are similar in size and orientation. Each image consists

of 62,500 pixels (250×250 pixels).

You may download the dataset here.

The Idea: Dimensionality Reduction in Face Recognition

Your job, if you accept it (and if you’re a graduate student, you must), is to generate face

clusters (via k-means clustering) and eigenfaces (via PCA) from this set of kitten face images.

Generating Clusters of Faces

Intuitively, if we wanted to reduce the number of images in a dataset, we could try to cluster

these images into groups. If each group is defined by a cluster centroid (that is, an image that

represents all members of the cluster) as in k-means clustering, then we could simply retain

these centroid images and discard the remaining ones. In this way, we are considering each

image as a dimension, and reducing the dimensionality of the set of images, to much the same

effect as PCA. A drawback of this method, however, is that there may not be any natural

clustering of the images, resulting in a potentially significant loss of information if we retain

only the centroid images.

Generating Eigenfaces

Recall that the Eigenface approach to facial recognition is the use of PCA to reduce a collection

of face images to a smaller set of reference images. The term typically refers to the use of

this approach on human faces, but the same concept is applicable to kitten faces as well. The

intuition here is that the eigenfaces will constitute a set of “standardized face ingredients,”

which may be used to match subsequent face images. We are considering each image as a

dimension, and reducing the dimensionality of the set of images.

Note that both the face clustering and Eigenface approaches perform dimensionality reduction

on the set of images (that is, we reduce the number of images, where each image is a dimension).

This contrasts with performing dimensionality reduction on individual images by reducing the

set of pixels to a smaller set of component features (where each pixel is a dimension).

What to Do

The provided assignment IPython Notebook includes functions to load and plot the (kitten)

images (load_images and plot_gallery, respectively).

The kitten images can be easily loaded in Python by using the following code snippet, which

makes use of our provided load_images function (and assumes that NumPy has been imported

as np). Please ensure that the directory provided to the load_images function is correct.

For example, if you have placed all the images in your base IPython Notebook directory in a

folder labeled images, with the kitten images in a subfolder labeled cute_kittens, then the

(relative) path to the kitten images would be images/cute_kittens.


# Load an array of kitten images
kittens = load_images('/path/to/kitten_images', grayscale=True)
kittens = np.array(kittens)  # transform to a NumPy array
n_samples, n_features = kittens.shape  # number of rows and columns
image_shape = (100, 100)  # the image dimensions (width and height)

Before performing dimensionality reduction (our objective in generating clusters and eigenfaces),

some additional data preprocessing is needed. Generally, dimensionality reduction techniques

require centering the data globally (i.e., subtracting the mean from each data point for each

feature), because otherwise they may capture spurious fluctuations in the data. Though not

strictly needed, we may also center the data locally (i.e., subtracting the mean from each data

point for each image) to enhance the visualization of the images.

The following code snippet centers the kitten data and displays the first few images in the

centered dataset by making use of our provided plot_gallery function. The snippet assumes that the number of retained components (clusters or eigenfaces) has been defined as

n_components (an integer) and that the number of columns and rows of images to display

has been defined as n_col and n_row (also integers), respectively. Note that n_components

must equal the product of n_col and n_row, or you may otherwise receive an error.

# Global centering
kittens_centered = kittens - kittens.mean(axis=0)

# Local centering
kittens_centered -= kittens_centered.mean(axis=1).reshape(n_samples, -1)

# Plot the first few centered kitten images
assert n_components == n_col * n_row
plot_gallery("title of plot", kittens_centered[:n_components], n_col, n_row)

The following code snippet performs k-means clustering on the centered kitten data, thus generating face clusters, and plots the results (with the number of clusters defined as n_components

and the number of columns and rows of images to display defined as n_col and n_row, respectively):

import sklearn.cluster  # provides the KMeans estimator used below

# Compute k-means clustering on the dataset
estimator = sklearn.cluster.KMeans(n_clusters=n_components,
                                   tol=1e-3, max_iter=50)
estimator.fit(kittens_centered)

# Plot the resulting kitten face clusters
assert n_components == n_col * n_row
plot_gallery("title of plot",
             estimator.cluster_centers_[:n_components], n_col, n_row)

By using the following code snippet, PCA can be performed on the centered kitten data, thus

generating eigenfaces, and results plotted (with the number of retained components defined as

n_components and the number of columns and rows of images to display defined as n_col

and n_row, respectively). Note that, while we have already explicitly centered the data, the

PCA function used here actually performs global centering for us.


# Compute PCA on the dataset
estimator = decomposition.RandomizedPCA(n_components=n_components,
                                        whiten=True)
estimator.fit(kittens_centered)

# Plot the resulting PCA eigenfaces
assert n_components == n_col * n_row
plot_gallery("title of plot",
             estimator.components_[:n_components], n_col, n_row)
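As an optional extension (a hedged sketch, not required output), the fitted estimator can also project the centered images onto the retained eigenfaces, so that two kitten faces can be compared by their distance in that low-dimensional space:

```python
import numpy as np

# Project each centered kitten image onto the retained eigenfaces
weights = estimator.transform(kittens_centered)  # shape: (n_samples, n_components)

# Smaller distances in eigenface space suggest more similar-looking faces
distance = np.linalg.norm(weights[0] - weights[1])
print("Distance between kittens 0 and 1 in eigenface space:", distance)
```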

What to Provide

Your output should contain the following:

• The face clusters computed for at least 2 values of n_components. The number of

displayed face clusters should correspond to this value.

• The eigenfaces computed for at least 2 values of n_components. The number of displayed

eigenfaces should correspond to this value.

Given this output, respond to the following questions:

1. For each eigenface, darker regions indicate pixels that contribute more to that eigenface

and lighter regions indicate pixels that contribute less; in this sense, the further a feature

is from a neutral color—in this case gray—the more that feature is “captured” by the

eigenface. Briefly describe the sort of cute kitten facial features captured (i.e., those facial

features represented by noticeably lighter or darker tones) by the first few eigenfaces. As

the first eigenvector should capture the most variance, the first eigenface should correspond

to the most general facial features.

2. How do the face clusters (generated by k-means clustering) compare to the eigenfaces

(generated by PCA)? As the kitten images have no obvious clusterings, it may be more

difficult to interpret the facial features to which each cluster corresponds.
