Description
Question 1 (50 points)
The file Groceries.csv contains market basket data. The variables are:
1. Customer: Customer Identifier
2. Item: Name of Product Purchased
The data is already sorted in ascending order by Customer and then by Item. Also, the items bought by each customer are all distinct.
After you have imported the CSV file, please discover association rules using this dataset.
a) (10 points) Create a dataset which contains the number of distinct items in each customer’s market basket. Draw a histogram of the number of unique items. What are the median, the 25th percentile and the 75th percentile in this histogram?
b) (10 points) Suppose you are interested in the k-itemsets that can be found in the market baskets of at least seventy-five (75) customers. How many itemsets can you find? Also, what is the largest k value among your itemsets?
c) (10 points) Find the association rules whose Confidence metrics are at least 1%. How many association rules have you found? Please be reminded that a rule must have a non-empty antecedent and a non-empty consequent. Also, you do not need to show those rules.
d) (10 points) Graph the Support metrics on the vertical axis against the Confidence metrics on the horizontal axis for the rules you found in (c). Please use the Lift metrics to indicate the size of the marker.
e) (10 points) List the rules whose Confidence metrics are at least 60%. Please include their Support and Lift metrics.
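The quantities named in Question 1 (basket sizes, k-itemsets, Support, Confidence and Lift) can be illustrated with a minimal sketch in plain Python. The rows below are made-up toy data, not the contents of Groceries.csv, and the names `rows`, `min_count` and `min_confidence` are hypothetical. The sketch assumes rules are generated only from itemsets that pass the frequency threshold, and it enumerates itemsets by brute force, which is only practical for tiny baskets.

```python
from collections import defaultdict
from itertools import combinations
from statistics import quantiles

# Toy (Customer, Item) rows standing in for Groceries.csv; values are made up.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "milk"),
    (4, "bread"), (4, "milk"),
]

# Group the items into one basket (a set) per customer.
baskets = defaultdict(set)
for customer, item in rows:
    baskets[customer].add(item)
n_baskets = len(baskets)

# Part (a): number of distinct items per basket, and the quartiles of those sizes.
sizes = sorted(len(items) for items in baskets.values())
q1, median, q3 = quantiles(sizes, n=4, method="inclusive")

# Part (b): count every k-itemset and keep those found in at least min_count baskets.
min_count = 2  # the assignment uses 75; 2 suits the toy data
counts = defaultdict(int)
for items in baskets.values():
    for k in range(1, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            counts[frozenset(itemset)] += 1
frequent = {s: c for s, c in counts.items() if c >= min_count}
largest_k = max(len(s) for s in frequent)

# Parts (c) and (e): split each frequent itemset into antecedent -> consequent.
min_confidence = 0.6
rules = []
for itemset, count in frequent.items():
    for r in range(1, len(itemset)):  # both sides of the rule stay non-empty
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            consequent = itemset - antecedent
            support = count / n_baskets
            confidence = count / frequent[antecedent]
            lift = confidence / (frequent[consequent] / n_baskets)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence, lift))

for antecedent, consequent, support, confidence, lift in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence, lift)
```

Here Support is the fraction of baskets containing the whole itemset, Confidence divides that count by the antecedent's count, and Lift divides Confidence by the consequent's own support.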
Question 2 (50 points)
Apply the Spectral Clustering method to Spiral.csv. Your input fields are x and y. Wherever needed, specify random_state = 60616 when calling the KMeans function.
a) (10 points) Generate a scatterplot of y (vertical axis) versus x (horizontal axis). How many clusters would you say there are by visual inspection?
b) (10 points) Apply the K-means algorithm directly, using the number of clusters you identified in (a). Regenerate the scatterplot using the K-means cluster identifier to control the color scheme.
c) (10 points) Apply the nearest neighbor algorithm using the Euclidean distance. How many nearest neighbors will you use? Remember that you may need to try a couple of values first and use the eigenvalue plot to validate your choice.
d) (10 points) Retrieve the two eigenvectors that correspond to the two smallest eigenvalues. Display up to ten decimal places the means and the standard deviations of these two eigenvectors. Also, plot the first eigenvector on the horizontal axis and the second eigenvector on the vertical axis.
e) (10 points) Apply the K-means algorithm to the two eigenvectors that correspond to the two smallest eigenvalues. Regenerate the scatterplot using the K-means cluster identifier to control the color scheme.
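The steps in Question 2 (c) through (e) can be sketched as follows. This is one possible reading, not a prescribed solution. It assumes a symmetric 0/1 nearest-neighbor adjacency matrix and the unnormalized graph Laplacian L = D - A, and it uses two synthetic concentric rings in place of Spiral.csv. The value `n_neighbors = 3` and the choice of two clusters are illustrative assumptions, as are all variable names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for Spiral.csv: two concentric rings (not the real data).
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
points = np.vstack([1.0 * ring, 3.0 * ring])
n_points = points.shape[0]

# Pairwise Euclidean distances, then the indices of each point's nearest neighbors.
n_neighbors = 3  # illustrative choice; the assignment asks you to justify yours
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))
nearest = np.argsort(distances, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself

# Symmetric 0/1 adjacency: i and j are linked if either is a neighbor of the other.
adjacency = np.zeros((n_points, n_points))
adjacency[np.repeat(np.arange(n_points), n_neighbors), nearest.ravel()] = 1.0
adjacency = np.maximum(adjacency, adjacency.T)

# Unnormalized graph Laplacian L = D - A; eigh returns eigenvalues in ascending order.
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
embedding = eigenvectors[:, :2]  # eigenvectors of the two smallest eigenvalues

print("means:", np.round(embedding.mean(axis=0), 10))
print("standard deviations:", np.round(embedding.std(axis=0), 10))

# K-means on the two-column embedding rather than on the raw (x, y) coordinates.
kmeans = KMeans(n_clusters=2, random_state=60616, n_init=10)
labels = kmeans.fit_predict(embedding)
```

Plotting the sorted `eigenvalues` gives the eigenvalue plot mentioned in (c). In this toy case the two rings form two disconnected pieces of the neighbor graph, so two eigenvalues are numerically zero before a visible jump.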