Description
1. Challenge: Automatic Classification of images via SVD decomposition
You will write an algorithm in MATLAB for the classification of handwritten digits. For this
you can download the following files from https://www.math.ucdavis.edu/~deloera/TEACHING/
MATH160/guessdigit-project.zip
NOTE: Data comes in compressed form, to open you type (after downloading), unzip guessdigitproject.zip. Inside the directory that will open up you will find 5 files:
ima2.m — Code Displays an image vector in the right orientation
azip.mat — the matrix of training digits data
dzip.mat — command dzip(i) tells you the (correct) digit you have in column
i of the matrix azip
dtest.mat — tells you the (correct answer) of test digits
testzip.mat — the test digits data
Use the training set, and compute the SVD of each class matrix (classes are those matrices that
represent the same digit). Use the first few (5, 10, 20) singular vectors as basis of a class and classify
unknown test digits according to how well they can be represented in terms of the respective bases
(use the relative residual vector in the least squares problem as a measure). Here are some specific
tasks.
• Write your code to do classification, it brakes the training data in classes, computes the SVD
of each class and uses that to make predictions. It takes in a test data point and makes a
prediction.
• Tune the algorithm for accuracy of classification. Give a table or graph of the percentage of
correctly classified digits as a function of the number of basis vectors. Graph the situation for
5, 10, 20 basis vectors. Display the results in a table (or tables).
• Check the singular values of the different classes. Is it reasonable to use different numbers
of basis vectors for different classes? If so, perform a few experiments to find out if it really
pays off to use fewer basis vectors in one or two of the classes (i.e., do you get different/same
outcome?).
• Check if all digits are equally easy or difficult to classify. Also look at one of the difficult ones,
and see that in many cases they are very badly written. What is the most difficult digit to read
for the computer? Does it help to increase the number of singular vectors you used? Write
comments at the very end of your code with your thoughts.
2. Challenge: Models for predicting the quality of wines.
In this problem we will use two types of convex-linear models to to predict wine quality (as judged
by enologists) from chemical measurements. The dataset you need to use is available at http:
//www.math.ucdavis.edu/~deloera/TEACHING/MATH160/winesinfo.csv. In each line, the first
11 columns contain the results from various chemical tests performed on the wine, and the last
column is the evaluation of how good the wine is (a score between 0 and 10).
First, consider a model based on Linear programming. For wine sample i, let us denote by yi ∈ R its
score and by xi ∈ R
11 its chemical properties. Construct a linear model to predict yi as a function
of xi
, that is, we want to find a ∈ R
11 and b ∈ R such that:
yi ‘ a
|xi + b.
The quality of the model will be evaluated using the `1 norm, i.e., we want to find a solution to
this optimization problem:
min
a∈R11
b∈R
1
n
Xn
i=1
|yi − a
|xi − b|.
a. Remember from class that the above problem is equivalent to the following linear program:
min
a∈R11
b∈R
z∈Rn
1
n
Xn
i=1
zi
s.t. zi ≥ yi − a
|xi − b, 1 ≤ i ≤ n
zi ≥ a
|xi + b − yi
, 1 ≤ i ≤ n
Explain how to rewrite this problem in matrix form:
min
d∈R12+n
c
|d
s.t. Ad ≤ b
In particular, give the dimensions and definitions of c, A and b.
b. Use an LP solver (again, we recommend using SCIP, MATLAB works too) to solve the above
problem and report your code as well as the optimal value of the problem. Note that the value
of the problem is exactly the average absolute error of the linear model on the dataset. Does
it seem to be within an acceptable range?
In this next part, you will re-use the dataset to fit a different linear model to predict wine quality
as a function of the chemical measurements, but we will use least-squares regression instead of
`1-regression.
a. Compute the optimal solution to the least-squares regression problem. Report your code, the
linear model (a and b) and the value of function RSS for this model.
b. Now sparsify the model to find the key 4 features of the wine that make it a good wine. What
are the top 4 features for deciding quality of wines according to a LASSO model?
3. Challenge: How to influence the opinion of people when on a budget (Facebook?).
It is in fashion to influence voters in elections. In this problem, you will consider a simplified model
of influence in social networks: the social network is represented by an undirected graph G = (V, E)
and for a set of nodes S ⊆ V , we denote by N(S) the set of their neighbors:
N(S) = {v ∈ V | ∃u ∈ S, (u, v) ∈ E}.
The influence I(S) of a set of nodes is measured by I(S) = |N(S)|.
Imagine you work now for facebook and you are given the dataset available at http://www.math.
ucdavis.edu/~deloera/TEACHING/MATH160/facebookgraph.txt. This dataset is in fact a subgraph of the Facebook social graph. Each line in the file contains the id of two users, indicating
that these two users are friends on Facebook.
a. As the designer of a marketing campaign to influence the opinion of voters, your goal is to find
a subset S ⊆ V of at most K nodes whose influence is maximal. Write a mathematical model
to solve this problem. Can you solve the problem for the data you were given using SCIP? If
not, can you write a practical method to give an approximation to the optimum? Extra bonus
if you can prove a guarantee of approximation.
b. Using any of your models/methods from part (a) write a computer function which, given the
social network described in the dataset and a budget K ∈ N
+, returns an approximately
optimal set of nodes S for the influence function I(S). The function should return both the
users to influence and the value (total amount of influence) obtained.
c. Plot the influence I(S) obtained by your function as a function of the budget K. E.g., what
happens when K = 1 and K = |V |?
d. In the definition of I(S) a node with largest degree is one of the most influential persons, i.e.,
he/she is a VIP. Can you think of a way to adapt the pagerank algorithm to identify VIP
nodes in the graph? Can you invent of a third additional way to define who are the VIPs in
this graph? What else could you use to measure influence?