Description
1. Wikipedia Wisconsin PageRank. A dataset was created by looking at the links
between Wikipedia pages with the word ‘Wisconsin’ in the title. The corresponding
graph contains 5482 nodes (Wikipedia pages) and 41,306 edges (links between the
pages).
Two files contain the data. The edges are stored in a file with two columns of integers, where the first column indicates the from node and the second column indicates the to node. A second file contains the titles of the Wikipedia pages and their corresponding integer node values.
Use the starter script to load the edges file and create an adjacency matrix A, where Aᵢⱼ = 1 if there is an edge from node j to node i, and zero elsewhere.
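The orientation of A is easy to get backwards: the column indexes the page a link leaves and the row indexes the page it points to. A minimal sketch of building A this way, assuming the edge list is a plain two-column text file; the file name edges.txt and the 0-based node ids are assumptions, since the starter script defines the actual format:

    import numpy as np

    # Hypothetical file name; the starter script provides the real edge-list file.
    # Each row is (from node, to node); subtract 1 here if the ids are 1-based.
    edges = np.loadtxt("edges.txt", dtype=int)

    n = 5482                           # number of Wikipedia pages (nodes)
    A = np.zeros((n, n))
    A[edges[:, 1], edges[:, 0]] = 1.0  # A[i, j] = 1 for a link from page j to page i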
a) Write code to:
i. ensure there are no traps by adding 0.001 to each entry of A
ii. normalize A, and
iii. use an eigendecomposition to rank the importance of the Wikipedia pages (a minimal sketch of steps i–iii appears after part c)).
b) What is the title of the page ranked 1st (i.e., the most important page)?
c) What is the title of the page ranked 3rd?
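A minimal sketch of steps i–iii, assuming A is the adjacency matrix described above and that titles is a hypothetical list of page titles indexed by node id (the starter script's actual variable names may differ):

    import numpy as np

    M = A + 0.001                           # i. no traps: every column has nonzero mass
    M = M / M.sum(axis=0, keepdims=True)    # ii. normalize so each column sums to 1

    # iii. the ranking vector is the eigenvector for the dominant eigenvalue (equal to 1
    # for a column-stochastic matrix); pages are ranked by the size of their entries.
    eigvals, eigvecs = np.linalg.eig(M)
    v = eigvecs[:, np.argmax(eigvals.real)].real
    v = np.abs(v) / np.abs(v).sum()         # fix sign and scale to get nonnegative scores

    order = np.argsort(-v)                  # node ids from most to least important
    print(titles[order[0]], titles[order[2]])   # parts b) and c); titles is hypothetical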
2. Data Fitting vs. Sparsity Tradeoff. This assignment uses the dataset BreastCancer.mat to explore sparse regularization of a least squares problem. The journal article "A gene-expression signature as a predictor of survival in breast cancer" provides background on the role of genes in breast cancer.
The goal is to solve the Lasso problem
    w∗ = arg min_{w ∈ ℝⁿ} ‖Aw − d‖₂² + λ‖w‖₁
Here w is the weight vector applied to the expression levels of 8141 genes and there
are 295 patients (feature sets and labels). In this problem we will vary λ to explore
the tradeoff between data-fitting and sparsity.
Scripts that implement iterative soft thresholding via proximal gradient descent to
solve the LASSO problem are available. The scripts use a hot start procedure for
finding the solution with different values for λ. The initial guess for the next value of
λ is the converged solution for the preceding value. This accelerates convergence when
subsequent values of λ lead to similar solutions.
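For reference, a minimal sketch of an iterative soft-thresholding (proximal gradient) solver used with the hot-start idea described above; the provided scripts may differ in step size, stopping rule, and the factor-of-two convention in the gradient:

    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_ista(A, d, lam, w0, n_iter=500):
        """Minimize ||A w - d||_2^2 + lam*||w||_1 by proximal gradient, starting at w0."""
        step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1/L, L = gradient Lipschitz constant
        w = w0.copy()
        for _ in range(n_iter):
            grad = 2.0 * A.T @ (A @ w - d)               # gradient of the smooth term
            w = soft_threshold(w - step * grad, step * lam)
        return w

    # Hot start: sweep lambda and reuse the previous solution as the next initial guess.
    # w = np.zeros(A.shape[1])
    # for lam in lambdas:
    #     w = lasso_ista(A, d, lam, w0=w)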
a) Write code to find the optimal weights using only the first 100 patients (first 100 rows). Create a plot with the residual ‖Aw∗ − d‖₂ on the vertical axis and ‖w∗‖₁ on the horizontal axis, parameterized by λ. In other words, create the curve by finding w∗ for different λ and plotting ‖w∗‖₁ vs. ‖Aw∗ − d‖₂ (a code sketch covering parts a)–c) follows part c)). Experiment with λ to find a range that captures the variation from the least-squares solution (small λ) to the all-zeros solution (large λ). Appropriate values of λ may range from 10⁻⁶ to 20, spaced logarithmically. Explain the result.
b) Next use your solutions from part a) to plot the error rate on the vertical axis versus the sparsity on the horizontal axis as λ varies. Define the error rate as the number of incorrect predictions divided by the total number of predictions, and the sparsity as the number of nonzero entries in w∗. For this purpose, we'll say an entry wᵢ is nonzero if |wᵢ| > 10⁻⁶. Calculate the error rate using the training data, the data used to find the optimal weights. Explain the result.
c) Repeat parts a) and b) to display the residual and error rate, respectively, using the validation or test data (rows 101-295 of the data matrix), that is, the data not used to design the optimal classifier. Again, explain what you see in each plot.
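One possible sketch covering parts a)–c) (referenced in part a) above), reusing the lasso_ista helper from the earlier sketch. The variable names inside BreastCancer.mat ("X" for the 295-by-8141 gene-expression matrix, "y" for the labels) and the ±1 label coding used for the error rate are assumptions; adjust them to the actual contents of the file:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import loadmat

    data = loadmat("BreastCancer.mat")
    A_full, d_full = data["X"], data["y"].ravel()      # assumed variable names

    A_tr, d_tr = A_full[:100], d_full[:100]            # training rows, parts a) and b)
    A_te, d_te = A_full[100:295], d_full[100:295]      # validation/test rows, part c)

    lambdas = np.logspace(-6, np.log10(20), 30)        # suggested range, log-spaced
    res_tr, res_te, l1, nnz, err_tr, err_te = [], [], [], [], [], []

    w = np.zeros(A_tr.shape[1])
    for lam in lambdas:
        w = lasso_ista(A_tr, d_tr, lam, w0=w)          # hot start from the previous lambda
        l1.append(np.abs(w).sum())
        nnz.append(int(np.sum(np.abs(w) > 1e-6)))      # sparsity: count of "nonzero" entries
        res_tr.append(np.linalg.norm(A_tr @ w - d_tr))
        res_te.append(np.linalg.norm(A_te @ w - d_te))
        err_tr.append(np.mean(np.sign(A_tr @ w) != d_tr))   # labels assumed to be +/-1
        err_te.append(np.mean(np.sign(A_te @ w) != d_te))

    plt.figure(); plt.plot(l1, res_tr, "o-")                 # part a)
    plt.xlabel("||w*||_1"); plt.ylabel("||A w* - d||_2 (training)")
    plt.figure(); plt.plot(nnz, err_tr, "o-")                # part b)
    plt.xlabel("number of nonzero weights"); plt.ylabel("training error rate")
    plt.show()                               # part c): repeat the plots with res_te and err_te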
3. Now compare the performance of LASSO and ridge regression for the breast cancer dataset using the following steps:
• Randomly split the set of 295 patients into ten subsets of size 29-30.
• Use the data in eight of the subsets to find a solution to the Lasso optimization above and to the ridge regression problem

    min_w ‖Aw − d‖₂² + λ‖w‖₂².

Repeat this for a range of λ values to obtain a set of solutions w_λ.
• Compute the prediction error of each w_λ on one of the remaining two of the ten subsets. The value of λ whose solution gives the smallest prediction error is the best λ. Note that LASSO and ridge regression will generally produce different best values of λ.
• Compute the test error on the final subset of the data for the choice of λ that
minimizes the prediction error. Compute both the squared error and the error
rate.
Repeat this process for different splits into eight training subsets, one tuning (λ-selection) subset, and one testing subset, and compute the average squared error and the average number of misclassifications across all of the splits.
Note that you should use the identity derived in Problem 1 of Activity 5.2 in order to speed up the computation of the ridge regression solutions.
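A sketch of a single round of this procedure, reusing lasso_ista and the A_full, d_full arrays from the earlier sketches. It assumes the identity in question is the push-through identity (AᵀA + λI)⁻¹Aᵀ = Aᵀ(AAᵀ + λI)⁻¹ (verify against Activity 5.2), which reduces the ridge solve to a system of size equal to the number of training patients rather than the number of genes; prediction error on the tuning subset is taken here to be the residual norm. Wrap the round in a loop over different fold assignments to produce the requested averages:

    import numpy as np

    def ridge(A, d, lam):
        """Ridge solution via the (assumed) push-through identity:
        w = A.T @ inv(A @ A.T + lam*I) @ d, an m-by-m solve with m = number of patients."""
        m = A.shape[0]
        return A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), d)

    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(A_full.shape[0]), 10)   # ten subsets of size 29-30

    train_idx = np.concatenate(folds[:8])               # eight subsets for fitting
    tune_idx, test_idx = folds[8], folds[9]              # one for choosing lambda, one for testing
    A_tr, d_tr = A_full[train_idx], d_full[train_idx]
    A_tu, d_tu = A_full[tune_idx], d_full[tune_idx]
    A_te, d_te = A_full[test_idx], d_full[test_idx]

    lambdas = np.logspace(-6, np.log10(20), 20)
    results = {}
    for name, solve in [("lasso", lambda lam, w0: lasso_ista(A_tr, d_tr, lam, w0)),
                        ("ridge", lambda lam, w0: ridge(A_tr, d_tr, lam))]:
        w0, candidates = np.zeros(A_tr.shape[1]), []
        for lam in lambdas:
            w0 = solve(lam, w0)                          # hot start only matters for LASSO
            candidates.append((np.linalg.norm(A_tu @ w0 - d_tu), lam, w0))
        _, best_lam, w_best = min(candidates, key=lambda c: c[0])   # best lambda per method
        results[name] = (best_lam,
                         np.linalg.norm(A_te @ w_best - d_te) ** 2,   # test squared error
                         np.mean(np.sign(A_te @ w_best) != d_te))     # test error rate (+/-1 labels)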