Homework 2 – Machine Learning CS4342


1. Linear regression for age estimation: Train an age regressor that analyzes a (48×48 = 2304)-pixel
grayscale face image and outputs a real number ŷ that estimates how old the person is (in years). Your
regressor should be implemented using linear regression. The training and testing data are available
here:
• https://s3.amazonaws.com/jrwprojects/age_regression_Xtr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_ytr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_Xte.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_yte.npy
Note: you must complete this problem using only linear algebraic operations in numpy – you may not
use any off-the-shelf linear regression software, as that would defeat the purpose.
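For reference, here is a minimal sketch of one way to load the data with numpy (this assumes the four
.npy files above have been downloaded into the working directory; the reshape is an assumption in case
the images are stored as 48×48 arrays rather than flat 2304-vectors):

    import numpy as np

    # Load the training and testing data (files downloaded from the URLs above).
    X_tr = np.load("age_regression_Xtr.npy")
    y_tr = np.load("age_regression_ytr.npy")
    X_te = np.load("age_regression_Xte.npy")
    y_te = np.load("age_regression_yte.npy")

    # Flatten each face image into a 2304-dimensional row vector, in case the
    # images are stored as 48x48 arrays (an assumption about the file format).
    X_tr = X_tr.reshape(X_tr.shape[0], -1)
    X_te = X_te.reshape(X_te.shape[0], -1)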
(a) Analytical solution [15 points]: Compute the optimal weights w = (w1, . . . , w2304) and bias
term b for a linear regression model by deriving the expression for the gradient of the cost function
w.r.t. w and b, setting it to 0, and then solving. The cost function is
    f_{MSE}(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

where ŷ = g(x; w, b) = x^⊤ w + b and n is the number of examples in the training set
Dtr = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, with each x^(i) ∈ R^2304 and each y^(i) ∈ R (an age in years).
Suggestion: to solve for w and b simultaneously, use the trick shown in class whereby each image
(represented as a vector x) is appended with a constant 1 term (to yield an appended representation x̃).
Then compute the optimal w̃ (comprising the original w and an appended b term) using the closed formula:
    \tilde{w} = \left( \tilde{X} \tilde{X}^\top \right)^{-1} \tilde{X} y
(Here X̃ denotes the matrix whose columns are the appended training examples x̃^(i), and y is the
vector of training labels.) For appending, you might find the functions np.hstack, np.vstack, and
np.atleast_2d useful. After optimizing w̃ (using fMSE) only on the training set, compute and report
(in the PDF file) the cost fMSE on the training set Dtr and (separately) on the testing set Dte.
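A minimal sketch of the closed-form solution, under the assumption that X is stored with one example
per row (n × 2304); with that convention the formula above becomes w̃ = (X̃^⊤X̃)^{-1} X̃^⊤ y (the function
names here are placeholders, not prescribed by the assignment):

    import numpy as np

    def fmse(X, y, w, b):
        # Mean-squared-error cost: (1/(2n)) * sum((yhat - y)^2).
        yhat = X.dot(w) + b
        return np.mean((yhat - y) ** 2) / 2

    def train_analytical(X, y):
        # Append a constant-1 column to each example so the bias b is learned
        # jointly with w (the "x tilde" trick from the problem statement).
        Xt = np.hstack((X, np.ones((X.shape[0], 1))))
        # Solve the normal equations; np.linalg.solve is preferable to forming
        # an explicit matrix inverse.
        w_tilde = np.linalg.solve(Xt.T.dot(Xt), Xt.T.dot(y))
        return w_tilde[:-1], w_tilde[-1]   # (w, b)

With this sketch, w, b = train_analytical(X_tr, y_tr) gives the analytic parameters, and
fmse(X_tr, y_tr, w, b) and fmse(X_te, y_te, w, b) are the two costs to report.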
(b) Gradient descent [20 points]: Pick a random starting value for w ∈ R^2304 and b ∈ R and
a small learning rate (e.g., ε = 0.001). (In my code, I sampled each component of w and b
from a Normal distribution with standard deviation 0.01; use np.random.randn.) Then, using
the expression for the gradient of the cost function, iteratively update w and b to reduce the cost
fMSE(w, b). Stop after conducting T gradient descent iterations (I suggest T = 5000 with a step
size, aka learning rate, of ε = 0.003). After optimizing w and b (using fMSE) only on the training
set, compute and report (in the PDF file) the cost fMSE on the training set Dtr and (separately)
on the testing set Dte.
Note: as mentioned during class, on this particular dataset it would take a very long time for
gradient descent to reach weights as good as the w found by the analytical solution. For T = 5000,
your training cost in part (b) will be higher than in part (a). However, the testing cost should
actually be lower, since the relatively small number of gradient descent steps prevents w from
growing too large and hence acts as an implicit regularizer.
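A minimal sketch of the gradient-descent loop under the same row-per-example convention; the gradient
expressions follow from differentiating fMSE, and the function and variable names are my own:

    import numpy as np

    def train_gd(X, y, T=5000, eps=0.003):
        n, m = X.shape
        # Small random initialization, as suggested in the problem statement.
        w = 0.01 * np.random.randn(m)
        b = 0.01 * np.random.randn()
        for _ in range(T):
            yhat = X.dot(w) + b
            # Gradients of fMSE with respect to w and b.
            grad_w = X.T.dot(yhat - y) / n
            grad_b = np.mean(yhat - y)
            w -= eps * grad_w
            b -= eps * grad_b
        return w, b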
(c) Regularization [15 points]: Same as (b) above, but change the cost function to include a
penalty for ‖w‖^2 growing too large:

    \tilde{f}_{MSE}(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 + \frac{\alpha}{2n} w^\top w

where α ∈ R^+. Set α = 0.1 (this worked well for me) and then optimize f̃MSE w.r.t. w and b.
After optimizing w and b (using f̃MSE), compute and report (in the PDF file) the cost fMSE
(without the L2 term) on the training set Dtr and (separately) on the testing set Dte. Important:
the regularization should be applied only to w, not to b.
Note: as mentioned during class, since part (b) already provides implicit regularization by limiting
the number of gradient descent steps (to T = 5000), you should not expect to see much (or any)
difference between parts (c) and (b) on this dataset. In general, however, the L2 regularization
term can make a big difference.
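For part (c), only the gradient with respect to w changes: the penalty (α/(2n)) w^⊤w contributes an
extra (α/n) w term. A minimal sketch, again assuming one example per row and using placeholder names:

    import numpy as np

    def train_gd_l2(X, y, T=5000, eps=0.003, alpha=0.1):
        n, m = X.shape
        w = 0.01 * np.random.randn(m)
        b = 0.01 * np.random.randn()
        for _ in range(T):
            yhat = X.dot(w) + b
            # The L2 penalty (alpha/(2n)) * w'w adds (alpha/n) * w to the
            # gradient for w; the bias b is deliberately not regularized.
            grad_w = X.T.dot(yhat - y) / n + (alpha / n) * w
            grad_b = np.mean(yhat - y)
            w -= eps * grad_w
            b -= eps * grad_b
        return w, b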
(d) Visualizing the machine’s behavior [10 points]: After training the regressors in parts (a),
(b), and (c), create a 48 × 48 image representing the learned weights w (without the b term)
from each of the different training methods. Use plt.imshow(). How do the weight vectors from
the different methods differ? Next, using the regressor from part (c), predict the ages of all the
images in the test set and report the RMSE (in years). Then, show the top 5 most egregious
errors, i.e., the test images whose ground-truth label y is farthest from your machine’s estimate
ŷ. Include the images, along with the associated y and ŷ values, in the PDF.
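A minimal sketch of the visualization and error analysis, assuming w_a, w_b, w_c and b_c are the
parameters trained in parts (a)–(c) and X_te, y_te are the flattened test data (all of these names
are placeholders for whatever your code produces):

    import numpy as np
    import matplotlib.pyplot as plt

    # Visualize the learned weights from parts (a), (b), and (c) as 48x48 images.
    for name, w in [("analytical", w_a), ("gradient descent", w_b), ("L2-regularized", w_c)]:
        plt.figure()
        plt.imshow(w.reshape(48, 48))
        plt.title("Weights: " + name)
    plt.show()

    # RMSE (in years) of the part (c) regressor on the test set.
    yhat_te = X_te.dot(w_c) + b_c
    rmse = np.sqrt(np.mean((yhat_te - y_te) ** 2))
    print("Test RMSE (years):", rmse)

    # Top 5 most egregious errors: test images whose label is farthest from the estimate.
    worst = np.argsort(np.abs(yhat_te - y_te))[-5:][::-1]
    for i in worst:
        plt.figure()
        plt.imshow(X_te[i].reshape(48, 48), cmap="gray")
        plt.title("y = %.1f, yhat = %.1f" % (y_te[i], yhat_te[i]))
    plt.show()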
2. Polynomial regression [10 points]: Given a dataset Dtr = {(x^(i), y^(i))}_{i=1}^{n}, where each x^(i) ∈ R
and each y^(i) ∈ R, and given a non-negative integer d, train a polynomial regression model of degree d.
Specifically, return the weight vector w = [w_0, w_1, . . . , w_d]^⊤ ∈ R^{d+1} that minimizes the MSE for a
machine whose output is ŷ = \sum_{j=0}^{d} x^j w_j. Write your implementation in a function
trainPolynomialRegressor.
Note that the regression model you are building here works only for scalar inputs; it does not apply
to the age estimation task from problem 1.
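One possible implementation sketch, using a Vandermonde design matrix and the normal equations (the
choice of np.vander and np.linalg.solve here is mine, not prescribed by the assignment):

    import numpy as np

    def trainPolynomialRegressor(x, y, d):
        # Design matrix whose j-th column is x**j for j = 0..d (a Vandermonde
        # matrix), so that yhat = X.dot(w) with w = [w0, w1, ..., wd].
        X = np.vander(x, N=d + 1, increasing=True)
        # Minimize the MSE by solving the normal equations.
        w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
        return w

If X^⊤X turns out to be ill-conditioned for large d, np.linalg.lstsq(X, y) is a more robust alternative
to solving the normal equations directly.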
Submission: Put your Python code in a Python file called homework2_WPIUSERNAME.py
and put the reported accuracy information and error analysis in homework2_errors_WPIUSERNAME.pdf.