ECE760 HOMEWORK 4


1 Best Prediction

1.1 Under 0-1 Loss (10 pts)
Suppose the world generates a single observation x ∼ multinomial(θ), where the parameter vector θ = (θ1, . . . , θk) with θi ≥ 0 and $\sum_{i=1}^{k} \theta_i = 1$.

Note x ∈ {1, . . . , k}. You know θ and want to predict x. Call your prediction x̂.
What is your expected 0-1 loss
$$\mathbb{E}\bigl[\mathbf{1}_{\hat{x} \neq x}\bigr]$$
using the following two prediction strategies respectively? Prove your answer.

1. Strategy 1: $\hat{x} \in \arg\max_{x} \theta_x$, the outcome with the highest probability.

2. Strategy 2: You mimic the world by generating a prediction x̂ ∼ multinomial(θ). (Hint: your randomness and the world’s randomness are independent.)
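
If it helps, a minimal Monte Carlo sanity check of the two strategies can be written as below (the specific θ, seed, and variable names are illustrative only; this does not replace the proof you are asked to give):

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])                 # example parameter vector
n = 200_000

x = rng.choice(len(theta), size=n, p=theta)       # the world's draws
xhat1 = np.full(n, theta.argmax())                # Strategy 1: always the most probable outcome
xhat2 = rng.choice(len(theta), size=n, p=theta)   # Strategy 2: mimic the world independently

print("Strategy 1 empirical 0-1 loss:", np.mean(xhat1 != x))   # approaches 1 - max_i theta_i
print("Strategy 2 empirical 0-1 loss:", np.mean(xhat2 != x))   # approaches 1 - sum_i theta_i^2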

1.2 Under Different Misclassification Losses (6 pts)
Like in the previous question, the world generates a single observation x ∼ multinomial(θ). Let $c_{ij} \ge 0$ denote the loss you incur if x = i but you predict x̂ = j, for i, j ∈ {1, . . . , k}. $c_{ii} = 0$ for all i. This is a way to generalize different costs of false positives vs. false negatives from binary classification to multi-class classification.

You want to minimize your expected loss
$$\mathbb{E}\bigl[c_{x\hat{x}}\bigr].$$
Derive your optimal prediction x̂.
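
For concreteness, the expected loss of a fixed prediction x̂ = j can be expanded from the definitions above (a notational sketch only; identifying the optimal j is the derivation you are asked to give):

$$\mathbb{E}\bigl[c_{x\hat{x}}\bigr] = \sum_{i=1}^{k} \Pr(x = i)\, c_{ij} = \sum_{i=1}^{k} \theta_i\, c_{ij}.$$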

2 Language Identification with Naive Bayes (8 pts each)

Implement a character-based Naive Bayes classifier that classifies a document as English, Spanish, or Japanese –
all written with 26 lower-case characters and space.

The dataset is languageID.tgz; unpack it. It consists of 60 documents in English, Spanish, and Japanese. The correct class label is the first character of the filename: y ∈ {e, j, s}. (Note: each file is a document in the corresponding language and is regarded as one data point.)

We will be using a character-based multinomial Naïve Bayes model. You need to view each document as a bag of characters, including space. We have made sure that there are only 27 different types of printable characters (a to z, and space); there may be additional control characters such as newline, which you should ignore. Your vocabulary will be these 27 character types. (Note: not word types!)
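
A minimal sketch of this bag-of-characters representation is shown below (the function and variable names are illustrative, not part of the assignment):

import string
from pathlib import Path

VOCAB = list(string.ascii_lowercase) + [" "]      # the 27 character types: a-z and space
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def char_counts(path):
    """Return a length-27 count vector for one document, ignoring any
    character outside the vocabulary (e.g. newline and other control characters)."""
    counts = [0] * len(VOCAB)
    for ch in Path(path).read_text():
        if ch in CHAR_TO_IDX:
            counts[CHAR_TO_IDX[ch]] += 1
    return counts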

In the following questions, you may use the additive smoothing technique to smooth categorical data, in case the
estimated probability is zero.

Given N data samples $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where $x^{(i)} = [x^{(i)}_1, \ldots, x^{(i)}_j, \ldots, x^{(i)}_{M_i}]$ is a bag of characters, $M_i$ is the total number of characters in $x^{(i)}$, $x^{(i)}_j \in S$, $y^{(i)} \in L$, and we have $|S| = K_S$, $|L| = K_L$. Here S is the set of all character types, and L is the set of all classes of data labels. Then by additive smoothing with parameter α, we can estimate the conditional probability as

$$P_\alpha(a_s \mid y = c_k) = \frac{\left(\sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}[x^{(i)}_j = a_s,\ y^{(i)} = c_k]\right) + \alpha}{\left(\sum_{b_s \in S} \sum_{i=1}^{N} \sum_{j=1}^{M_i} \mathbf{1}[x^{(i)}_j = b_s,\ y^{(i)} = c_k]\right) + K_S\,\alpha},$$

where $a_s \in S$, $c_k \in L$. Similarly, we can estimate the prior probability

$$P_\alpha(Y = c_k) = \frac{\left(\sum_{i=1}^{N} \mathbf{1}[y^{(i)} = c_k]\right) + \alpha}{N + K_L\,\alpha},$$

where $c_k \in L$ and N is the number of training samples.
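
One way the two smoothed estimators above could be coded is sketched here (names and data layout are illustrative; count_vectors is an N × K_S array of per-document character counts and labels is a length-N list of class labels):

import numpy as np

def smoothed_prior(labels, classes, alpha=0.5):
    """P_alpha(Y = c_k) = (#{i : y^(i) = c_k} + alpha) / (N + K_L * alpha)."""
    N, K_L = len(labels), len(classes)
    return np.array([(sum(y == c for y in labels) + alpha) / (N + K_L * alpha)
                     for c in classes])

def smoothed_conditionals(count_vectors, labels, classes, alpha=0.5):
    """theta[k, s] = P_alpha(a_s | y = c_k), one row per class."""
    X = np.asarray(count_vectors, dtype=float)            # shape (N, K_S)
    K_S = X.shape[1]
    theta = np.zeros((len(classes), K_S))
    for k, c in enumerate(classes):
        class_counts = X[np.asarray(labels) == c].sum(axis=0)
        theta[k] = (class_counts + alpha) / (class_counts.sum() + K_S * alpha)
    return theta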

1. Use files 0.txt to 9.txt in each language as the training data. Estimate the prior probabilities p̂(y = e), p̂(y = j), p̂(y = s) using additive smoothing with parameter 1/2. Give the formula for additive smoothing with parameter 1/2 in this case. Print the prior probabilities.

2. Using the same training data, estimate the class conditional probability (multinomial parameter) for English
$$\theta_{i,e} := \hat{p}(c_i \mid y = e),$$
where $c_i$ is the i-th character; that is, c1 = a, . . . , c26 = z, c27 = space. Again, use additive smoothing with parameter 1/2. Give the formula for additive smoothing with parameter 1/2 in this case. Print θe, which is a vector with 27 elements.

3. Print θj, θs, the class conditional probabilities for Japanese and Spanish.

4. Treat e10.txt as a test document x. Represent x as a bag-of-words count vector (Hint: the vocabulary has
size 27). Print the bag-of-words vector x.

5. For the x of e10.txt, compute p̂(x | y) for y = e, j, s under the multinomial model assumption, respectively. Use the formula
$$\hat{p}(x \mid y) = \prod_{i=1}^{d} (\theta_{i,y})^{x_i},$$
where x = (x1, . . . , xd). Show the three values: p̂(x | y = e), p̂(x | y = j), p̂(x | y = s).

Hint: you may notice that we omitted the multinomial coefficient. This is OK for classification because it is a constant w.r.t. y. Also, store all probabilities here and below in log form internally to avoid underflow. This also means you need to do arithmetic in log space.
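
A log-space version of this computation can be as small as the following sketch (theta_y is the length-27 class conditional vector for one language; names are illustrative):

import numpy as np

def log_likelihood(x_counts, theta_y):
    """log p-hat(x | y) = sum_i x_i * log(theta_{i,y}); the multinomial
    coefficient is dropped since it is constant w.r.t. y."""
    return float(np.dot(np.asarray(x_counts, dtype=float), np.log(theta_y)))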

6. For the x of e10.txt, use Bayes' rule and your estimated prior and likelihood to compute the posterior p̂(y | x). Show the three values: p̂(y = e | x), p̂(y = j | x), p̂(y = s | x). Show the predicted class label of x.
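
Since everything is kept in log space, one way to normalize the posterior without underflow is a max shift (log-sum-exp) as sketched below (array names are illustrative; log_priors and log_likes each hold three entries, one per language):

import numpy as np

def posteriors(log_priors, log_likes):
    """Return normalized p-hat(y | x) from log p-hat(y) + log p-hat(x | y)."""
    scores = np.asarray(log_priors) + np.asarray(log_likes)
    scores -= scores.max()                  # shift so the largest score is 0
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()

The predicted label is the argmax of the (log) scores, which is unchanged by the normalization.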

7. Evaluate the performance of your classifier on the test set (files 10.txt to 19.txt in three languages). Present the performance using a confusion matrix. A confusion matrix summarizes the types of errors your classifier makes, as shown in the table below. The columns are the true language a document is in, and the rows are the classified outcome of that document. The cells are the number of test documents in that situation. For example, the cell with row = English and column = Spanish contains the number of test documents that are really Spanish but misclassified as English by your classifier.
                     True: English    True: Spanish    True: Japanese
Predicted English
Predicted Spanish
Predicted Japanese
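
A small helper for filling in such a table might look like this (the class ordering and names are illustrative; rows are the predicted language and columns the true language, matching the table above):

import numpy as np

def confusion_matrix(y_true, y_pred, classes=("e", "s", "j")):
    idx = {c: i for i, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[idx[p], idx[t]] += 1              # row: predicted, column: true
    return M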

8. Take a test document. Arbitrarily shuffle the order of its characters so that the words (and spaces) are scrambled beyond human recognition. How does this shuffling affect your Naive Bayes classifier’s prediction on
this document? Explain the key mathematical step in the Naive Bayes model that justifies your answer.

3 Simple Feed-Forward Network (20 pts)

In this exercise, you will derive and implement back-propagation for a simple neural network and compare your output with a standard library's output. Consider the following 3-layer neural network.

$$\hat{y} = f(x) = g\bigl(W_3\,\sigma(W_2\,\sigma(W_1 x))\bigr)$$

Suppose $x \in \mathbb{R}^{d}$, $W_1 \in \mathbb{R}^{d_1 \times d}$, $W_2 \in \mathbb{R}^{d_2 \times d_1}$, $W_3 \in \mathbb{R}^{k \times d_2}$, i.e. $f : \mathbb{R}^{d} \to \mathbb{R}^{k}$. Let $\sigma(z) = [\sigma(z_1), \ldots, \sigma(z_n)]$ for any $z \in \mathbb{R}^{n}$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) activation function, and $g(z_i) = \frac{\exp(z_i)}{\sum_{i=1}^{k} \exp(z_i)}$ is the softmax function.

Suppose the true pair is (x, y), where $y \in \{0, 1\}^{k}$ with exactly one of the entries equal to 1, and you are working with the cross-entropy loss function given below:

$$L(x, y) = -\sum_{i=1}^{k} y_i \log(\hat{y}_i)$$
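
A minimal NumPy sketch of the forward pass defined above (not the required back-propagation implementation; x is a length-d vector and W1, W2, W3 have the shapes given in the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                         # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, W1, W2, W3):
    a1 = sigmoid(W1 @ x)                    # shape (d1,)
    a2 = sigmoid(W2 @ a1)                   # shape (d2,)
    y_hat = softmax(W3 @ a2)                # shape (k,)
    return y_hat, a1, a2                    # the intermediates are reused in backprop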

1. Derive backpropagation updates for the above neural network (one common form of the gradients is sketched after this list). (5 pts)

2. Implement it in NumPy or PyTorch using basic linear algebra operations. (E.g., you are not allowed to use auto-grad, built-in optimizers, models, etc. in this step. You can use library functions for data loading, processing, etc.) Evaluate your implementation on the MNIST dataset, and report test errors and the learning curve. (10 pts)

3. Implement the same network in PyTorch (or any other framework). You can use all the features of the framework, e.g. auto-grad, etc. Evaluate it on the MNIST dataset, and report test errors and the learning curve. (2 pts)

4. Try different weight initializations: a) all weights initialized to 0, and b) weights initialized randomly between -1 and 1. Report test errors and learning curves for both. (You can use either of the implementations.) (3 pts)
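
For item 1, one common form of the gradients for this architecture is sketched below as a reference point, writing $a_1 = \sigma(W_1 x)$, $a_2 = \sigma(W_2 a_1)$, $\hat{y} = g(W_3 a_2)$; it is not a substitute for the derivation you are asked to write out:

$$\delta_3 = \hat{y} - y, \qquad \frac{\partial L}{\partial W_3} = \delta_3\, a_2^{\top},$$
$$\delta_2 = (W_3^{\top}\delta_3) \odot a_2 \odot (1 - a_2), \qquad \frac{\partial L}{\partial W_2} = \delta_2\, a_1^{\top},$$
$$\delta_1 = (W_2^{\top}\delta_2) \odot a_1 \odot (1 - a_1), \qquad \frac{\partial L}{\partial W_1} = \delta_1\, x^{\top}.$$

Each SGD step then updates $W_\ell \leftarrow W_\ell - \eta\, \partial L / \partial W_\ell$ for a learning rate η.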

You should play with different hyperparameters like the learning rate, batch size, etc. for your own learning. You only need to report results for one particular setting of hyperparameters; mention the values you used along with the results. Use d1 = 300, d2 = 200, d3 = 100. For optimization, use SGD (stochastic gradient descent) without momentum, with some batch size, say 32, 64, etc. MNIST can be obtained from https://pytorch.org/vision/stable/datasets.html.
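
A minimal PyTorch sketch of the data loading and plain-SGD setup described above, assuming hidden widths d1 = 300 and d2 = 200 with a 10-class MNIST output; the learning rate, batch size, and epoch count are illustrative choices, not required values:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = nn.Sequential(                      # 784 -> 300 -> 200 -> 10
    nn.Flatten(),
    nn.Linear(28 * 28, 300), nn.Sigmoid(),
    nn.Linear(300, 200), nn.Sigmoid(),
    nn.Linear(200, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # plain SGD, no momentum
loss_fn = nn.CrossEntropyLoss()                           # softmax + cross-entropy combined

for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()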