Description
1. (30 points) Decision Trees (and Random Forests):
For this part of the assignment you will work on detecting forest covertype using CoverType dataset. You
can read about the dataset in detail from this link but work only on the data provided in the download link
on the course page. You have been provided with a pre-defined set of test, train and validation of dataset
to work with (available for download from the course website). In this dataset, the first row represents
the labels of each column such that the last column represents the cover type and following that, each row
represents an example. You have to implement the decision tree algorithm for predicting the cover type.
You will also experiment with Random Forests in the last part of this problem.
(a) (10 points) Decision Tree Construction Construct a decision tree using the given data to predict
which files are infected. You should use mutual information as the criteria for selecting the attribute
to split on. At each node, you should select the attribute which results in maximum decrease in
the entropy of the class variable (i.e. has the highest mutual information with respect to the class
variable). This problem has some attributes as integers and some as boolean. For handling continuous
attributes, you should use the following procedure. At any given internal node of the tree, a numerical
attribute is considered for a two way split by calculating the median attribute value from the data
instances coming to that node, and then computing the information gain if the data was split based
on whether the numerical value of the attribute is greater than the median or not. For example, if
1
you have have 10 instances coming to a node, with values of an attribute being (0,0,0,1,1,2,2,2,3,4) in
the 10 instances, then we will split on value 1 of the attribute (median). Note that in this setting, a
numerical attribute can be considered for splitting multiple number of times. At any step, choose the
attribute which results in highest mutual information by splitting on its median value as described
above. Plot the train, validation and test set accuracies against the number of nodes in the tree as you
grow the tree. On X-axis you should plot the number of nodes in the tree and Y-axis should represent
the accuracy. Comment on your observations.
(b) (6 points) Decision Tree Post Pruning One of the ways to reduce overfitting in decision trees
is to grow the tree fully and then use post-pruning based on a validation set. In post-pruning, we
greedily prune the nodes of the tree (and sub-tree below them) by iteratively picking a node to prune
so that resultant tree gives maximum increase in accuracy on the validation set. In other words, among
all the nodes in the tree, we prune the node such that pruning it (and sub-tree below it) results in
maximum increase in accuracy over the validation set. This is repeated until any further pruning leads
to decrease in accuracy over the validation set. Read the following notes on pruning decision trees to
avoid overfitting (also available from the course website). Post prune the tree obtained in step (a)
above using the validation set. Again plot the training, validation and test set accuracies against the
number of nodes in the tree as you successively prune the tree. Comment on your findings.
(c) (10 points) Random Forests: As discussed in class, Random Forests are extensions of decision
trees, where we grow multiple decision trees in parallel on bootstrapped samples constructed from the
original training data. A number of libraries are available for learning Random Forests over a given
training data. In this particular question you will use the scikit-learn library of Python to grow a
Random Forest. Click here to read the documentation and the details of various parameter options.
Try growing different forests by playing around with various parameter values. Especially, you should
experiment with the following parameter values (in the given range): (a) n estimators (50 to 450 in
range of 100). (b) max features (0.1 to 1.0 in range of 0.2) (c) min samples split (2 to 10 in range of
2). You are free to try out non-default settings of other parameters too. Use the out-of-bag accuracy
(as explained in the class) to tune to the optimal values for these parameters. You should perform
a grid search over the space of parameters (read the description at the link provided for performing
grid search). Report training, out-of-bag, validation and test set accuracies for the optimal set of
parameters obtained. How do your numbers, i.e., train, validation and test set accuracies compare
with those you obtained in part (b) above (obtained after pruning)?
(d) (4 points) Random Forests – Parameter Sensitivity Analysis: Once you obtain the optimal
set of parameters for Random Forests (part (c) above), vary one of the parameters (in a range) while
fixing others to their optimum. Plot the validation and test Repeat this for each of the parameters
considered above. What do you observe? How sensitive is the model to the value of each parameter?
Comment.
2. (30 points) Neural Networks :
In this problem, you will work with the Kannada Digit Classification Dataset available for download from
the course website. You can read about the dataset here. This is an image dataset consisting of grayscale
pixel values of 28×28 size images of 10 categories representing 0-9 digits in Kannada language. The Dataset
contains a train and a test set consisting of 60000 and 10000 examples, respectively. Four files representing
(x, y) pairs for train and test are given in npy format. You can load them using numpy’s load method.
Each row in x train and x test contains 784 columns representing a flattened image of size 28 × 28 The
i
th row in y train gives the digit number for the i
th example row of x train (similarly for x test and y test
files). In this problem, we will use what is referred to a one-hot encoding of the output labels. Given a total
of r classes, the output is represented an r sized binary vector (i.e., y ∈ Rr×1
), such that each component
represents a Boolean variable, i.e., yl ∈ {0, 1}, ∀l, 1 ≤ l ≤ r). In other words, each y vector will have exactly
one entry as being 1 which corresponds to the actual class label and all others will be 0. This is one of
the standard ways to represent discrete data in vector form 1
. Corresponding to each output label yl
, our
network will produce (independently) an output label ol where ol ∈ [0, 1].
(a) (10 points) Write a program to implement a generic neural network architecture to learn a model for
multi-class classification using one-hot encoding as described above. You will implement the backpropagation algorithm (from first principles) to train your network. You should use mini-batch Stochastic
Gradient Descent (mini-batch SGD) algorithm to train your network. Use the Mean Squared Error
1Feel free to read about the one-hot encoding of discrete variables online
2
(MSE) over each mini-batch as your loss function. Given a total of m examples, and M samples in
each batch, the loss corresponding to batch #b can be described as:
J
b
(θ) = 1
2M
X
bM
i=(b−1)M
Xr
l=1
(y
(i)
l − o
(i)
l
)
2
(1)
Here each y
(i)
is represented using one-hot encoding as described above. You will use the sigmoid as
activation function for the units in output layer as well as in the hidden layer (we will experiment with
other activation units in one of the parts below). Your implementation (including back-propagation)
MUST be from first principles and not using any pre-existing library in Python for the same. It should
be generic enough to create an architecture based on the following input parameters:
• Mini-Batch Size (M)
• Number of features/attributes (n)
• Hidden layer architecture: List of numbers denoting the number of perceptrons in the corresponding hidden layer. Eg. a list [100 50] specifies two hidden layers; first one with 100 units and second
one with 50 units.
• Number of target classes (r)
Assume a fully connected architecture i.e., each unit in a hidden layer is connected to every unit in
the next layer.
(b) (5 points) Use the above implementation to experiment with a neural network having a single hidden
layer. Vary the number of hidden layer units from the set {1, 10, 50, 100, 500}. Set the learning rate to
0.001. Use a mini-batch size of 100 examples. This will remain constant for the remaining experiments
in the parts below. Choose a suitable stopping criterion and report it. Report and plot the accuracy
on the training and the test sets, time taken to train the network. Plot the metric on the Y axis
against the number of hidden layer units on the X axis. What do you observe? How do the above
metrics change with the number of hidden layer units? NOTE: For accuracy computation, the inferred
class label is simply the label having the highest probability as output by the network.
(c) (5 points) Use an adaptive learning rate inversely proportional to number of epochs i.e. ηe = √
η0
e
where η0 = 0.5 is the seed value and e is the current epoch number2
. See if you need to change your
stopping criteria. Report your stopping criterion. As before, plot the train/test set accuracy, as well
as training time, for each of the number of hidden layers as used in 2b above using this new adaptive
learning rate. How do your results compare with those obtained in the part above? Does the adaptive
learning rate make training any faster? Comment on your observations.
(d) (5 points) Several activation units other than sigmoid have been proposed in the literature such as
tanh, and RelU to introduce non linearity into the network. ReLU is defined using the function: g(z)
= max(0, z). In this part, we will replace the sigmoid activation units by the ReLU for all the units
in the hidden layers of the network (the activation for units in the output layer will still be sigmoid
to make sure the output is in the range (0, 1)). You can read about relative advantage of using the
ReLU over several other activation units on this blog.
Change your code to work with the ReLU activation unit. Note that there is a small issue with
ReLU that it non-differentiable at z = 0. This can be resolved by making the use of sub-gradients –
intuitively, sub-gradient allows us to make use of any value between the left and right derivative at the
point of non-differentiability to perform gradient descent (see this Wikipedia page for more details).
Implement a network with 2 hidden layers with 100 units each. Experiment with both ReLU and
sigmoid activation units as described above. Use the adaptive learning rate as described in part2c
above. Report your training and test set accuracies in each case. Also, make a relative comparison of
test set accuracies obtained using the two kinds of units. What do you observe? Which ones performs
better? Also, how do these results compare with results in part 2b using a single hidden layer with
sigmoid. Comment on your observations.
(e) (5 points) Use MLPClassifier from scikit-learn library to implement a neural network with the same
architecture as in Part 2d above, and same kind of activation functions (ReLU for hidden layers, sigmoid for output). Use Stochastic Gradient Descent as the solver. Note that MLPClassifier
only allows for Cross Entropy Loss (log-loss function) over the final network output. Read about
2One epoch corresponds to one complete pass through the data
3
the binary cross entropy loss . Compare the performance with the results of Part 2d. How does training using existing library (and modified loss function) compare with your results in Part 2d above.
(f) Extra fun – No credits! Kannada digit dataset was motivated by another famous digits dataset
known as MNIST (available for download from course web-page). You can read about it here. In this
part, we want you to try out doing classification on MNIST using the same architecture that worked
in recognizing Kannada digits and see how well it performs. How robust is the architecture to a new
dataset? Comment on your observations and report you best model.
4