Description
1 Logistic Regression / LDA
For this section we will use a beginner-friendly dataset to test a simple binary classification task: predicting whether a person
will be interested in learning a new sport based on just two attributes, namely age and interest quotient. The dataset can be
downloaded from Kaggle. With respect to this dataset, implement/report the following:
1. Plot the dataset using different colors for the two classes. [5 Marks]
2. Implement the least squares method for classification and plot the decision boundary. Clearly describe your results. Is the
decision boundary able to classify the points correctly? [15 Marks]
3. Implement logistic regression using the gradient descent method. Choose the initial values of w in the range [−0.1, 0.1]. Plot
a 3D figure depicting the sigmoid function obtained, using the same color coding for the points. Did the performance
improve compared to the previous question? [15 Marks]
4. Plot the decision boundary obtained for logistic regression. [5 Marks]
5. Find the linear discriminant boundary and describe your results. [10 Marks]
6. Logistic regression considers only linear decision boundaries. One way to go from linear decision boundaries to non-linear
decision boundaries is to consider a polynomial curve of higher degree. For example, if the input attributes are $x_1, x_2$, then
transforming them into a degree-2 polynomial gives the features $\{x_1, x_2, x_1^2, x_2^2, x_1 x_2, 1\}$. Identify an appropriate
degree of the transformation that results in the optimal performance via logistic regression. Clearly explain your choice. [10 Marks]
7. The above expansion will result in a non-linear decision boundary. Plot the boundary along with the dataset points. [5 Marks]
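As a rough illustration of parts 3 and 6, the sketch below expands two attributes into polynomial features and fits logistic regression by batch gradient descent, with w initialised uniformly in [−0.1, 0.1]. It is only one possible approach under stated assumptions: the data is a synthetic stand-in (not the Kaggle dataset), and the function names, learning rate and iteration count are illustrative choices rather than values prescribed by the assignment.

```python
import numpy as np

def polynomial_features(X, degree):
    """Expand two-column X into all monomials x1^i * x2^j with i + j <= degree."""
    x1, x2 = X[:, 0], X[:, 1]
    cols = [x1**i * x2**j
            for i in range(degree + 1)
            for j in range(degree + 1 - i)]
    return np.column_stack(cols)  # includes the constant column (i = j = 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(Phi, y, lr=0.5, n_iters=5000, seed=0):
    """Batch gradient descent on the mean cross-entropy loss.
    lr and n_iters are assumed values, not taken from the assignment."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.1, 0.1, size=Phi.shape[1])  # initial w in [-0.1, 0.1]
    for _ in range(n_iters):
        p = sigmoid(Phi @ w)
        grad = Phi.T @ (p - y) / len(y)  # gradient of the mean cross-entropy
        w -= lr * grad
    return w

def predict(Phi, w):
    """Label 1 where the estimated probability is at least 0.5."""
    return (sigmoid(Phi @ w) >= 0.5).astype(int)

# Synthetic stand-in data with a circular class boundary (an assumption,
# used only so the sketch runs without the real dataset).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)

Phi = polynomial_features(X, degree=2)
w = fit_logistic(Phi, y)
accuracy = np.mean(predict(Phi, w) == y)
print(f"degree-2 training accuracy: {accuracy:.3f}")
```

Repeating the fit for several degrees and comparing accuracy on held-out points, rather than on the training points used here, is one way to justify the choice of degree. On the real data, attributes such as age would likely need scaling before gradient descent behaves well.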
2 PCA / Decision Trees / Random Forests
The dataset we used in the previous section had just two numeric attributes. In this section we will look at a slightly more
sophisticated dataset with a mix of numeric and categorical attributes describing an adult. The dataset can be downloaded from
the UCI Machine Learning Repository. The task is to predict whether the person described by the given set of attributes earns
more than $50K per year or not (a binary classification task). Implement the following and state your results with respect to this dataset.
1. Implement the decision tree algorithm to classify whether the income of a particular user exceeds $50K per year or not.
Divide the data into two sets: Training set (80%) and validation set (20%). Plot the training error and validation error
against the number of nodes present in the decision tree. Describe the optimal decision tree in your video. [15 Marks]
2. Create 10 datasets using the bootstrap technique and rerun part 1 to find the optimal decision tree for each of these
datasets. Report the final error by averaging over these decision trees and report your findings. Did the performance
improve? [10 Marks]
3. Implement PCA to find the optimal number of features. Plot the error of the optimal decision tree against the number of features.
How many features did it require to match the performance of the tree obtained in the first part? [10 Marks]
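The data-handling steps in the three parts above (the 80/20 split, the 10 bootstrap datasets and the PCA projection) could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the matrix X is a synthetic numeric stand-in, the categorical attributes of the real dataset are assumed to have been encoded numerically beforehand, all function names are hypothetical, and the decision tree itself is not included.

```python
import numpy as np

def train_val_split(n, val_frac=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(round(val_frac * n))
    return idx[n_val:], idx[:n_val]

def bootstrap_indices(n, n_sets=10, seed=0):
    """Draw n_sets index arrays of length n, each sampled with replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n, size=n) for _ in range(n_sets)]

def pca_fit(X):
    """Return the mean, the principal directions (rows of Vt) and the
    fraction of variance explained by each direction."""
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return mean, Vt, explained

def pca_transform(X, mean, Vt, k):
    """Project X onto the first k principal directions."""
    return (X - mean) @ Vt[:k].T

# Synthetic correlated numeric data (an assumption, standing in for the
# encoded attributes of the real dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))

train_idx, val_idx = train_val_split(len(X))
mean, Vt, explained = pca_fit(X[train_idx])       # fit on training rows only
Z_val = pca_transform(X[val_idx], mean, Vt, k=3)  # apply to validation rows
boots = bootstrap_indices(len(train_idx))         # index into train_idx
print(np.round(np.cumsum(explained), 3))
```

Fitting the PCA on the training rows only, and then applying the same mean and directions to the validation rows, keeps validation information out of the transformation. Each array in `boots` indexes into `train_idx`, so a bootstrap training set would be `X[train_idx[b]]` for `b` in `boots`.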
Vidhya Kamakshi 2 of 2 Kapil Rana