Category: CS 6140

Description

- Generate data: Simulate a binary classification problem by generating a vector of 100 class labels. Generate a vector of predictor estimates using a random number generator. **(5 Points)**
- Calculate and plot ROC and Precision-Recall curves. **(20 Points)**
- Match your curves against the ones generated with sklearn. **(5 Points)**
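The threshold sweep behind both curves can be sketched as follows. This is a minimal, standard-library-only sketch, not a reference solution: the helper names `sweep`, `roc_points`, `pr_points` and `trapezoid_area` are hypothetical, labels are assumed to be 0/1, and the seed value is arbitrary.

```python
import random

def sweep(labels, scores):
    """Yield cumulative (tp, fp) counts as the threshold drops past each distinct score."""
    order = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    for i, (score, label) in enumerate(order):
        tp += label
        fp += 1 - label
        # Tied scores share one threshold, so emit only after the last item of a tie group.
        if i + 1 == len(order) or order[i + 1][0] != score:
            yield tp, fp

def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    return [(0.0, 0.0)] + [(fp / neg, tp / pos) for tp, fp in sweep(labels, scores)]

def pr_points(labels, scores):
    """Return (recall, precision) points for the same sweep."""
    pos = sum(labels)
    return [(tp / pos, tp / (tp + fp)) for tp, fp in sweep(labels, scores)]

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

random.seed(1)  # arbitrary seed, only for reproducibility
labels = [random.randint(0, 1) for _ in range(100)]
scores = [random.random() for _ in range(100)]
roc = roc_points(labels, scores)
pr = pr_points(labels, scores)
print("ROC AUC:", trapezoid_area(roc))
```

When comparing against `sklearn.metrics.roc_curve` and `sklearn.metrics.precision_recall_curve`, it is probably easier to compare the plotted curves or the area under them than the raw point lists, because `roc_curve` drops some collinear points by default (`drop_intermediate=True`).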

- Load iris data set.

```
from sklearn.ensemble import RandomForestClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
```

Investigate the following parameters of the Random Forest classifier and tune them using Randomized Search and Grid Search.

```
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt')
max_features = ['auto', 'sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000, 10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 8, 11, 14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 6, 8]
```

- Use seed 1 to split the data in an 80-20 train-test configuration. Train a Random Forest classifier with each unique configuration and record train/test accuracy, precision and recall in the results dataframe. This dataframe will have 5 columns (one per tuning parameter), and each row will correspond to one unique configuration (5x5x5x5x5 rows). Analyse the impact of each tuning parameter on predictor performance. **(15 Points)**
- From the results above, find the best estimators, use them for classification once again, and evaluate the performance using 10-fold cross-validation. **(15 Points)**
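Enumerating one row per unique configuration can be sketched with `itertools.product`. This is an illustrative sketch only: `param_grid` and `iter_configs` are hypothetical names, and the two `range` calls simply reproduce the integer values that the `np.linspace` expressions above yield. Note that with the lists exactly as written, the product has 10 x 3 x 10 x 5 x 5 = 7500 combinations, which differs from the 5x5x5x5x5 count quoted in the task.

```python
from itertools import product

# Candidate values copied from the parameter lists in the assignment.
param_grid = {
    "n_estimators": list(range(200, 2001, 200)),   # same values as np.linspace(200, 2000, 10)
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": list(range(10, 1001, 110)),       # same values as np.linspace(10, 1000, 10)
    "min_samples_split": [2, 5, 8, 11, 14],
    "min_samples_leaf": [1, 2, 4, 6, 8],
}

def iter_configs(grid):
    """Yield one dict per unique combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

configs = list(iter_configs(param_grid))
print(len(configs))  # 10 * 3 * 10 * 5 * 5 = 7500
```

Each dict uses the classifier's own parameter names, so it could be passed on as `RandomForestClassifier(**config)`, with the measured metrics appended to the same row of the results dataframe.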

Load iris dataset from sklearn.

```
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
```

- Implement the HAC (hierarchical agglomerative clustering) algorithm. Use the abstract class definition provided below. **(15 Points)**
- Test your code first with univariate data as follows: **(10 Points)**

`x = {'JAN':31.9, 'FEB':32.3, 'MAR':35, 'APR':52, 'MAY':60.8, 'JUN':68.7, 'JUL':73.3, 'AUG':72.1, 'SEP':65.2, 'OCT':54.8, 'NOV':40, 'DEC':38}`
`hac = HAC(param={'dist': 'eucl'})`
`hac.fit(x)`
`for c in hac.dendrogram: print(c)`

Expected output:

`(0, ['JAN', 'FEB'], 0.4)`
`(1, ['JUL', 'AUG'], 1.2)`
`(2, ['NOV', 'DEC'], 2.0)`
`(3, ['APR', 'OCT'], 2.8)`
`(4, ['JAN', 'FEB', 'MAR'], 2.9)`
`(5, ['JUN', 'SEP'], 3.5)`
`(6, ['APR', 'OCT', 'MAY'], 7.4)`

- Fit the HAC model to the iris dataset. Print the hierarchy of clusters creatively. It need not be a dendrogram, but you can use the sklearn implementation for comparison. **(15 Points)**
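For multivariate data such as iris, the pairwise distances that the three `dist` options refer to can be sketched with a single Minkowski helper, where `p=1` gives the Manhattan distance and `p=2` the Euclidean distance. The function names `minkowski` and `pairwise_matrix` are hypothetical, and the nested-list matrix is just one possible representation.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def pairwise_matrix(points, p=2):
    """Return a symmetric n x n list-of-lists of pairwise distances."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distances are symmetric, so fill both halves at once.
            dist[i][j] = dist[j][i] = minkowski(points[i], points[j], p)
    return dist

points = [(0, 0), (3, 4), (6, 8)]
print(pairwise_matrix(points, p=2))  # Euclidean
print(pairwise_matrix(points, p=1))  # Manhattan
```

A plain loop like this is quadratic in the number of points, which is fine for the 150 iris samples; a vectorised version would be the natural next step for larger data.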

```
class HAC:
    def __init__(self, X=None, param=None):
        # X is optional so that HAC(param=...) followed by fit(x) works
        self.X = X
        self.dist = (param or {}).get('dist', 'eucl')

    def __distances__(self, dist='eucl'):
        '''
        Implement __distances__ method to calculate pair-wise distances
        among datapoints in X with respect to distance measures
        - eucl : Euclidean distance
        - manh : Manhattan
        - misk : Minkowski
        '''
        if dist not in ['eucl', 'manh', 'misk']:
            raise Exception('Not a valid dist measure. Choose among eucl, manh, misk')
        self.C = None

    def __merge__(self):
        '''
        Implement __merge__ method to recursively merge the nearest datapoints in X
        using the pair-wise distance matrix C.
        Save the merge results at each iteration/'recursive call'
        in the dendrogram list of clusters.
        '''
        pass  # append each merge record to self.dendrogram here

    def __display__(self):
        '''
        Implement __display__ method to creatively show the contents of dendrogram.
        '''
        pass

    def fit(self, X):
        # X is a dict mapping labels to values, as in the univariate test
        self.X = list(X.values())
        self.labels = list(X.keys())
        self.__distances__(self.dist)
        self.dendrogram = list()
        self.__merge__()
```
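The merge loop that the skeleton describes can be sketched as a standalone function for the univariate case. This is a minimal sketch under a stated assumption: the assignment does not name a linkage criterion, so average linkage is used here. With that choice the first six merges agree with the first six entries of the expected output; later merges may differ depending on the linkage intended. The function name `average_linkage_hac` is hypothetical.

```python
from itertools import combinations

def average_linkage_hac(data):
    """Agglomerative clustering of a {label: value} dict (average linkage is an assumption)."""
    clusters = [[label] for label in data]  # every point starts as its own cluster
    dendrogram = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean absolute difference over all cross-cluster pairs.
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(abs(data[a] - data[b]) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # i < j, so labels keep input order
        del clusters[j]
        dendrogram.append((len(dendrogram), clusters[i], round(d, 1)))
    return dendrogram

x = {'JAN': 31.9, 'FEB': 32.3, 'MAR': 35, 'APR': 52, 'MAY': 60.8, 'JUN': 68.7,
     'JUL': 73.3, 'AUG': 72.1, 'SEP': 65.2, 'OCT': 54.8, 'NOV': 40, 'DEC': 38}
dendrogram = average_linkage_hac(x)
for c in dendrogram:
    print(c)
```

Recomputing every cluster-to-cluster distance on each pass keeps the sketch short; an implementation inside the `HAC` class would more likely update the precomputed distance matrix after each merge.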

