Homework #6, AMS 597

Random Forest with the Spam Data – Classification Task

How can we separate Spam e-mails from non-Spam e-mails? Thanks to advances in text mining, we can now readily generate a set of relevant numerical attributes to help with such classification tasks. The Spambase.data file used for this homework is taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/spambase). Your task is to split the data randomly into training (75%) and testing (25%) sets, build the best random forest for predicting Spam e-mails using the training data, use the out-of-bag (OOB) data to measure its performance, and then use this random forest model to predict whether each e-mail in the testing data is Spam or not. Please use the randomForest function (from the R package of the same name) to build the random forest classifier.

Please review the following website for related methods and concepts:

http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/140-bagging-and-random-forest-essentials/

  1. Please use the random seed 123 to split the cleaned data into a 75% training set and a 25% testing set.
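One way to carry out such a split in base R is sketched below. The data frame `toy` and its columns `x1`, `x2`, and `spam` are made-up stand-ins for the cleaned Spambase data, so only the last four statements reflect the step described above.

```r
# Toy stand-in for the cleaned data (hypothetical columns, simulated values).
set.seed(1)
n <- 200
toy <- data.frame(
  x1   = rnorm(n),
  x2   = rnorm(n),
  spam = factor(sample(c("nonspam", "spam"), n, replace = TRUE))
)

# Fix the seed right before sampling so the split is reproducible.
set.seed(123)

# Draw 75% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Sizes of the two sets.
c(train = nrow(train), test = nrow(test))
```

Calling `set.seed(123)` immediately before `sample()` keeps the split independent of any earlier random draws in the script.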

 

  2. Please first build the random forest to predict Spam e-mail using the training data. Please compute the confusion matrix and report the sensitivity, specificity, and overall accuracy using the out-of-bag (OOB) samples.
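The three requested measures can be read off a 2x2 confusion matrix. The helper below is a minimal sketch: `class_metrics` is a hypothetical name, it assumes rows hold the actual class and columns the predicted class, and it treats "spam" as the positive class. The counts are invented for illustration and are not Spambase results.

```r
# Sensitivity, specificity and accuracy from a 2x2 confusion matrix.
# Assumes rows = actual class, columns = predicted class.
class_metrics <- function(cm, positive = "spam") {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]   # spam correctly flagged
  fn <- cm[positive, negative]   # spam missed
  tn <- cm[negative, negative]   # non-spam correctly passed
  fp <- cm[negative, positive]   # non-spam wrongly flagged
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / sum(cm))
}

# Hypothetical counts, for illustration only.
cm <- matrix(c(90, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("nonspam", "spam"),
                             predicted = c("nonspam", "spam")))

class_metrics(cm)
```

For a fitted `randomForest` classifier, the `confusion` component of the model object holds the OOB confusion matrix in this same orientation, with an extra `class.error` column that would need to be dropped before passing it to a helper like this one.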

 

  3. Next, please use this random forest to predict whether each e-mail in the testing data is Spam or not. Please compute the confusion matrix and report the sensitivity, specificity, and overall accuracy for the testing data.
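The difference between OOB predictions and test-set predictions can be sketched as follows, assuming the `randomForest` package is installed. The simulated data frame `toy` and the setting `ntree = 200` are illustrative assumptions, not part of the assignment.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)

  # Simulated stand-in for the cleaned data (hypothetical columns).
  set.seed(1)
  n <- 300
  toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  toy$spam <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0,
                            "spam", "nonspam"))

  # 75% / 25% split with the seed named in the assignment.
  set.seed(123)
  train_idx <- sample(seq_len(n), size = floor(0.75 * n))
  train <- toy[train_idx, ]
  test  <- toy[-train_idx, ]

  rf <- randomForest(spam ~ ., data = train, ntree = 200)

  # predict() without newdata returns the out-of-bag predictions.
  oob_cm <- table(actual = train$spam, predicted = predict(rf))

  # predict() with newdata scores the held-out test set.
  test_cm <- table(actual = test$spam, predicted = predict(rf, newdata = test))

  print(oob_cm)
  print(test_cm)
}
```

Both tables have actual classes in rows and predicted classes in columns, so the same sensitivity, specificity, and accuracy formulas apply to each.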

 

  4. Please plot the variable importance measures using:

 

     1. MeanDecreaseAccuracy, which is the average decrease in model accuracy in predicting the outcome of the out-of-bag samples when the values of a specific variable are randomly permuted, so that the variable's information is effectively removed from the model.
     2. MeanDecreaseGini, which is the average decrease in node impurity that results from splits over that variable. The Gini impurity index is used only for classification problems.
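A sketch of how both measures might be plotted with the `randomForest` package is given below, assuming the package is installed. The simulated data, the noise column `x3`, and `ntree = 200` are assumptions for illustration. The forest has to be fitted with `importance = TRUE` for MeanDecreaseAccuracy to be computed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)

  # Simulated stand-in data: x1 and x2 carry signal, x3 is pure noise.
  set.seed(1)
  n <- 300
  toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  toy$spam <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0,
                            "spam", "nonspam"))

  # importance = TRUE requests the permutation-based accuracy measure.
  set.seed(123)
  rf <- randomForest(spam ~ ., data = toy, ntree = 200, importance = TRUE)

  # type = 1 plots MeanDecreaseAccuracy, type = 2 plots MeanDecreaseGini.
  varImpPlot(rf, type = 1, main = "MeanDecreaseAccuracy")
  varImpPlot(rf, type = 2, main = "MeanDecreaseGini")

  # The underlying numbers are available through importance().
  print(importance(rf))
}
```

Calling `varImpPlot(rf)` without `type` draws both measures side by side.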

 

  5. Please show the importance of each variable as a percentage, based on MeanDecreaseAccuracy.
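One possible reading of "in percentage" is each variable's share of the total MeanDecreaseAccuracy. The sketch below uses that convention as an assumption, with invented variable names and values. With a fitted forest, the input vector could instead be taken from `importance(rf, type = 1)[, 1]`.

```r
# Hypothetical MeanDecreaseAccuracy values (made-up names and numbers).
mda <- c(var_a = 40, var_b = 30, var_c = 20, var_d = 10)

# Share of the total importance carried by each variable, in percent.
mda_pct <- 100 * mda / sum(mda)

# Largest contributors first, rounded for display.
round(sort(mda_pct, decreasing = TRUE), 1)
```

MeanDecreaseAccuracy can be negative for uninformative variables, in which case this normalization needs care, for example by setting negative values to zero first.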

 

  6. In a regression task using the random forest, suppose we have 26 variables (as predictors) in the original data set. At each node split, how many variables should we (as commonly recommended) select at random to be considered for that split?
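The usual rule of thumb, which is also the default for `mtry` in the `randomForest` package, is p/3 predictors (rounded down) for regression and the square root of p (rounded down) for classification. The arithmetic for p = 26 can be checked as a short sketch:

```r
p <- 26  # number of predictors

# Regression default: a third of the predictors, rounded down, at least 1.
mtry_regression <- max(floor(p / 3), 1)

# Classification default, shown for comparison: square root of p, rounded down.
mtry_classification <- floor(sqrt(p))

c(regression = mtry_regression, classification = mtry_classification)
```

These are starting values only. In practice `mtry` is often tuned, for example by comparing OOB error across several candidate values.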