Description
This assignment will give you hands-on experience in building text classification models, using email spam filtering as the application. The target variable indicates whether an email is spam (1) or non-spam (0). Follow the directions and answer the following questions.
Question 1
Explore different ways to improve the classification performance (accuracy or expected cost). Do the following:
- Feature representation: compare 3 feature representations: binary vs. frequency vs. tf-idf
- Classifier: compare 3 classifiers of your choice such as decision trees, neural nets, etc.
- OPTIONAL (extra credit): Feature selection: compare different feature/attribute selection methods or parameters
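To make the three feature representations concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names (`frequency_vector`, `binary_vector`, `tfidf_vector`) are made up for illustration, and the tf-idf weighting shown (raw count times log(N / df)) is only one common variant. Library implementations often add smoothing or normalisation, so their numbers may differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only (not the assignment data).
docs = [
    "win money now now",
    "meeting schedule now",
    "win win prize",
]

tokenized = [d.split() for d in docs]
# Fixed, sorted vocabulary so every document maps to the same columns.
vocab = sorted({t for doc in tokenized for t in doc})

def frequency_vector(tokens):
    """Term counts: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def binary_vector(tokens):
    """Presence/absence: 1 if the word occurs at least once, else 0."""
    present = set(tokens)
    return [1 if t in present else 0 for t in vocab]

# Document frequency: number of documents containing each word.
n_docs = len(tokenized)
doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tfidf_vector(tokens):
    """Unsmoothed tf-idf: count * log(N / df). An assumed variant."""
    counts = Counter(tokens)
    return [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]

frequency = [frequency_vector(d) for d in tokenized]
binary = [binary_vector(d) for d in tokenized]
tfidf = [tfidf_vector(d) for d in tokenized]

print(vocab)
print(frequency[0], binary[0])
```

Each representation yields one row per email over the same vocabulary, so the same classifier can be trained on each and the results compared directly.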
Evaluate your models using a train/test split and report the following:
- Precision and Recall by Class
- Confusion Matrix
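As a reference for how these quantities relate, the following sketch computes a confusion matrix and per-class precision and recall in plain Python. The label lists are hypothetical, and the layout (rows = actual, columns = predicted) is an assumption of this sketch. Check the convention of whatever tool you use.

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Build a matrix with rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def precision_recall(matrix, class_idx):
    """Precision and recall for one class, given a rows=actual matrix."""
    true_pos = matrix[class_idx][class_idx]
    predicted_total = sum(row[class_idx] for row in matrix)  # column sum
    actual_total = sum(matrix[class_idx])                    # row sum
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / actual_total if actual_total else 0.0
    return precision, recall

# Hypothetical labels: 1 = spam, 0 = non-spam.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
nonspam_precision, nonspam_recall = precision_recall(cm, 0)
spam_precision, spam_recall = precision_recall(cm, 1)

print(cm)
print("non-spam:", nonspam_precision, nonspam_recall)
print("spam:", spam_precision, spam_recall)
```

Precision for a class divides its correct predictions by the column total, while recall divides by the row total, which is why the two can differ sharply between the spam and non-spam classes.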
Question 2
Calculate the total cost and the expected cost (per email) based on the confusion matrix you obtained in Question 1. Assume that each spam email misclassified as non-spam costs 5, and each non-spam email misclassified as spam costs 100.
[Hint: be careful with the dimensions of the confusion matrix: which are the “actuals” and which are the “predictions”?]
Based on your observations, analyze which combination of feature representation and classifier performs best.
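The cost calculation can be sketched as follows. The confusion-matrix counts are hypothetical, and the layout (rows = actual, columns = predicted, class order non-spam then spam) is an assumption. If your tool transposes the matrix, the two error cells swap, which is the pitfall the hint warns about.

```python
# Costs from the assignment text.
COST_SPAM_AS_NONSPAM = 5      # actual spam, predicted non-spam
COST_NONSPAM_AS_SPAM = 100    # actual non-spam, predicted spam

# Hypothetical counts; rows = actual, columns = predicted,
# class order (non-spam, spam). Assumed layout, verify against your tool.
cm = [
    [5, 1],   # actual non-spam: 5 predicted non-spam, 1 predicted spam
    [2, 2],   # actual spam:     2 predicted non-spam, 2 predicted spam
]

nonspam_as_spam = cm[0][1]
spam_as_nonspam = cm[1][0]

total_cost = (spam_as_nonspam * COST_SPAM_AS_NONSPAM
              + nonspam_as_spam * COST_NONSPAM_AS_SPAM)

n_emails = sum(sum(row) for row in cm)
expected_cost = total_cost / n_emails  # average cost per email

print(total_cost, expected_cost)
```

Because the two error types are priced so differently, a model with higher accuracy can still have a higher expected cost if it flags more legitimate emails as spam.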
Question 3 (Extra credit)
Run 10-fold cross-validation instead of a single train/test split. Does your conclusion still hold? If the results differ, analyze the likely cause.
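For orientation, here is a minimal sketch of the mechanics of k-fold cross-validation in plain Python. The helper `k_fold_indices`, the toy labels, and the majority-class "model" are all placeholders invented for this example. In practice you would train your real classifier on each training fold, and most ML libraries provide their own cross-validation utilities.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def majority_class(labels):
    """Placeholder 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

# Hypothetical labels: 14 non-spam (0) and 6 spam (1).
labels = [0] * 14 + [1] * 6

accuracies = []
for train_idx, test_idx in k_fold_indices(len(labels), k=10):
    prediction = majority_class([labels[j] for j in train_idx])
    correct = sum(1 for j in test_idx if labels[j] == prediction)
    accuracies.append(correct / len(test_idx))

mean_accuracy = sum(accuracies) / len(accuracies)
print(len(accuracies), mean_accuracy)
```

Averaging over the ten folds reduces the dependence on one particular split, which is one reason a ranking of models obtained from a single split may change under cross-validation.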