Description
1) Naïve Bayes
With stop words:
Without stop words:
Overall, the accuracy on ham is quite high, but the accuracy on spam is not as good. This means the probability that a ham is classified as spam is lower than the probability that a spam is classified as ham. This is desirable, because the algorithm would almost never junk a ham. The total accuracy is over 90%, which means the algorithm has potential for practical application.
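As a concrete reference for the classifier described above, here is a minimal multinomial Naïve Bayes sketch. The Laplace-smoothing parameter `alpha` and the training interface are illustrative assumptions, not necessarily the exact setup used in the experiments:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Train multinomial Naive Bayes with Laplace smoothing.

    docs: list of token lists; labels: parallel list of class names.
    """
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for cnt in counts.values() for w in cnt}
    model = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        denom = total + alpha * len(vocab)
        model[c] = {
            "prior": math.log(priors[c] / len(labels)),
            "cond": {w: math.log((cnt[w] + alpha) / denom) for w in vocab},
            "unk": math.log(alpha / denom),  # smoothed mass for unseen words
        }
    return model

def classify_nb(model, words):
    """Return the class with the highest log posterior."""
    def score(c):
        m = model[c]
        return m["prior"] + sum(m["cond"].get(w, m["unk"]) for w in words)
    return max(model, key=score)
```

Working in log space avoids underflow when multiplying many small word probabilities.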
As for stop words, removing them decreases the ham accuracy slightly but improves the spam accuracy a lot. This suggests that some discriminative words occur only in spam. Perhaps additional stop words should be added to the default stop words list.
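Stop-word removal as discussed here is just a set-membership filter over the token stream. The word list below is a small illustrative subset, not the full default list used in the experiments:

```python
# Illustrative subset of a stop words list (the real default list is longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop words set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```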
2) Logistic Regression
Hard limit: 200 iterations
With stop words:
Without stop words:
Overall, the accuracy on ham is still higher than the accuracy on spam and the total accuracy. Hence, the conclusion from Naïve Bayes also applies to Logistic Regression: the algorithm prefers to let all hams through.
As for stop words, removing them increases the ham accuracy slightly while the spam accuracy stays the same. This conflicts with the conclusion we drew for Naïve Bayes. I would prefer to train until convergence to see whether the accuracies change, but for now we can conclude that stop words do not noticeably affect Logistic Regression.
Besides, the accuracies of Logistic Regression are lower than those of Naïve Bayes. This might be caused by using a hard iteration limit instead of training to convergence.
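The hard iteration limit mentioned above can be sketched as a batch gradient-ascent loop that simply stops after a fixed number of updates. The learning rate, the update rule (gradient ascent on the L2-regularized log-likelihood), and the toy data in the usage example are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, lam=0.01, lr=0.1, max_iter=200):
    """Batch gradient ascent on the L2-regularized log-likelihood,
    stopped by a hard iteration limit rather than a convergence test.

    X: list of feature vectors (first component can be a bias term of 1.0);
    y: list of 0/1 labels; lam: regularization strength lambda.
    """
    w = [0.0] * len(X[0])
    for _ in range(max_iter):  # hard limit instead of convergence check
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (yi - p) * xj
        # ascend the penalized likelihood: gradient minus lambda * w
        w = [wj + lr * (gj - lam * wj) for wj, gj in zip(w, grad)]
    return w
```

Replacing the `range(max_iter)` loop with a check on the change in `w` (or in the likelihood) would give the run-until-convergence variant suggested above.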
As for lambda, within the given range [0.01, 0.1], the results are nearly unaffected by lambda. This means we can choose lambda within a suitable range without worrying that it will change the results. I also tried some larger and smaller lambdas (with stop words only), and the results changed only insignificantly until lambda = 1.0.
Smaller lambda: [0.0001, 0.001, 0.005]
Bigger lambda: [0.2, 0.5, 1]
3) Others
a) In the training set, 2248.2004-09-23.GP.spam.txt causes a UnicodeDecodeError. Hence, I ignore it when opening the files.
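An alternative to dropping the file entirely would be to drop only the undecodable bytes, using the `errors="ignore"` mode of Python's built-in `open()`; a minimal sketch:

```python
def read_mail(path):
    """Read a mail file, silently skipping bytes that are not valid UTF-8
    (e.g. the training file that raises UnicodeDecodeError)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()
```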
b) When computing the exponential in Logistic Regression, the exponent is often above 700, which can cause an overflow. In this case, we set the probability to 1, since

lim_{z→+∞} e^z / (1 + e^z) = 1
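This clamping can be folded into a numerically stable sigmoid: by branching on the sign of z, exp() is only ever called with a non-positive argument, so a large sum yields a probability of 1 (or 0) instead of an OverflowError. A minimal sketch:

```python
import math

def stable_sigmoid(z):
    """Sigmoid that never calls exp() with a large positive argument.

    math.exp overflows for arguments above ~709; exp of a very negative
    argument merely underflows to 0.0, which is harmless here.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```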