Description
I. Support Vector Machine
Support vector machine (SVM) is a type of supervised learning model in machine
learning. In classification problems, an SVM treats data points as vectors and tries to
separate them with a hyperplane; among all separating hyperplanes, it finds the optimal
one, i.e., the one with the largest margin, which is determined by the support vectors.
In many cases, however, the data points are not linearly separable. One way to solve
this problem is to transform the data points into a higher-dimensional space using a
nonlinear kernel such as the polynomial kernel. Another solution is to apply a soft
margin, which allows some data points to lie inside the region of separation while
minimizing the classification error.
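As a small illustration of the kernel idea (this example is added here for clarity and is
not part of the assignment): in one dimension, the polynomial kernel K(x₁, x₂) =
(x₁x₂ + 1)² equals φ(x₁)ᵀφ(x₂) with φ(x) = (1, √2·x, x²), so a hyperplane in this
three-dimensional feature space corresponds to a quadratic decision boundary in the
original space.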
In SVM, the primal problem of finding the weights and bias is transformed into an
alternative, simpler form called the dual problem, with the help of Lagrange duality.
After obtaining the Lagrange multipliers {α_i}, we can calculate the weights and bias,
determine the support vectors, and obtain the discriminant function.
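For reference, the standard soft-margin dual problem (restated here for completeness;
it is not given explicitly above) is

    max over {α_i}:  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
    subject to:      0 ≤ α_i ≤ C  and  Σ_i α_i y_i = 0,

after which w = Σ_i α_i y_i x_i (for the linear kernel), the bias is
b = y_s − Σ_i α_i y_i K(x_i, x_s) for any support vector x_s, and the discriminant
function is g(x) = Σ_i α_i y_i K(x_i, x) + b.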
In the following experiments, different SVM models are implemented to
discriminate spam emails using spam email data sets.
II. Task 1: Training SVM
In task 1, different SVM models are trained with training data in train.mat.
Implementation details are as follows:
1) Pre-process the data by standardization. Standardize the values of every
feature so that they have zero mean and unit variance. In this case, transform
each row of train_data using the following equation:
    x' = (x − x̄)/σ.
2) After pre-processing the data, compute the linear and polynomial kernels as
instructed. For polynomial-kernel SVM models, after obtaining the kernel
matrix, check whether it satisfies Mercer's condition. Set the tolerance for
negative eigenvalues to −1 × 10⁻⁶.
3) Calculate the Lagrange multipliers {α_i} with different margin settings. Set
the value of C for the hard margin to 10⁶. For each value of C ∈
{10⁶, 0.1, 0.6, 1.1, 2.1}, compute α_i with the quadprog function, using the
kernels that satisfy Mercer's condition.
4) With the obtained α_i, determine the support vectors and obtain the
discriminant function. A data point is taken as a support vector when its α_i
exceeds the threshold 1 × 10⁻⁴. (A sketch of steps 1)–4) follows this list.)
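The following is a minimal MATLAB sketch of steps 1)–4), not the exact assignment
code; it assumes train.mat provides train_data as a d × N matrix (one feature per
row, as implied by the row-wise standardization above) and train_label as an
N × 1 vector of ±1 labels:

    % 1) Standardize each feature (row) to zero mean and unit variance.
    %    Test/evaluation data must reuse this mu and sigma.
    mu    = mean(train_data, 2);
    sigma = std(train_data, 0, 2);
    Xs    = (train_data - mu) ./ sigma;   % implicit expansion (R2016b+)

    % 2) Polynomial kernel and Mercer check on the Gram matrix.
    p = 2;                                % kernel degree
    K = (Xs' * Xs + 1).^p;                % N x N Gram matrix
    if min(eig(K)) < -1e-6                % tolerance for numerical noise
        error('Kernel does not satisfy Mercer''s condition.');
    end

    % 3) Solve the dual QP: min 0.5*a'*H*a - sum(a), s.t. y'*a = 0, 0 <= a <= C.
    C = 0.6;                              % C = 1e6 approximates the hard margin
    y = double(train_label(:));           % labels in {-1, +1}
    N = numel(y);
    H = (y * y') .* K;
    f = -ones(N, 1);
    alpha = quadprog(H, f, [], [], y', 0, zeros(N, 1), C * ones(N, 1));

    % 4) Support vectors, bias, and discriminant function.
    sv = alpha > 1e-4;                    % threshold on alpha
    s  = find(sv & alpha < C - 1e-4, 1);  % a support vector on the margin
    b  = y(s) - (alpha(sv) .* y(sv))' * K(sv, s);
    % Ktest: kernel matrix between all training points and the new points.
    g = @(Ktest) (alpha(sv) .* y(sv))' * Ktest(sv, :) + b;

Predictions on a standardized test matrix Xt then follow as
sign(g((Xs' * Xt + 1).^p)).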
III. Task 2: Classification Results and Analysis
3.1 Experimental results
In task 2, the SVM models obtained in task 1 are tested.
Two of the polynomial kernels did not satisfy Mercer's condition: K(x₁, x₂) =
(x₁ᵀx₂ + 1)⁴ and K(x₁, x₂) = (x₁ᵀx₂ + 1)⁵. For the remaining models, the
classification accuracy is examined on both the training and test data sets. The results
are shown in TABLE I.
Hard margin with linear kernel:
            Training accuracy: 0.9325        Test accuracy: 0.9297

Hard margin with polynomial kernel:
            p=2      p=3      p=4     p=5
    Train   1.0000   1.0000   NA      NA
    Test    0.8535   0.8581   NA      NA

Soft margin with polynomial kernel:
            Training accuracy                       Test accuracy
            C=0.1    C=0.6    C=1.1    C=2.1        C=0.1    C=0.6    C=1.1    C=2.1
    p=1     0.9235   0.9260   0.9270   0.9280       0.9225   0.9245   0.9245   0.9271
    p=2     0.9880   0.9950   0.9950   0.9950       0.8945   0.8874   0.8763   0.8757
    p=3     0.9965   0.9980   0.9985   0.9990       0.8848   0.8802   0.8743   0.8711
    p=4     NA       NA       NA       NA           NA       NA       NA       NA
    p=5     NA       NA       NA       NA           NA       NA       NA       NA

TABLE I: Results of SVM classification
3.2 Analysis
From TABLE I we can see that the hard-margin models with polynomial kernels
p = 2 and p = 3 reach the highest training-set accuracy, while the hard-margin model
with the linear kernel achieves the best performance on the test set.
For better comparison, I plot the accuracy curves in groups. The first comparison
is between models with the same margin setting but different kernels, as shown in the
upper plots of Fig. 1(a) and (b). The other comparison is between models with the same
kernel but different margin settings, shown in the second row of Fig. 1.
Figure 1: Accuracy curves. (a) Training-set accuracy; (b) test-set accuracy.
From the top image in Fig. 1(a) we can conclude that, overall, models with a hard
margin perform better on the training set, while models with a soft margin (C = 0.1)
have the poorest performance. The other three curves (C = 0.6, 1.1, 2.1) are almost
identical. The parameter C controls the cost of violating the constraints: a large value
imposes a heavy penalty on misclassified training data, which leads to a small margin,
while a small value of C leads to a large margin but more misclassified training data.
Thus, when C = 0.1, a small value, the training-set accuracy drops.
However, in the top image of Fig. 1(b), models with a soft margin have higher
accuracy in some cases. This is because models trained with a soft margin, especially
those with small values of C, generalize better to unseen data. If the data are not
linearly separable, which is true in this experiment, a smaller C helps to avoid
overfitting.
In the lower image of Fig. 1(a), it is clear that models with the linear kernel (the
polynomial kernel with p = 1) have the worst performance. This also suggests that
the training data are not linearly separable. The other models perform similarly, with
the p = 3 models having slightly higher training-set accuracy.
On the other hand, as shown in the lower image of Fig. 1(b), models with the linear
kernel outperform the other models overall, which is inconsistent with the discussion
above. In theory, if the data are not linearly separable, a model trained with a linear
kernel cannot perform well on test data. One possible explanation is that the data are
close to linearly separable, with only a few data points violating the discriminant
function.
3.3 Summary
To sum up, to achieve higher accuracy in application, we should adopt a soft
margin with a small value of C. Also, although models with the linear kernel perform
outstandingly on the test set, there is no guarantee that they would perform as well on
new data. Thus, among the above models, the one with a soft margin (C = 0.1) and a
polynomial kernel (p = 2) is the best choice.
IV. Task 3: Design SVM
Besides polynomial kernels, an SVM can also use a Gaussian radial basis function
kernel (i.e., K(x₁, x₂) = exp(−γ‖x₁ − x₂‖²)) or a hyperbolic tangent kernel (i.e.,
K(x₁, x₂) = tanh(β·x₁ᵀx₂ + θ)).
After implementing some examples, I found that hyperbolic tangent kernels do
not satisfy Mercer's condition. Further experiments were therefore conducted with
Gaussian radial basis function kernels.
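As an illustration (a minimal MATLAB sketch, assuming the standardized training
matrix Xs (d × N) from task 1), the RBF Gram matrix and the same eigenvalue-based
Mercer check can be computed as:

    % Gaussian RBF kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    gamma = 10;
    sq = sum(Xs.^2, 1);                   % 1 x N squared norms
    D  = sq' + sq - 2 * (Xs' * Xs);       % pairwise squared distances
    K  = exp(-gamma * D);                 % N x N Gram matrix, PSD in theory

    % A tanh Gram matrix built analogously with tanh(beta*(Xs'*Xs) + theta)
    % can have eigenvalues below the -1e-6 tolerance, which is how it
    % fails the Mercer check here.
    min(eig(K))                           % >= -1e-6 for the RBF kernel

The multipliers α_i are then obtained with quadprog exactly as in task 1,
substituting this Gram matrix.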
To find a proper SVM model, I randomly selected 600 data samples from the given
website (UC Irvine Machine Learning Repository) to form an evaluation data set. The
trained SVM models were applied to the training, test, and evaluation data sets. The
results are shown in TABLE II, and a sketch of the sampling and scoring procedure
follows the table.
            γ = 1                        γ = 10                       γ = 100
         Train    Test    Eval       Train    Test    Eval       Train    Test    Eval
C = 0.1  0.4045   0.3874  0.4067     0.9950   0.8581  0.2900     0.9995   0.8542  0.2900
C = 1    0.4045   0.3874  0.4067     0.9995   0.7083  0.3900     1.0000   0.6126  0
C = 10   0.4045   0.3874  0.4067     1.0000   0.3887  0.4067     1.0000   0.6126  0

TABLE II: Accuracy of models with RBF kernels
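The sampling and scoring sketch mentioned above, assuming the full Spambase feature
matrix X (d × M) and label vector Y (M × 1, in ±1) have been loaded from the UCI
files, and that alpha, y, b, and the discriminant handle g were re-fitted with the RBF
Gram matrix (all variable names here are assumptions):

    % Draw 600 random samples as an evaluation set,
    % standardized with the TRAINING mean and std from task 1.
    M   = size(X, 2);
    idx = randperm(M, 600);
    Xe  = (X(:, idx) - mu) ./ sigma;
    Ye  = Y(idx);

    % RBF kernel between the N training points and the 600 evaluation
    % points, then accuracy of the refitted discriminant function g.
    Ke   = exp(-gamma * (sum(Xs.^2, 1)' + sum(Xe.^2, 1) - 2 * (Xs' * Xe)));
    pred = sign(g(Ke));
    acc  = mean(pred(:) == Ye(:));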
When C is set to less than 1, for example C = 0.1, the training-set accuracy is
very low. When C is given a larger value, for example C = 10, the accuracy on the
test and evaluation sets is poor. Also, when γ is small, the models have low accuracy
on all three data sets, and when γ is large, the accuracy on the evaluation set is
extremely low.
I also experimented with polynomial kernels, but their performance was poor.
Overall, the model with γ = 10 and C = 1 has the most balanced performance
across the three data sets.