Description
I. Support Vector Machine
Support vector machine (SVM) is a type of supervised learning model in machine
learning. In classification problems, an SVM treats data points as vectors and tries to
separate them with a hyperplane; among all separating hyperplanes, it finds the optimal
one, i.e., the one with the largest margin, which is determined by the support vectors.
In many cases, however, the data points are not linearly separable. One way to solve
this problem is to transform the data points into a higher-dimensional space using a
nonlinear kernel such as the polynomial kernel. Another solution is to apply a soft
margin, which allows some data points to lie inside the region of separation while
minimizing the classification error.
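As a small illustration of the kernel idea (this example is added here for clarity and is
not part of the assignment): in one dimension, the polynomial kernel K(x₁, x₂) =
(x₁x₂ + 1)² equals φ(x₁)ᵀφ(x₂) with φ(x) = (1, √2·x, x²), so a hyperplane in this
three-dimensional feature space corresponds to a quadratic decision boundary in the
original space.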
In SVM, the primal problem of finding the weights and bias is transformed into an
alternative, simpler form called the dual problem, with the help of Lagrange duality.
After obtaining the Lagrange multipliers {α_i}, we can calculate the weights and bias,
determine the support vectors, and obtain the discriminant function.
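For reference, the standard soft-margin dual problem (restated here for completeness;
it is not given explicitly above) is

    max over {α_i}:  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
    subject to:      0 ≤ α_i ≤ C  and  Σ_i α_i y_i = 0,

after which w = Σ_i α_i y_i x_i (for the linear kernel), the bias is
b = y_s − Σ_i α_i y_i K(x_i, x_s) for any support vector x_s, and the discriminant
function is g(x) = Σ_i α_i y_i K(x_i, x) + b.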
In the following experiments, different SVM models are implemented to
discriminate spam emails using spam email data sets.
II. Task 1: Training SVM
In task 1, different SVM models are trained with training data in train.mat.
Implementation details are as follows:
1) Pre-process the data by standardization. Standardize the values of every
feature so that they have zero mean and unit variance. In this case, transform
each row of train_data using the following equation:
    x' = (x − x̄)/σ.
2) After pre-processing the data, compute the linear and polynomial kernels as
instructed. For polynomial-kernel SVM models, after obtaining the kernel
matrix, check whether it satisfies Mercer's condition. Set the tolerance for
negative eigenvalues to −1 × 10⁻⁶.
3) Calculate the Lagrange multipliers {α_i} with different margin settings. Set
the value of C for the hard margin to 10⁶. For each value of C ∈
{10⁶, 0.1, 0.6, 1.1, 2.1}, compute α_i with the quadprog function, using the
kernels that satisfy Mercer's condition.
4) With the obtained α_i, determine the support vectors and obtain the
discriminant function. A data point is taken as a support vector when its α_i
exceeds the threshold 1 × 10⁻⁴. (A sketch of steps 1)–4) follows this list.)
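The following is a minimal MATLAB sketch of steps 1)–4), not the exact assignment
code; it assumes train.mat provides train_data as a d × N matrix (one feature per
row, as implied by the row-wise standardization above) and train_label as an
N × 1 vector of ±1 labels:

    % 1) Standardize each feature (row) to zero mean and unit variance.
    %    Test/evaluation data must reuse this mu and sigma.
    mu    = mean(train_data, 2);
    sigma = std(train_data, 0, 2);
    Xs    = (train_data - mu) ./ sigma;   % implicit expansion (R2016b+)

    % 2) Polynomial kernel and Mercer check on the Gram matrix.
    p = 2;                                % kernel degree
    K = (Xs' * Xs + 1).^p;                % N x N Gram matrix
    if min(eig(K)) < -1e-6                % tolerance for numerical noise
        error('Kernel does not satisfy Mercer''s condition.');
    end

    % 3) Solve the dual QP: min 0.5*a'*H*a - sum(a), s.t. y'*a = 0, 0 <= a <= C.
    C = 0.6;                              % C = 1e6 approximates the hard margin
    y = double(train_label(:));           % labels in {-1, +1}
    N = numel(y);
    H = (y * y') .* K;
    f = -ones(N, 1);
    alpha = quadprog(H, f, [], [], y', 0, zeros(N, 1), C * ones(N, 1));

    % 4) Support vectors, bias, and discriminant function.
    sv = alpha > 1e-4;                    % threshold on alpha
    s  = find(sv & alpha < C - 1e-4, 1);  % a support vector on the margin
    b  = y(s) - (alpha(sv) .* y(sv))' * K(sv, s);
    % Ktest: kernel matrix between all training points and the new points.
    g = @(Ktest) (alpha(sv) .* y(sv))' * Ktest(sv, :) + b;

Predictions on a standardized test matrix Xt then follow as
sign(g((Xs' * Xt + 1).^p)).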
III. Task 2: Classification Results and Analysis
3.1 Experimental results
In task 2, the SVM models obtained in task 1 are tested.
Two of the polynomial kernels did not satisfy Mercer's condition: K(x₁, x₂) =
(x₁ᵀx₂ + 1)⁴ and K(x₁, x₂) = (x₁ᵀx₂ + 1)⁵. For the remaining models, the
classification accuracy is examined on both the training and test data sets. The results
are shown in TABLE I.
Hard margin with linear kernel:
            Training accuracy: 0.9325        Test accuracy: 0.9297

Hard margin with polynomial kernel:
            p=2      p=3      p=4     p=5
    Train   1.0000   1.0000   NA      NA
    Test    0.8535   0.8581   NA      NA

Soft margin with polynomial kernel:
            Training accuracy                       Test accuracy
            C=0.1    C=0.6    C=1.1    C=2.1        C=0.1    C=0.6    C=1.1    C=2.1
    p=1     0.9235   0.9260   0.9270   0.9280       0.9225   0.9245   0.9245   0.9271
    p=2     0.9880   0.9950   0.9950   0.9950       0.8945   0.8874   0.8763   0.8757
    p=3     0.9965   0.9980   0.9985   0.9990       0.8848   0.8802   0.8743   0.8711
    p=4     NA       NA       NA       NA           NA       NA       NA       NA
    p=5     NA       NA       NA       NA           NA       NA       NA       NA

TABLE I: Results of SVM classification
3.2 Analysis
From TABLE I we can see that the hard-margin models with polynomial kernels
p = 2 and p = 3 reach the highest training-set accuracy, while the hard-margin model
with the linear kernel achieves the best performance on the test set.
For better comparison, I plot the accuracy curves in groups. The first comparison
is between models with the same margin setting but different kernels, as shown in the
upper plots of Fig. 1(a) and (b). The other comparison is between models with the same
kernel but different margin settings, shown in the second row of Fig. 1.
Figure 1: Accuracy curves. (a) Training-set accuracy; (b) test-set accuracy.
From the top image in Fig. 1(a) we can conclude that, overall, models with a hard
margin perform better on the training set, while models with a soft margin (C = 0.1)
have the poorest performance. The other three curves (C = 0.6, 1.1, 2.1) are almost
identical. The parameter C controls the cost of violating the constraints: a large value
imposes a heavy penalty on misclassified training data, which leads to a small margin,
while a small value of C leads to a large margin but more misclassified training data.
Thus, when C = 0.1, a small value, the training-set accuracy drops.
However, in the top image of Fig. 1(b), models with a soft margin have higher
accuracy in some cases. This is because models trained with a soft margin, especially
those with small values of C, generalize better to unseen data. If the data are not
linearly separable, which is true in this experiment, a smaller C helps to avoid
overfitting.
In the lower image of Fig. 1(a), it is clear that models with the linear kernel (the
polynomial kernel with p = 1) have the worst performance. This also suggests that
the training data are not linearly separable. The other models perform similarly, with
the p = 3 models having slightly higher training-set accuracy.
On the other hand, as shown in the lower image of Fig. 1(b), models with the linear
kernel outperform the other models overall, which is inconsistent with the discussion
above. In theory, if the data are not linearly separable, a model trained with a linear
kernel cannot perform well on test data. One possible explanation is that the data are
close to linearly separable, with only a few data points violating the discriminant
function.
3.3 Summary
To sum up, to achieve higher accuracy in application, we should adopt a soft
margin with a small value of C. Also, although models with the linear kernel perform
outstandingly on the test set, there is no guarantee that they would perform as well on
new data. Thus, among the above models, the one with a soft margin (C = 0.1) and a
polynomial kernel (p = 2) is the best choice.
IV. Task 3: Design SVM
Besides polynomial kernels, an SVM can also use a Gaussian radial basis function
kernel (i.e., K(x₁, x₂) = exp(−γ‖x₁ − x₂‖²)) or a hyperbolic tangent kernel (i.e.,
K(x₁, x₂) = tanh(β·x₁ᵀx₂ + θ)).
After implementing some examples, I found that hyperbolic tangent kernels do
not satisfy Mercer's condition. Further experiments were therefore conducted with
Gaussian radial basis function kernels.
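As an illustration (a minimal MATLAB sketch, assuming the standardized training
matrix Xs (d × N) from task 1), the RBF Gram matrix and the same eigenvalue-based
Mercer check can be computed as:

    % Gaussian RBF kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    gamma = 10;
    sq = sum(Xs.^2, 1);                   % 1 x N squared norms
    D  = sq' + sq - 2 * (Xs' * Xs);       % pairwise squared distances
    K  = exp(-gamma * D);                 % N x N Gram matrix, PSD in theory

    % A tanh Gram matrix built analogously with tanh(beta*(Xs'*Xs) + theta)
    % can have eigenvalues below the -1e-6 tolerance, which is how it
    % fails the Mercer check here.
    min(eig(K))                           % >= -1e-6 for the RBF kernel

The multipliers α_i are then obtained with quadprog exactly as in task 1,
substituting this Gram matrix.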
To find a proper SVM model, I randomly selected 600 data samples from the given
website (UC Irvine Machine Learning Repository) to form an evaluation data set. The
trained SVM models were applied to the training, test, and evaluation data sets. The
results are shown in TABLE II, and a sketch of the sampling and scoring procedure
follows the table.
            γ = 1                        γ = 10                       γ = 100
         Train    Test    Eval       Train    Test    Eval       Train    Test    Eval
C = 0.1  0.4045   0.3874  0.4067     0.9950   0.8581  0.2900     0.9995   0.8542  0.2900
C = 1    0.4045   0.3874  0.4067     0.9995   0.7083  0.3900     1.0000   0.6126  0
C = 10   0.4045   0.3874  0.4067     1.0000   0.3887  0.4067     1.0000   0.6126  0

TABLE II: Accuracy of models with RBF kernels
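The sampling and scoring sketch mentioned above, assuming the full Spambase feature
matrix X (d × M) and label vector Y (M × 1, in ±1) have been loaded from the UCI
files, and that alpha, y, b, and the discriminant handle g were re-fitted with the RBF
Gram matrix (all variable names here are assumptions):

    % Draw 600 random samples as an evaluation set,
    % standardized with the TRAINING mean and std from task 1.
    M   = size(X, 2);
    idx = randperm(M, 600);
    Xe  = (X(:, idx) - mu) ./ sigma;
    Ye  = Y(idx);

    % RBF kernel between the N training points and the 600 evaluation
    % points, then accuracy of the refitted discriminant function g.
    Ke   = exp(-gamma * (sum(Xs.^2, 1)' + sum(Xe.^2, 1) - 2 * (Xs' * Xe)));
    pred = sign(g(Ke));
    acc  = mean(pred(:) == Ye(:));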
When C is set to less than 1, for example C = 0.1, the training-set accuracy is
very low. When C is given a larger value, for example C = 10, the accuracy on the
test and evaluation sets is poor. Also, when γ is small, the models have low accuracy
on all three data sets, and when γ is large, the accuracy on the evaluation set is
extremely low.
I also experimented with polynomial kernels, but their performance was poor.
Overall, the model with γ = 10 and C = 1 has the most balanced performance
across the three data sets.