# CS 524 Data Mining Programming Assignment 1


## Description

### 1 Finding the most specific/general hypothesis [Warm-up]
Create a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ is two-dimensional,
$x_i = (x_i^1, x_i^2)$, and each $y_i$ is a binary label, i.e. $y_i \in \{0, 1\}$.
First generate the vector $Y$, where each example takes $y_i = 1$ with probability 1/2
and $y_i = 0$ with probability 1/2, with $N = 30$. Now, fixing the $y_i$, sample the
$X$ matrix as follows: [20 Marks]
• If $y_i = 1$ then $x_i^1 \sim U(2, 7)$ and $x_i^2 \sim U(4, 6)$, where $U(a, b)$ represents the uniform distribution between $a$ and $b$.
• If $y_i = 0$ then $x_i^1 \sim U(0, 2) \cup U(7, 9)$ and $x_i^2 \sim U(1, 3) \cup U(6, 8)$.
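
The sampling scheme above can be sketched as follows. This is only one possible reading: the handout does not say how to sample from a union such as $U(0, 2) \cup U(7, 9)$, so the sketch assumes one of the two intervals is picked with equal probability for each sample. The helper names `sample_union` and `make_dataset` and the fixed seed are hypothetical.

```python
import numpy as np

def sample_union(rng, intervals, size):
    """Draw `size` samples, choosing one interval per sample with equal
    probability (an assumed interpretation of U(a, b) ∪ U(c, d))."""
    intervals = np.asarray(intervals, dtype=float)
    choice = rng.integers(len(intervals), size=size)
    low, high = intervals[choice, 0], intervals[choice, 1]
    return rng.uniform(low, high)

def make_dataset(n=30, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)            # y_i = 1 with probability 1/2
    X = np.empty((n, 2))
    pos = y == 1
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    # Positive examples: x^1 ~ U(2, 7), x^2 ~ U(4, 6)
    X[pos, 0] = rng.uniform(2, 7, size=n_pos)
    X[pos, 1] = rng.uniform(4, 6, size=n_pos)
    # Negative examples: each coordinate comes from a union of two intervals
    X[~pos, 0] = sample_union(rng, [(0, 2), (7, 9)], n_neg)
    X[~pos, 1] = sample_union(rng, [(1, 3), (6, 8)], n_neg)
    return X, y

X, y = make_dataset()
```

The resulting `X` and `y` can be passed to any plotting routine, for example a scatter plot colored by label.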
Implement the following with respect to the above dataset:
1. Color-code the examples with $y_i = 1$ as red and $y_i = 0$ as green, and plot the dataset.
2. Write a program to find the most specific and most general hypotheses when the hypothesis
class is all possible rectangles. Plot both hypotheses along with the dataset.
3. Write a program to find the most specific and most general hypotheses when the hypothesis
class is all possible circles. Plot both hypotheses along with the dataset.
4. Mention any observations about items 2 and 3.
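
As a starting point for item 2, the sketch below computes the most specific rectangle under the assumption that the hypothesis class is axis-aligned rectangles, in which case it is the tightest bounding box of the positive examples. The function names and the demo points are hypothetical. The most general hypothesis is left out on purpose, because it depends on how "most general" is defined when several maximal rectangles exclude all negatives.

```python
import numpy as np

def most_specific_rectangle(X, y):
    """Tightest axis-aligned box (x1_min, x1_max, x2_min, x2_max)
    that contains every positive example."""
    pos = X[y == 1]
    if len(pos) == 0:
        raise ValueError("no positive examples")
    (x1_min, x2_min), (x1_max, x2_max) = pos.min(axis=0), pos.max(axis=0)
    return x1_min, x1_max, x2_min, x2_max

def predict_rectangle(rect, X):
    """Label a point 1 if it lies inside the rectangle (borders included)."""
    x1_min, x1_max, x2_min, x2_max = rect
    inside = ((X[:, 0] >= x1_min) & (X[:, 0] <= x1_max) &
              (X[:, 1] >= x2_min) & (X[:, 1] <= x2_max))
    return inside.astype(int)

# Tiny hand-made example (hypothetical points, not the assignment data).
X_demo = np.array([[3.0, 4.5], [6.0, 5.5], [4.0, 5.0], [1.0, 2.0], [8.0, 7.0]])
y_demo = np.array([1, 1, 1, 0, 0])
rect = most_specific_rectangle(X_demo, y_demo)
# A hypothesis is consistent if it reproduces every training label.
consistent = np.array_equal(predict_rectangle(rect, X_demo), y_demo)
```

The same consistency check applies to any candidate rectangle, so it can also be used to validate a most general hypothesis once one is chosen.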
### 2 Polynomial Regression in One Dimension [Easy]
In this question, we will repeat the polynomial regression experiments discussed in class, but with a different function. [30 Marks]
1. Generate 20 data points from the function $f(x) = \cos(2\pi x) + \frac{x}{2\pi} + \text{noise}$, where
$\text{noise} \sim \mathcal{N}(0, 0.004)$ and $x$ ranges from $0$ to $2\pi$.
2. Fit a polynomial regression with the optimal weight vector $w^*$ and plot the curves for polynomial
degrees $M = 1, 2, 3, 5, 7, 10$. Explain your observations by plotting the generated data points
together with the curve obtained for each value of $M$.
3. Repeat the previous experiments with more data points and report your findings.
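
The fitting step can be sketched with an ordinary least-squares solve on a polynomial design matrix. Several choices here are assumptions rather than part of the handout: the inputs are evenly spaced on $[0, 2\pi]$, the value 0.004 is read as the noise variance, and the names `target`, `fit_polynomial`, and `predict_polynomial` are hypothetical.

```python
import numpy as np

def target(x):
    """Noise-free function f(x) = cos(2*pi*x) + x / (2*pi)."""
    return np.cos(2 * np.pi * x) + x / (2 * np.pi)

def fit_polynomial(x, t, degree):
    """Least-squares weights w* for the basis 1, x, ..., x^degree."""
    Phi = np.vander(x, degree + 1, increasing=True)   # design matrix
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def predict_polynomial(w, x):
    return np.vander(x, len(w), increasing=True) @ w

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 20)                     # assumed: evenly spaced inputs
# Assumption: 0.004 is the noise variance, so the standard deviation is sqrt(0.004).
t = target(x) + rng.normal(0.0, np.sqrt(0.004), size=x.shape)

train_rmse = {}
for M in (1, 2, 3, 5, 7, 10):
    w = fit_polynomial(x, t, M)
    train_rmse[M] = np.sqrt(np.mean((predict_polynomial(w, x) - t) ** 2))
```

Evaluating `predict_polynomial` on a dense grid gives the curve to plot for each $M$. Comparing `train_rmse` with the error on freshly sampled points is one way to make the effect of increasing $M$ visible.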
Data Mining Programming Assignment 1
CS 524
January 24, 2022
### 3 Linear Ridge Regression in Multiple Dimensions [Medium]
We will study linear regression with a regularization parameter on the dataset below,
which has multiple attributes. Medical Dataset
(https://www.kaggle.com/sudhirnl7/linear-regression-tutorial/data): The medical cost dataset
comprises independent attributes such as age, sex, BMI (body mass index), children, smoker,
and region. The charge/cost is the dependent variable. Our goal is to predict the individual
medical costs billed by health insurance. [50 Marks]
1. Feature Normalization: As discussed in class, first standardize all the
features by subtracting the mean and dividing by the standard deviation. Verify
your technique by computing the mean and variance of the transformed data and checking
that the mean is 0 and the variance is 1.
2. K-Fold Cross Validation: Randomly partition the data into training, validation,
and test sets. Hold out 20% of the instances as the test set. On the remaining data, perform
the experiments below with K-fold cross validation. You can take K = 10.
3. Ridge Regression: Implement your own function ridgereg(X, Y, λ) that calculates the linear least-squares solution with the ridge regression penalty parameter λ and
returns the regression weights. Use gradient descent to find these weights. Implement predridgereg(X, weights), which returns Y given the input X and the learnt weights.
4. Plot the mean squared error for each of the datasets obtained from K-fold cross validation
against different values of λ. Explain your findings and state which value of λ
you would choose based on the plot.
5. Plot the training error, variance, and test error against different values of λ. Explain
your findings and state which value of λ you would choose based on the plot.
Explain your result in the context of the bias-variance trade-off. Does this value coincide
with the one from the previous question?
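
A minimal sketch of items 1 and 3 is given below, using the function names `ridgereg` and `predridgereg` from the task list. Everything else is an assumption made for illustration: the objective is taken as mean squared error plus $\lambda \lVert w \rVert^2$, an unpenalized bias column is prepended, the `lr` and `n_iters` parameters and their defaults are hypothetical, and the data is synthetic rather than the medical cost dataset.

```python
import numpy as np

def standardize(X):
    """Subtract the column mean and divide by the column standard deviation."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std

def ridgereg(X, Y, lam, lr=0.01, n_iters=5000):
    """Gradient descent on (1/n) * ||Xb w - Y||^2 + lam * ||w[1:]||^2.
    Assumptions: Xb is X with a bias column prepended, and the bias
    weight w[0] is not penalized."""
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = Xb @ w - Y
        grad = (2.0 / n) * (Xb.T @ residual)   # gradient of the squared-error term
        grad[1:] += 2.0 * lam * w[1:]          # gradient of the ridge penalty
        w -= lr * grad
    return w

def predridgereg(X, weights):
    """Predict Y for inputs X using weights returned by ridgereg."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ weights

# Synthetic data for illustration only (not the medical cost dataset).
rng = np.random.default_rng(0)
X_raw = rng.normal(5.0, 3.0, size=(200, 3))
Y = X_raw @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 0.1, size=200)

X_std, mu, sigma = standardize(X_raw)
w_plain = ridgereg(X_std, Y, lam=0.0)
w_ridge = ridgereg(X_std, Y, lam=10.0)
mse_plain = np.mean((predridgereg(X_std, w_plain) - Y) ** 2)
```

For cross validation, the mean and standard deviation would normally be computed on the training folds only and then reused to transform the validation and test data, so that no information leaks from held-out instances.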