## Description

1. Feature Engineering for Environmental Sensor Telemetry Data

In this part of the exercise, we will focus on ‘feature engineering’ (i.e., hand-crafted features) for the Environmental Sensor Telemetry Data. This dataset consists of time-series data collected using three identical, custom-built, breadboard-based sensor arrays mounted on three Raspberry Pi devices. It was created with the hope that temporal fluctuations in the sensor data of each device might enable machine learning algorithms to determine when a person is near one of the devices. You can read more about this dataset on Kaggle using the link provided below. The dataset is stored as a CSV file, which is also provided to you as part of this exercise.

Dataset link: https://kaggle.com/rjconstable/environmental-sensor-telemetry-dataset

Dataset csv filename: iot_telemetry_dataset.csv

1. We now turn our attention to preprocessing of the dataset, which includes one-hot encoding of categorical variables and standardization of non-categorical variables that don’t represent time. Note that no further preprocessing is needed for this data since the dataset does not have any missing entries (see footnote 1).

(a) One-hot encode the categorical variables device, light, and motion. In this exercise (and subsequent exercises), you may use the pandas.get_dummies() function for this purpose. (3 points)

(b) Standardize the dataset by making the data associated with the co, humidity, lpg, smoke, and temp variables zero mean and unit variance. Such standardization, however, must be done separately for the data associated with each device. This is an important lesson for practical purposes, as data samples associated with different devices cannot be assumed to have the same mean and variance. (6 points)

(c) Print the first 20 samples of the preprocessed data (e.g., using the pandas.DataFrame.head() method). (1 point)

(d) Why do you think the ts variable in the dataset has not been touched during preprocessing? Comment as much as you can in a markdown cell. (1 point)

(e) Provide two grouped bar charts, grouped by the three devices, for the original means and variances (i.e., before standardization) associated with the co, humidity, lpg, smoke, and temp variables. Comment on any observations that you can make from these charts. (3 points)
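One possible way to approach parts (a) and (b) is sketched below on a small made-up table rather than the real CSV. The toy values, the reduced column set (only co and temp), and the use of the population standard deviation (ddof=0) are assumptions made for illustration; the same pattern should extend to humidity, lpg, and smoke. Standardization is done first because pandas.get_dummies() replaces the device column that the grouping relies on.

```python
import pandas as pd

# Toy stand-in for iot_telemetry_dataset.csv (all values are made up).
df = pd.DataFrame({
    "ts": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "device": ["a", "a", "a", "b", "b", "b"],
    "co": [0.004, 0.005, 0.006, 0.002, 0.003, 0.004],
    "temp": [22.0, 22.5, 23.0, 19.0, 19.5, 20.0],
    "light": [True, False, True, False, False, True],
    "motion": [False, False, True, False, True, False],
})

numeric_cols = ["co", "temp"]  # assumption: a subset of the five variables

# Standardize within each device: subtract that device's mean and divide
# by that device's standard deviation (ddof=0 gives exactly unit variance).
df[numeric_cols] = df.groupby("device")[numeric_cols].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

# One-hot encode the categorical columns; ts is left untouched.
encoded = pd.get_dummies(df, columns=["device", "light", "motion"])
print(encoded.head(20))
```

Grouping by device before computing the mean and standard deviation is what keeps one device's statistics from leaking into another's.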

2. Map the co, humidity, lpg, smoke, and temp variables for each data sample into the following four independent features (5 points):

(a) mean of the five independent variables (e.g., use mean() function in either pandas or numpy)

(b) variance of the five independent variables (e.g., use var() function in either pandas or numpy)

(c) kurtosis of the five independent variables (e.g., use kurtosis() function in scipy.stats)

(d) skewness of the five independent variables (e.g., use skew() function in scipy.stats)

Print the first 40 samples of the transformed dataset (e.g., using the pandas.DataFrame.head() method), which has four features calculated from five independent variables.

Remark 1: You will notice that the descriptions of the functions in scipy.stats use some terminology that might not be familiar to you. It is my hope that you will try to get a handle on this terminology by digging into resources such as Wikipedia, Stack Overflow, Google, etc. Helping you become comfortable with the idea of lifelong self-learning is one of the goals of this course.
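The per-sample mapping described above can be sketched as follows. This is only an illustration on random stand-in data, not the telemetry dataset; the random values and the choice of population variance (ddof=0, the numpy default) are assumptions. Note that scipy.stats.kurtosis() uses the Fisher definition by default, under which a normal distribution has kurtosis 0.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

cols = ["co", "humidity", "lpg", "smoke", "temp"]

# Stand-in data: 6 samples of the five (already standardized) variables.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((6, 5)), columns=cols)

values = data[cols].to_numpy()

# axis=1 computes each statistic across the five variables of one sample.
features = pd.DataFrame(
    {
        "mean": values.mean(axis=1),
        "variance": values.var(axis=1),        # ddof=0; pandas .var() defaults to ddof=1
        "kurtosis": kurtosis(values, axis=1),  # Fisher definition by default
        "skewness": skew(values, axis=1),
    },
    index=data.index,
)
print(features.head(40))
```

Whichever library you use for the variance, it is worth stating the ddof convention explicitly, since numpy and pandas differ in their defaults.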

2. Feature Learning for Synthetically Generated Data

In this part of the exercise, we will focus on ‘feature learning’ using Principal Component Analysis (PCA). In order to grasp the basic concepts underlying PCA, we limit ourselves in this exercise to synthetically generated three-dimensional data samples (i.e., $p = 3$) that actually lie on a two-dimensional subspace (i.e., $k = 2$).

a) In order to create synthetic data in $\mathbb{R}^3$ that lies on a two-dimensional subspace, we need basis vectors (i.e., a basis) for the two-dimensional subspace. You will generate such a basis (matrix) randomly, as follows.

Footnote 1: Be aware that real-world data in most problems is never this nice!

(i) Create a matrix $A \in \mathbb{R}^{3 \times 2}$ whose individual entries are drawn from a Gaussian distribution with mean 0 and variance 1 in an independent and identically distributed (iid) fashion. While this can be accomplished in a number of ways in Python, you might want to use the numpy.random.randn() function for this purpose. Once generated, this matrix should not be changed for the rest of this exercise. (2 points)

(ii) Matrices with iid Gaussian entries are full rank with probability 1, which makes the matrix $A$ a basis matrix whose column space is a two-dimensional subspace in $\mathbb{R}^3$. Verify this by printing the rank of $A$; it should be 2. (1 point) Note: The numpy.linalg package (https://docs.scipy.org/doc/numpy/reference/routines.linalg.html) is one of the best packages for most linear algebra operations in Python.

(iii) Note that the basis vectors in $A$ are neither unit-norm nor orthogonal to each other. Verify this by printing the norm of each vector in $A$ as well as the inner product between the two vectors in $A$. (1 point)

(iv) Let $S$ denote the subspace corresponding to the column space of the matrix $A$. Generate and print three unique vectors that lie in the subspace $S$. (1 point)
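The steps in part a) can be sketched roughly as follows. The seed and the particular coefficient vectors are arbitrary assumptions for illustration; any three distinct coefficient vectors would give vectors in the column space of A.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable

# (i) 3x2 matrix with iid standard Gaussian entries.
A = np.random.randn(3, 2)

# (ii) Rank of A (expected to be 2 with probability 1).
rank_A = np.linalg.matrix_rank(A)
print("rank:", rank_A)

# (iii) Column norms and the inner product between the two columns.
col_norms = np.linalg.norm(A, axis=0)
inner = A[:, 0] @ A[:, 1]
print("norms:", col_norms, "inner product:", inner)

# (iv) Any linear combination A @ c lies in the column space S of A.
coeffs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -3.0]]).T  # shape (2, 3)
vectors_in_S = A @ coeffs  # each column is one vector in S
print(vectors_in_S)
```

A quick sanity check is that stacking these vectors next to A does not increase the rank beyond 2.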

b) We now turn our attention to the generation of synthetic data. We will resort to a ‘random’ generation mechanism for this purpose. Specifically, each of our (unlabeled) data samples $x \in \mathbb{R}^3$ is going to be generated as follows: $x = Ab$, where $b \in \mathbb{R}^2$ is a random vector whose entries are iid Gaussian with mean 0 and variance 1. Note that we will have a different $b$ for each new data sample (i.e., unlike $A$, it is not fixed across data samples).

(i) Generate 250 data samples $\{x_i\}_{i=1}^{250}$ using the aforementioned mathematical model. (2 points)

(ii) Does each data sample $x_i$ lie in the subspace $S$? Justify your answer. (1 point)

(iii) Store the data samples in a data matrix $X \in \mathbb{R}^{n \times p}$ such that each data sample is a row in this data matrix. What are $n$ and $p$ in this case? Print the dimensionality of $X$ and confirm it matches your answer. (3 points)

(iv) Since we can write $X^T = AB$, where $B \in \mathbb{R}^{2 \times 250}$ is a matrix whose columns are the vectors $b_i$ corresponding to the data samples $x_i$, the rank of $X$ is 2 (Can you see why? Perhaps refer to Wikipedia?). Verify this by printing the rank of $X$. (1 point)
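A minimal sketch of the data generation in part b) is given below. It draws all 250 coefficient vectors at once as the columns of B, which is one of several equivalent ways to do this; the seed and the use of numpy's Generator API instead of numpy.random.randn() are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Fixed basis matrix A (3x2); in the exercise this is the A from part a).
A = rng.standard_normal((3, 2))

# One coefficient vector b_i per sample, stored as the columns of B (2x250).
B = rng.standard_normal((2, 250))

# X^T = A B, so each row of X is one sample x_i = A b_i.
X = (A @ B).T

print("shape of X:", X.shape)
print("rank of X:", np.linalg.matrix_rank(X))
```

Generating the samples in a loop, one b at a time, would produce data with the same distribution; the matrix form just makes the factorization of X explicit.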

c) Before turning our attention to the calculation of PCA features for our data samples, we first investigate the relationship between the eigenvectors of the scaled covariance matrix $X^T X$ and the right singular vectors of $X$.

(i) Compute the singular value decomposition (SVD) of $X$ and the eigenvalue decomposition (EVD) of $X^T X$ and verify (by printing) that:

(a) The right singular vectors of $X$ correspond to the eigenvectors of $X^T X$. Hint: Recall that eigenvalue decomposition does not necessarily list the eigenvalues in decreasing order. You need to be aware of this fact to appropriately match the eigenvectors and singular vectors. (2 points)

(b) The eigenvalues of $X^T X$ are the squares of the singular values of $X$. (2 points)

(c) The energy in $X$, defined by $\|X\|_F^2$, is equal to the sum of squares of the singular values of $X$. (2 points)

(ii) Since the rank of $X$ is 2, the entire dataset spans only a two-dimensional subspace in $\mathbb{R}^3$. We now dig a bit deeper into this.

(a) Since the rank of $X$ is 2, we should ideally have only two nonzero singular values of $X$. However, unless you are really lucky, you will see that none of your singular values are exactly zero. Comment on why that might be happening (and if you are the lucky one, then run your code again and you will hopefully become unlucky :). (2 points)

(b) What do you think is the relationship between the right singular vectors of $X$ corresponding to the two largest singular values and the subspace $S$? Try to be as precise and mathematically rigorous as you can. (3 points)
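The comparisons in part c) could be set up along the following lines. This sketch regenerates a small synthetic X so that it is self-contained; the seed is arbitrary. Two assumptions are worth flagging: numpy.linalg.eigh() is used because $X^T X$ is symmetric (it returns eigenvalues in ascending order, so they are re-sorted), and eigenvectors are compared with singular vectors only up to sign, since both are defined up to a factor of $\pm 1$.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
X = (rng.standard_normal((3, 2)) @ rng.standard_normal((2, 250))).T

# SVD of X: singular values come back in decreasing order; rows of Vt are
# the right singular vectors.
_, s, Vt = np.linalg.svd(X, full_matrices=False)

# EVD of the symmetric matrix X^T X; eigh returns ascending eigenvalues.
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]          # re-sort into decreasing order
evals, evecs = evals[order], evecs[:, order]

# |<v_i, e_i>| should be close to 1 when v_i and e_i agree up to sign.
alignment = np.abs(np.sum(Vt.T * evecs, axis=0))
print("alignment:", alignment)

print("eigenvalues:            ", evals)
print("squared singular values:", s ** 2)

# Energy of X versus the sum of squared singular values.
energy = np.linalg.norm(X, "fro") ** 2
print(energy, np.sum(s ** 2))

# The smallest singular value is tiny but typically not exactly zero.
print("smallest singular value:", s[-1])
```

Printing the smallest singular value next to the largest one gives a feel for the scale of the discrepancy discussed in (ii)(a).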

d) We finally turn our attention to PCA of the synthetic dataset, which is stored in the matrix $X$. Our focus in this problem is on computing PCA features for $k = 2$, computing the projected data (also termed reconstructed data), and computing the sum of squared errors (also termed representation error or PCA error).

(i) Since each data sample $x_i$ lies in a three-dimensional space, we can have up to three principal components of this data. However, based on your knowledge of how the data was created (and the subsequent discussion above), how many principal components should be enough to capture all variation in the data? Justify your answer as much as you can, especially in light of the discussion in class. (2 points)

(ii) While mean centering is an important preprocessing step for PCA, we do not necessarily need to carry out mean centering in this problem since the mean vector for this dataset will have very small entries. Indeed, if we let $x_1$, $x_2$, and $x_3$ denote the first, second, and third components of the random vector $x$, then it follows that $\mathbb{E}[x_k] = 0$, $k = 1, 2, 3$.

i. Formally show that $\mathbb{E}[x_k] = 0$, $k = 1, 2, 3$, for our particular data generation method. (3 points)

ii. Compute the (empirical) mean vector $\hat{\mu}$ from the data matrix $X$ and verify by printing that its entries are indeed small. (2 points)

(iii) Compute the top two principal component directions (loading vectors) $U = \begin{bmatrix} u_1 & u_2 \end{bmatrix}$ of this dataset and print them. (3 points)

(iv) Compute feature vectors $\tilde{x}_i$ from data samples $x_i$ by ‘projecting’ data onto the top two principal component directions of $X$. (2 points)

(v) Reconstruct (approximate) the original data samples $x_i$ from the PCA feature vectors $\tilde{x}_i$ by computing $\hat{x}_i = U \tilde{x}_i$. (2 points)

(vi) Ideally, since the data comes from a two-dimensional subspace, the representation error (aka, the PCA error)

$$\sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2 = \|\hat{X} - X\|_F^2$$

should be zero. Verify (unless, again, you are super lucky) that this is, in fact, not the case. This error, however, is so small that it can be treated as zero for all practical purposes. (2 points)

(vii) Now compute feature vectors $\tilde{x}_i$ from data samples $x_i$ by projecting data onto only the top principal component direction of $X$. (2 points)

(viii) Reconstruct (approximate) the original data samples $x_i$ from the PCA feature vectors $\tilde{x}_i$ by computing $\hat{x}_i = u_1 \tilde{x}_i$. (2 points)

(ix) Compute the representation error $\|\hat{X} - X\|_F^2$ and show that this error is equal to the square of the second-largest singular value of $X$. (2 points)

(x) Using mpl_toolkits.mplot3d, display two 3D scatterplots corresponding to the original data samples $x_i$ and the reconstructed data samples $\hat{x}_i$ corresponding to the top principal component. Comment on the shape of the scatterplot for the reconstructed samples and the mathematical reason for this shape. (4 points)
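The projection, reconstruction, and error computations in part d) can be sketched as below. This is an illustration under stated assumptions, not a reference solution: the data is regenerated with an arbitrary seed, mean centering is skipped as the text above permits, the principal directions are taken from the right singular vectors of X, and the helper name reconstruction_error is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
X = (rng.standard_normal((3, 2)) @ rng.standard_normal((2, 250))).T

# Rows of Vt are the right singular vectors of X, ordered by decreasing
# singular value; the first k of them serve as principal directions here.
_, s, Vt = np.linalg.svd(X, full_matrices=False)


def reconstruction_error(X, Vt, k):
    """Squared Frobenius error after projecting onto the top-k directions."""
    U_k = Vt[:k].T             # p x k matrix of principal directions
    features = X @ U_k         # n x k; row i is the feature vector of x_i
    X_hat = features @ U_k.T   # n x p; row i is the reconstruction of x_i
    return np.linalg.norm(X_hat - X, "fro") ** 2


err_k2 = reconstruction_error(X, Vt, 2)
err_k1 = reconstruction_error(X, Vt, 1)

print("error with k = 2:", err_k2)   # tiny, but usually not exactly zero
print("error with k = 1:", err_k1)
print("second singular value squared:", s[1] ** 2)
```

Because the samples are stored as rows of X, the per-sample operations $U^T x_i$ and $U \tilde{x}_i$ appear in matrix form as right-multiplications by the direction matrix and its transpose.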
