- MAST30034: Applied Data Science Assignment 1


Assignment Overview

Spatiotemporal datasets consist of both space and time dimensions and represent many real-world applications such as medical imaging data, video data, weather patterns, and so on. The aim of this assignment is to familiarize ourselves with dataset generation by time samples and pixel values, its preprocessing by statistical measures, its analysis by parameter estimation, its results visualization by graphs, and its evaluation by performance metrics. In this regard, you are free to choose any tool (R/Python/Matlab) to perform these tasks.

Assignment Details

In this assignment, you are going to attempt six main tasks, given as follows:

1. Generate a spatiotemporal synthetic dataset (that follows a linear regression model),
2. Perform data preprocessing steps,
3. Apply least squares regression and its variants on the generated (synthetic) dataset,
4. Visualize the generated dataset to address critical questions,
5. Select a parameter by using the mean square error (MSE) curve,
6. Use performance metrics to judge the better estimator.

Least Squares Regression (LSR)

The least squares estimator A = (D^T D)^{-1} D^T X solves a linear regression problem of the type X = DA + E; in terms of a cost function it can be written as

    min_{a_v} ||x_v − D a_v||^2,   v = 1, …, V,    (1)

where X ∈ R^{N×V} is the observed dataset, D ∈ R^{N×L} contains the set of regressors, A ∈ R^{L×V} (known as the coefficient matrix) measures the relationship between the observed responses and the regressors, E accounts for the model error, x_v and a_v are the v-th columns of X and A, respectively, and x^k is the k-th row. Also, N is the number of time points, L is the number of regressors, and V is the number of observed variables. Throughout this document, capital bold letters represent matrices and small bold letters represent vectors.
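As a quick illustration of the estimator above, the following minimal numpy sketch (Python being one of the allowed tools) builds a toy X = DA + E and recovers A. The random data and noise level are illustrative assumptions, and `np.linalg.solve` replaces the explicit inverse for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, V = 240, 6, 441                    # time points, regressors, variables
D = rng.standard_normal((N, L))          # design matrix of regressors
A_true = rng.standard_normal((L, V))     # true coefficient matrix
X = D @ A_true + 0.01 * rng.standard_normal((N, V))   # X = DA + E

# least squares estimate A = (D^T D)^{-1} D^T X
A_lsr = np.linalg.solve(D.T @ D, D.T @ X)
print(A_lsr.shape)                       # (6, 441)
```

With the small noise term, the recovered coefficients agree closely with `A_true`, which is exactly the round trip the assignment asks you to perform on the synthetic dataset.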

Synthetic Dataset Generation

A spatiotemporal synthetic dataset X can be constructed using temporal sources called time courses (TC) and spatial sources called spatial maps (SM). The model presented in eq. (1) can be used to generate a synthetic dataset where D becomes TC and A becomes SM. On this synthetic dataset X one can then perform LSR using regressors from D (D = TC, the source TCs) to estimate the response signal strength A (retrieved SMs), and then use A (retrieved SMs) to estimate D (retrieved TCs).

The source TCs and SMs can either be statistically dependent on each other or independent. For this assignment, a simple case is considered that avoids spatial dependence; however, temporal correlation will be introduced to address the topic of multicollinearity (MC). To mitigate the problem of MC, variants of LSR that introduce regularization into their model, making them better at handling MC, can be used. Regularization improves the efficiency of the estimate (reduced variance or minimum mean square error) by introducing a bias into the model that shrinks the coefficient estimates towards certain values (small values near zero).

Ridge Regression (RR)

The l2 norm of a_v in the objective function (1) is penalized by introducing a regularization term λ̃ as

    min_{a_v} ||x_v − D a_v||^2 + λ̃ Σ_{l=1}^{L} ||a_v^l||^2,   v = 1, …, V, l = 1, …, L.    (2)

This results in the ridge regression estimate A = (D^T D + λ̃ I)^{-1} D^T X. Here I is an identity matrix of size L × L, whereas the penalty term λ̃ is a scalar: it has no effect when it is zero (RR produces the LSR estimate), but as it grows towards infinity its impact increases and the coefficients in A get close to zero. Generally, for our synthetic data case, selecting λ between 0 and 1 suffices only if we multiply whatever λ we choose by the number of variables in the dataset, which is V in our case, i.e. λ̃ = λV. We will use a guess-and-check method for parameter selection (λ) in this case, but we will use a plot of the sum of MSE versus the regularization parameter in the case of LASSO regression. Furthermore, RR will not produce a sparse solution, as all coefficients are shrunk towards zero by the same factor, so it fails to eliminate any of them.
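To make the shrinkage behaviour concrete, here is a small numpy sketch comparing the ridge estimate with the LSR estimate under λ̃ = λV. The random D and X are illustrative stand-ins, not the assignment data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, V = 240, 6, 441
D = rng.standard_normal((N, L))
X = D @ rng.standard_normal((L, V))

lam = 0.5                         # a lambda chosen between 0 and 1
lam_tilde = lam * V               # scale by the number of variables, V

A_lsr = np.linalg.solve(D.T @ D, D.T @ X)
A_rr = np.linalg.solve(D.T @ D + lam_tilde * np.eye(L), D.T @ X)

# ridge shrinks the coefficient estimates towards zero
print(np.linalg.norm(A_rr) < np.linalg.norm(A_lsr))   # True
```

Setting `lam_tilde = 0` reproduces the LSR estimate exactly, while larger values push every coefficient further towards zero without ever making any of them exactly zero.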

LASSO Regression (LR)

LASSO regression, on the other hand, yields a sparse solution, as it can eliminate less important coefficients by shrinking them towards zero. Let's use this regression to further enhance the prediction accuracy for our case. Other than the inability of LSR to handle MC, the main reason behind the poor performance of LSR and RR (both producing many false positives while recovering coefficients) is that they incorporate the undesired (noise-carrying) pixels into the estimate of A, which also affects the estimate of D in terms of overfitting. To mitigate the problem of both overfitting and multicollinearity, the l1 norm of a_v in the objective function (1) is penalized by introducing a regularization term ρ as

    min_{a_v} ||x_v − D a_v||^2 + ρ Σ_{l=1}^{L} |a_v^l|,   v = 1, …, V, l = 1, …, L.    (3)

Luckily, a closed-form solution exists via soft thresholding as

    s = 1 / (1.1 ||D^T D||_2),   thr = N ρ s,
    E = a_v^{(i)} + s D^T (x_v − D a_v^{(i)}),
    a_v^{(i+1)} = (1 / (1 + thr)) [ sgn(E) ∘ (|E| − thr · 1_L)_+ ],    (4)

where i = 1, …, I represents the i-th iteration of the algorithm, 0 < ρ < 1 controls the thresholding of a_v, and sgn(·), |·|, (x)_+, and ∘ denote the component-wise sign, the component-wise absolute value, the component-wise max(0, x), and the Hadamard product, respectively. The R code to solve (4) is given below, to answer Questions 2.3 and 2.4.
R Code for LR

step <- 1 / (norm(TC %*% t(TC)) * 1.1)
thr <- rho * N * step
Ao <- matrix(0, nsrcs, 1)
A <- matrix(0, nsrcs, 1)
Alr <- matrix(0, nsrcs, x1 * x2)
for (k in 1:(x1 * x2))
{
  A <- Ao + step * (t(TC) %*% (X[, k] - (TC %*% Ao)))
  A <- (1 / (1 + thr)) * (sign(A) * pmax(replicate(nsrcs, 0), abs(A) - thr))
  for (i in 1:10)
  {
    Ao <- A
    A <- Ao + step * (t(TC) %*% (X[, k] - (TC %*% Ao)))
    A <- (1 / (1 + thr)) * (sign(A) * pmax(replicate(nsrcs, 0), abs(A) - thr))
  }
  Alr[, k] <- A
}
Principal Component Regression (PCR)

PCR also addresses the multicollinearity problem: instead of regressing the observed variables in X on the design matrix D directly, the principal components (PCs) of the design matrix are used as regressors, and only a subset of all the principal components is used for regression. We can apply PCA on the matrix of regressors D to obtain a new regressor matrix Z as

    min_{a_v} ||x_v − D a_v||^2 = min_{b_v} ||x_v − Z b_v||^2,   v = 1, …, V.    (5)

To estimate the PCs, you can use the svd command in Matlab; in Python it is found in the numpy library, and in R it is in the MASS library. For instance, in Matlab:

[U, V, W] = svds(X, 5)
% here the digit 5 points to the fact that we have only decomposed the data to the first five PCs
% the columns of U are the eigenvectors of XX' and U is where you obtain "Z" for your case
% the columns of W are the eigenvectors of X'X, and V contains the singular values

Note that we do not take the PCA of the source TCs in order to generate the dataset X. We keep the TCs intact for dataset generation; the PCs of the TCs are only used as regressors in Z while estimating B.
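For those working in Python rather than Matlab, a rough numpy counterpart of the svds call above is `np.linalg.svd`. The TC matrix below is a random stand-in, and, following the Matlab comment, Z is taken from the left singular vectors U.

```python
import numpy as np

rng = np.random.default_rng(2)
TC = rng.standard_normal((240, 6))       # stand-in for the source TCs

# economy SVD: TC = U diag(S) W^T
U, S, Wt = np.linalg.svd(TC, full_matrices=False)

Z = U[:, :5]                             # keep the first five PCs as regressors
print(Z.shape)                           # (240, 5)
```

Because the columns of U are orthonormal, Z^T Z is the identity, which is what removes the multicollinearity among the new regressors.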
Figure 1: First TC source (left), random correlation matrix (middle), first SM source (right)
Assignment Questions

In both questions, your mark-carrying tasks are highlighted by italic text. For both questions and the code given in the LR section, the value of N is 240, the value of V is 441, x1 = 21 and x2 = 21 (the size of each slice), the value of ρ can be selected between 0 and 1, nsrcs (the number of sources) is 6, and X is the standardized generated dataset. For question 1.4, σ stands for standard deviation. For question 2, always apply the absolute-value operation on the retrieved coefficient values in A, as this will make its plots easier to visualize; and whatever variant of LSR you encounter, you can always estimate D using D = X A^T.

Question 1: Synthetic dataset generation, data preprocessing, & data visualization

1. Construct a matrix TC of size 240 × 6 consisting of six temporal sources using three vectors: 1) onset arrival vector (AV) = [0,20,0,0,0,0], 2) increment vector (IV) = [30,45,60,40,40,40], and 3) duration of ones = [15,20,25,15,20,25]. For instance, you can generate the first TC using these three vectors as AV:IV:N−20 = 0:30:220, with the ones staying active for 15 samples, where N here is 240. This TC is also shown in Figure 1 (left). Mean-center each TC by subtracting its mean, and standardize each TC by dividing it by its standard deviation. This will make the TCs bias-free (centered around the origin) and equally important (unit variance). Plot all TCs as six subplots. Why not normalize (divide by the l2 norm) the TCs instead of standardizing them?
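One possible construction of TC in Python is sketched below, under the assumption that AV:IV:N−20 translates to a 0-based `range(AV, N - 20, IV)` and that each block of ones then lasts for the given duration; check this reading against Figure 1 before relying on it.

```python
import numpy as np

N = 240
AV  = [0, 20, 0, 0, 0, 0]        # onset arrival vector
IV  = [30, 45, 60, 40, 40, 40]   # increment vector
dur = [15, 20, 25, 15, 20, 25]   # duration of ones

TC = np.zeros((N, 6))
for j in range(6):
    for onset in range(AV[j], N - 20, IV[j]):     # e.g. 0:30:220 for the first TC
        TC[onset:onset + dur[j], j] = 1.0

# mean-center and standardize each TC
TC = (TC - TC.mean(axis=0)) / TC.std(axis=0)
```

After the last line each column has zero mean and unit variance, which is exactly the "bias-free and equally important" property the question asks for.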

2. A randomly generated correlation matrix (CM) (illustrating uncorrelatedness among all variables) is shown as a sample in Figure 1 (middle). For your case, construct a CM that represents the correlation values between the 6 variables. Show its plot. Can you tell visually which two TCs are highly correlated? If not, can you tell this from the CM?

3. Construct an array tmpSM of size 6 × (21 × 21) consisting of ones and zeros by placing ones at these pixels along the "vertical,horizontal" directions of the slice: i) 02:06,02:06, ii) 02:06,15:19, iii) 08:13,02:06, iv) 08:13,15:19, v) 15:19,02:06, vi) 15:19,15:19. The first SM source is also shown in Figure 1 (right). Plot these SMs in six subplots. Reshape the array tmpSM into a two-dimensional matrix and call it SM, of size 6 × 441. Using a CM, show whether these 6 vectorized SMs are independent. For our particular case, why is standardization of the SMs, like the TCs, not important? Hint: imagine one of the slices has pixel values of 5, while the others remain at 1.
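A sketch of the tmpSM construction is given below, assuming the pixel ranges are 1-based and inclusive as in Matlab; adjust the offsets if you read them as 0-based.

```python
import numpy as np

x1, x2, nsrcs = 21, 21, 6
# (row_start, row_end, col_start, col_end), 1-based inclusive, one tuple per source
blocks = [(2, 6, 2, 6), (2, 6, 15, 19), (8, 13, 2, 6),
          (8, 13, 15, 19), (15, 19, 2, 6), (15, 19, 15, 19)]

tmpSM = np.zeros((nsrcs, x1, x2))
for j, (r1, r2, c1, c2) in enumerate(blocks):
    tmpSM[j, r1 - 1:r2, c1 - 1:c2] = 1.0    # convert to 0-based slices

SM = tmpSM.reshape(nsrcs, x1 * x2)          # flatten each 21x21 slice to 441 pixels
print(SM.shape)                             # (6, 441)
```

Under this reading, the first source is a 5 × 5 block of ones (25 active pixels), matching the square shown in Figure 1 (right).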


4. Generate zero-mean white Gaussian noise for the temporal and spatial sources, denoted Γt ∈ R^{240×6} and Γs ∈ R^{6×441}. Besides their dimensions, another difference between the spatial and temporal noise is the noise variance, which is 0.25 for Γt and 0.015 for Γs. Using a 6 × 6 CM for each noise type (spatial and temporal), can you show whether they are correlated across sources? Also plot the histogram of both noise sources to see whether they have a normal distribution. Does this normal distribution fulfil the mean and variance = 1.96σ criteria relating to 0.25, 0.015, and zero mean? Is their product ΓtΓs correlated across the V variables?
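The two noise matrices and their source-wise correlation matrices can be sketched as follows (a variance v corresponds to a standard deviation of sqrt(v); the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
N, nsrcs, V = 240, 6, 441

Gt = rng.normal(0.0, np.sqrt(0.25),  size=(N, nsrcs))   # temporal noise, variance 0.25
Gs = rng.normal(0.0, np.sqrt(0.015), size=(nsrcs, V))   # spatial noise, variance 0.015

# 6x6 correlation matrices across the six sources
CM_t = np.corrcoef(Gt, rowvar=False)   # columns of Gt are the sources
CM_s = np.corrcoef(Gs)                 # rows of Gs are the sources
print(CM_t.shape, CM_s.shape)          # (6, 6) (6, 6)
```

For white noise the off-diagonal entries of both matrices should be small (they are sample correlations of independent draws), which is the uncorrelatedness the question asks you to demonstrate.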

5. Generate a synthetic dataset X of size 240 × 441 as X = (TC + Γt) × (SM + Γs). This builds a dataset that follows the model shown in eq. (1). Can the products TC × Γs and Γt × SM exist? If yes, what happened to them, given that if we keep them we cannot fit our model onto (1)? Plot at least 100 randomly selected time series from X; a few of them are shown in Figure 2 (left). Also plot the variance of all 441 variables on a separate plot. What information does this plot give you? At the end, standardize the dataset X, because the source TCs (regressors) are also standardized, and so the dataset should be too.
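Putting the pieces together, the dataset generation and final standardization can be sketched as below; the random TC and SM here are stand-ins for the matrices built in the earlier parts.

```python
import numpy as np

rng = np.random.default_rng(4)
N, nsrcs, V = 240, 6, 441
TC = rng.standard_normal((N, nsrcs))     # stand-in: use your standardized TCs
SM = rng.standard_normal((nsrcs, V))     # stand-in: use your SM matrix
Gt = rng.normal(0.0, np.sqrt(0.25),  size=(N, nsrcs))
Gs = rng.normal(0.0, np.sqrt(0.015), size=(nsrcs, V))

X = (TC + Gt) @ (SM + Gs)                # X = (TC + Gamma_t)(SM + Gamma_s)
X = (X - X.mean(axis=0)) / X.std(axis=0) # standardize the dataset column-wise
print(X.shape)                           # (240, 441)
```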

Question 2: Data analysis, results visualization, & performance metrics

1. The synthetic standardized dataset X that you have generated in Question 1 follows the linear regression model X = DA + E, where the unknown A can be estimated using least squares. The set of regressors D is known, since you used them (the TCs) to generate X, so D = TC. Estimate A (retrieval of the SMs) using the least squares solution A_LSR = (D^T D)^{-1} D^T X. Similarly, retrieve the TCs in D_LSR for the known A_LSR using D_LSR = X A_LSR^T. Plot the six retrieved sources using A_LSR and D_LSR side by side, as shown in Figure 2 (right) for one of the retrieved sources. Make a scatter plot between the 3rd column of D_LSR and the 30th column of the standardized X; you will find a linear relationship between them. Why does this relationship not exist between the 4th column of D_LSR and the same column of X? Hint: look at the slice arrangement; which source TCs do you think contributed to building the 30th data element?

2. Estimate the RR parameters A_RR = (D^T D + λ̃ I)^{-1} D^T X and D_RR, and then compare RR to LSR by estimating two correlation vectors retaining only the maximum absolute correlations: i) between each TC and D_LSR, stored in c_TLSR, and ii) between each TC and D_RR, stored in c_TRR. Calculate the sum of these two correlation vectors. If you have carefully selected the value of λ between 0 and 1, you should end up with Σ c_TRR > Σ c_TLSR; remember λ̃ = λV. Also, for λ = 1000, plot the first vector from A_RR and the corresponding vector from A_LSR. Do you find all values in a_RR^1 shrinking towards zero?
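The "maximum absolute correlation per source" vectors can be computed with a small helper like the hypothetical `max_abs_corr` below (the function name and the self-correlation sanity check are illustrative, not part of the assignment):

```python
import numpy as np

def max_abs_corr(sources, retrieved):
    """For each column of `sources`, return the maximum absolute
    correlation with any column of `retrieved`."""
    out = []
    for j in range(sources.shape[1]):
        r = [abs(np.corrcoef(sources[:, j], retrieved[:, k])[0, 1])
             for k in range(retrieved.shape[1])]
        out.append(max(r))
    return np.array(out)

# sanity check: correlating the sources with themselves gives all ones
rng = np.random.default_rng(5)
TC = rng.standard_normal((240, 6))
print(np.allclose(max_abs_corr(TC, TC), 1.0))   # True
```

In the assignment you would call this with the source TCs against D_LSR and D_RR, and sum the resulting vectors to compare the two estimators.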

Figure 2: A few variables from X (left); first retrieved spatiotemporal source from X via LSR (right)

3. For 21 values of ρ selected between 0 and 1 with an interval of 0.05, estimate the LR parameters A_LR and D_LR, and the sum of MSE using the LR parameters as

    ||X − D_LR A_LR||_2^2 / (NV),

and repeat this process 10 times (10 realizations), each time with a new standardized X (new in terms of Γt and Γs). Then plot the average MSE over these 10 realizations against each value of ρ. At what value of ρ do you find the minimum MSE? Is it okay to select this value? At what value of ρ did the MSE start to increase again (LR diverged)? Here ||·||_2 is the Frobenius norm. Hint: Matlab code for the MSE calculation over one realization and one ρ value, where rr = 1 is the first realization and kk = 1 is the first ρ:

MSE(kk, rr) = sum(sum((Y - Dlr*Alr).^2)) / (N*V)
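A Python counterpart to the Matlab hint is sketched below with stand-in matrices; in the assignment, `D_lr` and `A_lr` would come from the LASSO fit for a given ρ and realization, and the value would be stored in a 21 × 10 grid before averaging over realizations.

```python
import numpy as np

rng = np.random.default_rng(6)
N, L, V = 240, 6, 441
X    = rng.standard_normal((N, V))   # stand-in for the standardized dataset
D_lr = rng.standard_normal((N, L))   # stand-in for the retrieved TCs
A_lr = rng.standard_normal((L, V))   # stand-in for the retrieved SMs

# squared Frobenius norm of the residual, divided by N*V
mse = np.sum((X - D_lr @ A_lr) ** 2) / (N * V)
print(mse > 0)   # True
```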

4. Estimate the LR parameters for the value of ρ that you selected in Question 2.3. Now estimate four correlation vectors retaining only the maximum absolute correlations: i) between each TC and D_RR, stored in c_TRR; ii) between each SM and A_RR, stored in c_SRR; iii) between each TC and D_LR, stored in c_TLR; and iv) between each SM and A_LR, stored in c_SLR. Calculate the sum of these four correlation vectors. If you have carefully selected the value of ρ, you should end up with Σ c_TLR > Σ c_TRR and Σ c_SLR > Σ c_SRR. Plot the estimates of D and A for both RR and LR side by side, in the form of 4 columns, to see the difference visually. You will see a major difference in the estimates of A in terms of false positives. Can you state the reason behind this difference?

5. Estimate the PCs of the TCs and plot their eigenvalues. For which PC is the eigenvalue the smallest? Plot the regressors in Z and the source TCs side by side. Did you notice the deteriorated shape of the PCs? Why has the shape of the TCs been lost? Now, keeping all components in Z, apply LASSO regression on X using ρ = 0.001, and then plot the results of D_PCR and A_PCR side by side (note that A_PCR = B and your regressors are in Z, the PCs of the TCs). Did you notice the inferior performance of PCR compared to the other three regression models? Why is that so?
