Description
1. Correlations.
• When given a data matrix, an easy way to tell if any two columns are correlated is to
look at a scatter plot of each column against each other column. For a warm up, do this:
Look at the data in DF1 in Lab2.zip. Which columns are (pairwise) correlated? Figure
out how to do this with Pandas, and also how to do this with Seaborn.
• Compute the covariance matrix of the data. Write the explicit expression for what this
is, and then use any command you like (e.g., np.cov) to compute the 4 × 4 matrix.
Explain why the numbers that you get fit with the plots you got.
• The above problem in reverse. Generate a zero-mean multivariate Gaussian random
variable in 3 dimensions, Z = (X1, X2, X3), so that (X1, X2) and (X1, X3) are uncorrelated, but (X2, X3) are correlated. Specifically: choose a covariance matrix that has the
above correlation structure, and write this down. Then find a way to generate samples
from this Gaussian. Choose one of the non-zero covariance terms (Cij, if C denotes
your covariance matrix) and plot it against the estimated covariance term as the number of
samples you use grows. The goal is to get a visual representation of how the empirical
covariance converges to the true (or family) covariance.
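As a rough sketch of the last bullet, the snippet below picks one covariance matrix with the required structure and tracks how the estimate of a single entry behaves as the sample size grows. The particular matrix `C`, the value 0.8, the seed, and the helper name `empirical_cov_entry` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Assumed covariance: X1 is uncorrelated with X2 and X3; X2 and X3 are correlated.
C = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.8],
    [0.0, 0.8, 1.0],
])

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible


def empirical_cov_entry(n, i=1, j=2):
    """Draw n zero-mean samples with covariance C and estimate C[i, j]."""
    samples = rng.multivariate_normal(mean=np.zeros(3), cov=C, size=n)
    return np.cov(samples, rowvar=False)[i, j]


sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = [empirical_cov_entry(n) for n in sample_sizes]

for n, est in zip(sample_sizes, estimates):
    print(f"n={n:>6}  estimated C[1,2]={est:+.4f}  true={C[1, 2]:+.4f}")
```

Plotting `estimates` against `sample_sizes` on a logarithmic x-axis, together with a horizontal line at the true value, gives the visual comparison the exercise asks for.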
2. Outliers. Consider the two-dimensional data in DF2 in Lab2.zip. Look at a scatter plot
of the data. It contains two points that look like potential outliers. Which one is “more”
outlying? Propose a transformation of the data that makes it clear that the point at (−1, 1)
is more outlying than the point at (5.5, 5), even though the latter point is “farther away”
from the nearest points. Plot the data again after performing this transformation. Provide
discussion as appropriate to justify your choice of transformation. Hint: if y comes from a
standard Gaussian in two dimensions (i.e., with covariance equal to the two by two identity
matrix), and
Q = \begin{pmatrix} 2 & 1/2 \\ 1/2 & 2 \end{pmatrix},
what is the covariance matrix of the random variable z = Qy? If you are given z, how would
you create a random Gaussian vector with covariance equal to the identity, using z?
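The hint can be checked numerically. The sketch below takes Q to have 2 on the diagonal and 1/2 off the diagonal (an assumed reading of the matrix in the hint), confirms that z = Qy has covariance Q Qᵀ, and shows two ways to map z back to identity covariance. The variable names and the sample size are illustrative choices.

```python
import numpy as np

# Assumed reading of the hint's matrix: 2 on the diagonal, 1/2 off the diagonal.
Q = np.array([[2.0, 0.5],
              [0.5, 2.0]])

rng = np.random.default_rng(0)
y = rng.standard_normal(size=(50_000, 2))  # rows are samples of y ~ N(0, I)
z = y @ Q.T                                # each row is z = Q y

cov_z_theory = Q @ Q.T                     # Cov(z) = Q Cov(y) Q^T = Q Q^T
cov_z_empirical = np.cov(z, rowvar=False)

# If Q is known: w = Q^{-1} z has covariance Q^{-1} (Q Q^T) Q^{-T} = I.
w = np.linalg.solve(Q, z.T).T
cov_w_empirical = np.cov(w, rowvar=False)

# If Q is unknown: use an inverse square root of the estimated covariance.
evals, evecs = np.linalg.eigh(cov_z_empirical)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
w2 = (z - z.mean(axis=0)) @ W.T
cov_w2_empirical = np.cov(w2, rowvar=False)

print(cov_z_theory)
print(np.round(cov_w_empirical, 3))
```

Applying the second transformation to the DF2 data and re-plotting is one way to put both candidate outliers on a scale where Euclidean distance from the origin reflects how unusual each point is.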
3. Even More Standard Error (This is to be completed only after you’ve completed the
last written exercise below). In one of the written exercises below, you derive an expression
for what is called the Standard Error: where $\beta$ denotes the “truth,” $\hat{\beta}$ denotes the value we
compute using least squares linear regression, and Z and e are as in the exercise below, you
find:
$\hat{\beta} - \beta = Ze$.
If we know the distribution of the noise (the distribution generating the noise vectors, $e_i$),
then we know the distribution for the error, $(\hat{\beta} - \beta)$. This allows us to answer the question
given in class: if we solve a regression and obtain value $\hat{\beta}$, how can we tell if it is statistically
significant? The answer is: we compare the size of $\hat{\beta}$ to the spread introduced by the noise
(i.e., the standard error), and we ask: what is the likelihood that the true $\beta = 0$, and what
we observed was purely due to the noise?
If the noise is Gaussian (normal), i.e., $e_i \sim N(0, \sigma^2)$, and if the values of the $x_i$ are normalized,
then we expect an error of size $\sigma/\sqrt{n}$, as this is roughly the standard deviation of the
expression for the error that you derive above. This means: if you have twice as many data points,
you should expect the error to shrink by a factor of about 1.4 (the formula says that the standard
deviation of the error is multiplied by $1/\sqrt{2}$).
Compute this empirically, as follows: We will generate data for a regression problem, solve
it, and see what the error is. Generate data as I did in the example from class: $x_i \sim N(0, 1)$,
$e_i \sim N(0, 1)$. Generate y by $y_i = \beta_0 + x_i \beta + e_i$, where $\beta_0 = -3$ and $\beta = 0$. Note that
since $\beta = 0$, this means that y and x are unrelated! The question we are exploring here is as
follows: when we solve a regression problem, we are not going to find $\hat{\beta} = 0$; we will find
that $\hat{\beta}$ takes some other value, hopefully close to zero. How do we know if the value of $\hat{\beta}$ we
get is statistically meaningful?
• By creating fresh data and each time computing $\hat{\beta}$ and recording $\hat{\beta} - \beta$, compute the
empirical standard deviation of the error for n = 150 (the number we used in class). In
class, in the exercise where I tried to find a linear regression of y vs. noise, we found
$\hat{\beta} = -0.15$. Given your empirical computation of the standard deviation of the error,
how significant is the value −0.15?
• Now repeat the above experiment for different values of n. Plot these values, and on the
same plot, plot $1/\sqrt{n}$. How is the fit?
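One possible way to set up the simulation described in the two bullets is sketched below. The number of repetitions (`trials`), the list of sample sizes, the seed, and the helper names are assumptions chosen for illustration; the regression is fit with `np.polyfit`, which is just one of several ways to solve the least squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta = -3.0, 0.0  # values given in the exercise


def fit_slope(n):
    """Generate one data set of size n and return the least squares slope."""
    x = rng.standard_normal(n)
    e = rng.standard_normal(n)
    y = beta0 + beta * x + e
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    return slope


def empirical_std_error(n, trials=2000):
    """Standard deviation of (beta_hat - beta) over repeated fresh data sets."""
    errors = np.array([fit_slope(n) - beta for _ in range(trials)])
    return errors.std(ddof=1)


sample_sizes = [50, 150, 500, 1500]
std_errors = {n: empirical_std_error(n) for n in sample_sizes}

for n, se in std_errors.items():
    print(f"n={n:>5}  empirical std={se:.4f}  1/sqrt(n)={1 / np.sqrt(n):.4f}")
```

For n = 150 the reference value is 1/√150 ≈ 0.082, so if the empirical standard deviation lands near it, an estimate of −0.15 sits roughly 1.8 standard deviations from zero.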
4. Names and Frequencies. The goal of this exercise is for you to get more experience with
Pandas, and to get a chance to explore a cool data set. Download the file Names.zip from
Canvas. This contains the frequency of all names that appeared more than 5 times on a social
security application from 1880 through 2015.
• Write a program that on input k and XXXX, returns the top k names from year
XXXX.
• Write a program that on input Name returns the frequency for men and women of the
name Name.
• It could be that names are more diverse now than they were in 1880, so that a name may
be relatively the most popular, though its frequency may have been decreasing over the
years. Modify the above to return the relative frequency. Note that in the coming
lectures we will learn how to quantify diversity using entropy.
• Find all the names that used to be more popular for one gender, but then became more
popular for another gender.
• (Optional) Find something cool about this data set.
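A minimal Pandas sketch of the first three bullets follows. It assumes the data has already been loaded into a single DataFrame with columns `year`, `name`, `sex`, and `count`; that layout, the function names, and the toy rows are all assumptions made for illustration, since the actual structure of Names.zip is not described here.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real counts from Names.zip.
names = pd.DataFrame({
    "year":  [1880, 1880, 1880, 1880, 2015, 2015, 2015],
    "name":  ["Mary", "John", "Anna", "Leslie", "Emma", "Noah", "Leslie"],
    "sex":   ["F", "M", "F", "M", "F", "M", "F"],
    "count": [700, 900, 260, 80, 2000, 1900, 50],
})


def top_k_names(df, k, year):
    """Return the k most frequent names in the given year, summed over sex."""
    in_year = df[df["year"] == year]
    totals = in_year.groupby("name")["count"].sum()
    return totals.nlargest(k).index.tolist()


def relative_frequency(df, name):
    """Share of each year's total count carried by `name`, split by sex."""
    yearly_total = df.groupby("year")["count"].sum()
    by_sex = (df[df["name"] == name]
              .pivot_table(index="year", columns="sex", values="count",
                           aggfunc="sum", fill_value=0))
    # Divide each year's row by that year's total; drop years without the name.
    return by_sex.div(yearly_total, axis=0).dropna(how="all")


print(top_k_names(names, 2, 1880))
print(relative_frequency(names, "Leslie"))
```

With a table like the one returned by `relative_frequency`, a name that switched genders shows up as a change over time in which column is larger.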
5. Visualization Tools and Missing/Hidden Values. Visualization is important both for
exploring the data and for explaining what you have done. There are a huge number
of such tools now available. This exercise walks through various functionalities of matplotlib
and pandas.
• The first part of this exercise was created by Dataquest. Run through the commands
given in this tutorial: https://www.dataquest.io/blog/matplotlib-tutorial/ and understand the code.
• Suppose that you would now like to plot some of the results by state. As you will see,
the state information is sometimes missing, and other times it comes in varying forms.
Figure out how to aggregate the results by state. The challenge here: how many of the
tweets can you (correctly) assign to a state? Note: depending on how well you want to
do (i.e., how many tweets you want to correctly assign to their state), this is not an easy
problem!
6. More Visualization Tools – Optional. This exercise was also created by Dataquest. Run
through the exercise https://www.dataquest.io/blog/python-data-visualization-libraries/
for more visualization tools, including some that allow you to plot points on a map, and also
to create interactive maps (zoom in, etc.).
Written Questions
1. Standard Error: It is important to develop an intuition for how much error we should
“expect” when we solve a particular statistical problem. As the number of samples increases, we
should expect the error to decrease. But by how much? In the first lab, you generated samples
from a univariate (Problem 3) and multivariate (Problem 4) Gaussian with given parameters,
and then you were asked to estimate those parameters from the data you generated. In this
exercise, we derive explicitly the relationship that you (should have) observed doing those
exercises.
Suppose $Z \sim N(\mu, \sigma^2)$, i.e., Z is a univariate Gaussian (a.k.a. normal) random variable with
mean $\mu$ and variance $\sigma^2$. Suppose that you see n samples from Z, i.e., you see data $z_1, \dots, z_n$.
Let $z_{\mathrm{avg}} = \sum_{i=1}^{n} z_i / n$ denote the sample mean. We want to answer: how close is $z_{\mathrm{avg}}$ to $\mu$?
Note that $z_{\mathrm{avg}}$ is a random variable, so we need to quantify in a probabilistic way how close
$z_{\mathrm{avg}}$ is to $\mu$.
• Suppose $Z \sim N(0, 1)$. This is also called a standard normal random variable. For
n = 10,000, compute the probability that $z_{\mathrm{avg}} > 0.1$, $z_{\mathrm{avg}} > 0.01$, and $z_{\mathrm{avg}} > 0.001$.
• Now for the general case: suppose $Z \sim N(\mu, \sigma^2)$, and for general n, compute the probability that $z_{\mathrm{avg}} - \mu > n^{-1/3}$, $z_{\mathrm{avg}} - \mu > n^{-1/2}$, and $z_{\mathrm{avg}} - \mu > n^{-2/3}$. For your calculations,
you can let n scale if that makes things easier.
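The tail probabilities in these bullets can be evaluated numerically once one uses the fact that the mean of n independent N(µ, σ²) samples is itself Gaussian with mean µ and variance σ²/n. The sketch below does this with the standard library only; the function names are illustrative.

```python
import math


def upper_tail(t):
    """P(N(0, 1) > t), computed from the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def prob_mean_exceeds(threshold, n, sigma=1.0):
    """P(z_avg - mu > threshold) when z_avg ~ N(mu, sigma^2 / n)."""
    return upper_tail(threshold * math.sqrt(n) / sigma)


n = 10_000
for threshold in (0.1, 0.01, 0.001):
    print(f"P(z_avg > {threshold}) = {prob_mean_exceeds(threshold, n):.6g}")
```

The same function covers the general case: a threshold of n^(−1/2) always standardizes to 1/σ regardless of n, while thresholds that shrink more slowly or more quickly than n^(−1/2) push the standardized value toward infinity or toward zero as n grows.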
2. More Standard Error: Consider a one dimensional regression problem, where the offset is
zero. Thus, we are trying to fit a function of the form h(x) = x · β. Suppose that the truth
is a noisy version of this – that is, the true model according to which data are generated is:
$y_i = x_i \cdot \beta + e_i$.
Everything in the above equation is a scalar, i.e., $y_i, x_i, \beta, e_i \in \mathbb{R}$. Here, $e_i$ represents independent noise that is not modeled by the linear relationship.
• When we have n data points, the least squares objective reads:
$\min_{\beta} : \frac{1}{n} \sum_{i=1}^{n} (x_i \beta - y_i)^2$.
Show that this is a quadratic function in $\beta$, that is, if we expand it, it has the form
$A\beta^2 + B\beta + C$.
• Compute A, B, and C explicitly, i.e., as explicit functions of the data, $\{x_i, y_i\}$. Note that
these should not be functions of $\beta$. Show that A ≥ 0 regardless of the values of the data.
• Since A ≥ 0, this is a quadratic function whose graph opens up. This means that it is
convex, and therefore the solution is characterized as the solution obtained by setting
the first derivative (w.r.t. β) equal to zero. Do this, and therefore explicitly solve for the
solution $\hat{\beta}$. This is the one-dimensional form of what is known as the normal equations.
Hint: we did this problem in class.
• Now using the one dimensional expression from the second part, and plugging in the
relationship $y_i = x_i \cdot \beta + e_i$, write
$\hat{\beta} = \beta + Ze$,
where e denotes the vector of all the errors, $e_i$, added in each stage, and where Z is a
matrix of appropriate dimension. What is Z, explicitly?
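One way to organize the algebra in the bullets above is sketched here, as something to check your own work against. It assumes at least one $x_i$ is non-zero, so that the denominator does not vanish, and it writes Z as a 1 × n row vector, which is one convenient convention.

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}(x_i\beta - y_i)^2
  &= \underbrace{\Big(\frac{1}{n}\sum_{i=1}^{n} x_i^2\Big)}_{A}\,\beta^2
   + \underbrace{\Big(-\frac{2}{n}\sum_{i=1}^{n} x_i y_i\Big)}_{B}\,\beta
   + \underbrace{\frac{1}{n}\sum_{i=1}^{n} y_i^2}_{C}, \\
2A\hat{\beta} + B = 0
  \;&\Longrightarrow\;
  \hat{\beta} = -\frac{B}{2A}
  = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}, \\
\hat{\beta}
  &= \frac{\sum_{i=1}^{n} x_i (x_i\beta + e_i)}{\sum_{i=1}^{n} x_i^2}
   = \beta + \frac{\sum_{i=1}^{n} x_i e_i}{\sum_{i=1}^{n} x_i^2}
   = \beta + Ze,
\qquad
Z = \frac{1}{\sum_{j=1}^{n} x_j^2}\,(x_1, \dots, x_n).
\end{align*}
```

Here A is a sum of squares divided by n, which is why it cannot be negative.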
(Optional) Repeat the last two questions in the general case. That is, derive the normal equations
and the standard error for the general (vector) case, where our model is
$y_i = x_i^\top \beta + e_i$,
where now $x_i, \beta \in \mathbb{R}^p$, and $x_i^\top \beta$ denotes the dot product.
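For the vector case, a numerical check can help confirm a derivation. The sketch below stacks the $x_i^\top$ as rows of a matrix X, solves the normal equations $(X^\top X)\hat{\beta} = X^\top y$, and verifies that $\hat{\beta} - \beta = Ze$ with $Z = (X^\top X)^{-1} X^\top$. The dimensions, the "true" coefficients, and the seed are arbitrary illustrative choices, and X is assumed to have full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))      # row i is x_i^T
beta = np.array([1.0, -2.0, 0.5])    # arbitrary "true" coefficients
e = rng.standard_normal(n)           # noise vector
y = X @ beta + e

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Z = (X^T X)^{-1} X^T is p-by-n, and beta_hat - beta = Z e.
Z = np.linalg.solve(X.T @ X, X.T)

print("beta_hat - beta:", beta_hat - beta)
print("Z @ e          :", Z @ e)
```

Setting p = 1 reduces Z to a single row proportional to the $x_i$, which connects this back to the scalar exercise.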