CSE 544, Probability and Statistics for Data Science, Assignment 3: Non-Parametric Inference


1. MSE in terms of bias (Total 5 points)
For some estimator $\hat{\theta}$, show that $\mathrm{MSE}(\hat{\theta}) = \mathrm{bias}^2(\hat{\theta}) + \mathrm{Var}(\hat{\theta})$. Show your steps clearly.
2. Programming fun with $\hat{F}$ (Total 17 points)
For this question, we require some programming; you should only use Python. You may use the scripts
provided on the class website as templates. Do not use any libraries or functions to bypass the
programming effort. Please submit your code via the Google form (to be announced) with sufficient
documentation so the code can be evaluated. Attach each plot as a separate sheet to your submission.
All plots must be neat, legible (large fonts), with appropriate legends, axis labels, titles, etc.
(a) Write a program to plot $\hat{F}$ (the empirical CDF, or eCDF) given a list of samples as input. Your plot must
have y-limits from 0 to 1, and x-limits from 0 to the largest sample. Show the input points as crosses
on the x-axis. (3 points)
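A minimal sketch of one way to implement (a), assuming matplotlib; all names are illustrative, not prescribed:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(samples, ax=None):
    # eCDF: F_hat(t) = (# samples <= t) / n, drawn as a right-continuous step.
    ax = ax or plt.gca()
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    ax.step(x, y, where='post', label='eCDF')
    ax.plot(x, np.zeros_like(x), 'kx')   # input points as crosses on the x-axis
    ax.set_xlim(0, x.max())
    ax.set_ylim(0, 1)
    ax.set_xlabel('x')
    ax.set_ylabel('eCDF')
    return ax
```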
(b) Use an integer random number generator with range [1, 99] to draw n=10, 100, and 1000 samples.
Feed these as input to (a) to generate three plots. What do you observe? (2 points)
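For instance, (b) could drive the sketch above as follows (np.random.default_rng().integers draws from [low, high)):

```python
rng = np.random.default_rng()
for n in (10, 100, 1000):
    fig, ax = plt.subplots()
    plot_ecdf(rng.integers(1, 100, size=n), ax=ax)   # integers in [1, 99]
    ax.set_title(f'eCDF, n={n}')
plt.show()
```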
(c) Modify (a) above so that it takes as input a collection of lists of samples; that is, a 2-D array of sorts
where each row is a list of samples (as in (a)). The program should now compute the average $\hat{F}$
across the rows and plot it. That is, first compute the $\hat{F}$ for each row, then average them
across rows, and plot the average $\hat{F}$. Show all input points as crosses on the x-axis. (3 points)
(d) Use the same integer random number generator from (b) to draw n=10 samples for each of m=10, 100,
and 1000 rows. Feed these as input to (c) to generate three plots. What do you observe? (2 points)
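One way to average eCDFs across rows for (c) and (d), sketched under the assumption that each row's eCDF is evaluated on a common grid (names illustrative):

```python
def plot_avg_ecdf(rows, ax=None):
    # Evaluate each row's eCDF on a shared grid, then average pointwise.
    ax = ax or plt.gca()
    rows = np.asarray(rows, dtype=float)
    grid = np.linspace(0, rows.max(), 500)
    ecdfs = [(row[:, None] <= grid).mean(axis=0) for row in rows]
    ax.step(grid, np.mean(ecdfs, axis=0), where='post', label='average eCDF')
    ax.plot(rows.ravel(), np.zeros(rows.size), 'kx')  # all input points as crosses
    ax.set_xlim(0, rows.max())
    ax.set_ylim(0, 1)
    return ax
```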
(e) Modify the program from (a) to now also add 95% Normal-based CI lines for $\hat{F}$, given a list of
samples as input. Draw a plot showing $\hat{F}$ and the CI lines for the q2.dat data file (799 samples) on
the class website. Use x-limits of 0 to 2, and y-limits of 0 to 1. (3 points)
(f) Modify the program from (e) to also add 95% DKW-based CI lines for $\hat{F}$. Draw a single plot showing
$\hat{F}$ and both sets of CI lines (Normal and DKW) for the q2.dat data. Which CI is tighter? (4 points)
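A sketch covering (e) and (f) together, using the usual band formulas (pointwise Normal: $\hat{F} \pm z_{0.975}\sqrt{\hat{F}(1-\hat{F})/n}$; uniform DKW: $\hat{F} \pm \sqrt{\ln(2/\alpha)/(2n)}$) and continuing with the imports above:

```python
data = np.loadtxt('q2.dat')
x = np.sort(data)
n = len(x)
F = np.arange(1, n + 1) / n
norm_half = 1.96 * np.sqrt(F * (1 - F) / n)      # 95% pointwise Normal half-width
dkw_half = np.sqrt(np.log(2 / 0.05) / (2 * n))   # 95% uniform DKW half-width

plt.step(x, F, where='post', label='eCDF')
for half, name in ((norm_half, 'Normal CI'), (dkw_half, 'DKW CI')):
    plt.step(x, np.clip(F - half, 0, 1), where='post', label=name)
    plt.step(x, np.clip(F + half, 0, 1), where='post')
plt.xlim(0, 2)
plt.ylim(0, 1)
plt.legend()
plt.show()
```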
3. Plug-in estimates (Total 10 points)
(a) Show that the plug-in estimator of the variance of X is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$, where $\bar{X}_n$ is the
sample mean, $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. (3 points)
(b) Show that the bias of $\hat{\sigma}^2$ is $-\sigma^2/n$, where $\sigma^2$ is the true variance. (4 points)
(c) The kurtosis of a RV X with mean $\mu$ and variance $\sigma^2$ is defined as $\mathrm{Kurt}[X] = E[(X - \mu)^4] / \sigma^4$.
Derive the plug-in estimate of the kurtosis in terms of the sample data. (3 points)
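As orientation for (c): the plug-in principle replaces each population expectation with the corresponding sample average, so the answer should take roughly this shape (a sketch, not the full derivation):

```latex
\widehat{\mathrm{Kurt}}[X]
  = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^4}
         {\big(\tfrac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2\big)^2}
```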
4. Consistency of eCDF (Total 10 points)
Let D = {X1, X2, …, Xn} be a set of i.i.d. samples with true CDF F. Let $\hat{F}$ be the eCDF for D, as defined in
class.
(a) Derive E($\hat{F}$) in terms of F. Start by writing the expression for $\hat{F}$ at some α. (3 points)
(b) Show that bias($\hat{F}$) = 0. (2 points)
(c) Derive se($\hat{F}$) in terms of F and n. (3 points)
(d) Show that $\hat{F}$ is a consistent estimator. (2 points)
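A starting point for all four parts, restating the in-class definition: the eCDF at a point is an average of Bernoulli indicators, so its mean, bias, and se follow from Binomial-proportion facts.

```latex
\hat{F}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le \alpha),
\qquad
I(X_i \le \alpha) \sim \mathrm{Bernoulli}\big(F(\alpha)\big)
```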
5. Histogram estimator (Total 13 points)
A histogram is a representation of sample data grouped by bins. Consider a true distribution X with range
[0, 1). Let m ∈ Z⁺ and b < 1 be such that m · b = 1, where m is the number of bins and b is the bin size. Bin
i, denoted as $B_i$, where 1 ≤ i ≤ m, contains all data samples that lie in the range $\left[\frac{i-1}{m}, \frac{i}{m}\right)$.
(a) Let $p_i$ denote the probability that the true distribution lies in $B_i$. As in class, derive $\hat{p}_i$ in terms of
indicator RVs of i.i.d. data samples (the $X_i$) drawn from the true distribution, X. (3 points)
(b) The histogram estimator for some $x \in [0, 1)$ is defined as $\hat{h}(x) = \hat{p}_j / b$, where $x \in B_j$. Show that
$E[\hat{h}(x)] = f(x)$ when b → 0, where f(x) is the true pdf of X. (4 points)
(c) Use all of the weather.dat data on the class website and plot its histogram estimate (that is, plot
$\hat{h}(x) = \hat{p}_j / b = \hat{p}_j\ \forall x \in B_j$) using Python with a bin size of 1. Do not use any in-built libraries to
bypass the programming effort. Use the same instructions as in Q2 for legibility and format of plot
submissions. Submit your code via the Google form, labeled as q5.py. (3 points)
(d) Now use the histogram estimator ($\hat{h}(x) = \hat{p}_j / b\ \forall x \in B_j$; b = 1) as an estimate of the pdf based on the
weather.dat dataset. Based on these pdf estimates, plot the CDF of the dataset using Python. Attach
the plot to your hardcopy submission. (3 points)
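A hedged sketch of the counting logic behind (c) and (d); the parsing of weather.dat is an assumption to adapt to the actual file layout:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('weather.dat')     # assumes one numeric sample per row
b = 1.0                              # bin size per the problem statement
lo = np.floor(data.min())
edges = np.arange(lo, np.floor(data.max()) + 2 * b, b)
idx = np.floor((data - lo) / b).astype(int)            # bin index of each sample
p_hat = np.bincount(idx, minlength=len(edges) - 1) / len(data)
h_hat = p_hat / b                    # histogram pdf estimate (= p_hat when b = 1)

plt.bar(edges[:-1], h_hat, width=b, align='edge', label='histogram estimate')
plt.step(edges[1:], np.cumsum(p_hat), where='post', label='CDF estimate')
plt.xlabel('x')
plt.legend()
plt.show()
```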
6. Properties of estimators (Total 5 points)
Find the bias, se, and MSE in terms of $\theta$ for $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where the $X_i$ are i.i.d. ~ Poisson(θ). Show your
work. Hint: Follow the same steps as in class, assuming the true distribution is unknown. Only at the
very end use the fact that the unknown distribution is Poisson(θ) to get the final answers in terms of $\theta$.
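Standard facts that make the last step collapse (stated for reference, not as the derivation): a Poisson(θ) RV has mean and variance both equal to θ, and for an average of i.i.d. RVs,

```latex
E[\hat{\theta}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i],
\qquad
\mathrm{Var}(\hat{\theta}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
```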
7. Kernel density estimation (Total 15 points)
As usual, submit all code for this Q on the Google form. Histogram density estimation has several
drawbacks, such as discontinuities in the estimate and dependence of the estimate on where the bins
start. To alleviate these shortcomings to a certain extent, we will use another type of non-parametric
density estimation technique called kernel density estimation (KDE). The formal definition of KDE is:
for a data sample $D = \{X_1, X_2, \dots, X_n\}$, the KDE at any point $x$ is given by

$$\hat{p}_{KDE}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)$$

where $K(\cdot)$ is called the kernel function, which should be a smooth, symmetric, and valid density
function. The parameter $h > 0$ is called the smoothing bandwidth and controls the amount of smoothing.
(a) Density Estimation: Generate a sample of 800 data points $D = \{X_1, X_2, \dots, X_{800}\}$ which are i.i.d. and
sampled from a mixture of Normal distributions such that with prob. 0.25 it is Nor(0,1), with prob.
0.25 it is Nor(3,1), with prob. 0.25 it is Nor(6,1), and with the remaining prob. 0.25 it is Nor(9,1). Note
that this is the true distribution. A simple way to sample data from this distribution is to sample a
RV $U \sim \mathrm{Uniform}[0, 1]$; if $U \le 0.25$, sample from the 1st Normal (that is, Nor(0,1)), if $U \in (0.25, 0.5]$ then
sample from the 2nd Normal (that is, Nor(3,1)), and so on. Now obtain the KDE estimate of the PDF
$\hat{p}_{KDE}(\alpha)$ for $\alpha \in \{-5, -4.9, -4.8, \dots, 10\}$ (use np.arange(-5, 10, 0.1)) using the Parzen window
kernel, where the density estimate is defined by

$$\hat{p}_{KDE}(\alpha) = \frac{1}{nh} \sum_{i=1}^{n} I\left\{|\alpha - X_i| \le \frac{h}{2}\right\},$$

where I{·} is the indicator RV and n = 800 is the sample size.
Write a Python function which takes as input (a) the data ($D$) and (b) the smoothing bandwidth ($h$), and
returns a list of KDE estimates for all the points in np.arange(-5, 10, 0.1). Using this
function, generate plots (in the same figure) of the KDE estimate of the PDF for all the values of
$\alpha \in \{-5, -4.9, -4.8, \dots, 10\}$ for the values of $h \in \{0.1, 1, 7\}$, along with the true PDF at each $\alpha$; note that
the true distribution is the mixture of Normals stated above. To numerically get the pdf of a given
Normal in Python, try scipy.stats.norm.pdf. The master plot should have the alpha values on the x-axis,
ranging from -5 to 10. You should have 4 lines: one for each h value and one for the true
distribution. Make sure to have a useful legend. What are your observations regarding the effect of
$h$ on the KDE estimate $\hat{p}_{KDE}(\alpha)$? (8 points)
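A sketch of one way to implement (a); function names and the use of a default_rng are illustrative:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng()
alphas = np.arange(-5, 10, 0.1)
means = np.array([0, 3, 6, 9])

def sample_mixture(n):
    # Equal-weight mixture: pick one of the four means uniformly, add N(0,1) noise.
    return rng.normal(means[rng.integers(0, 4, size=n)], 1.0)

def kde_parzen(data, h):
    # Parzen window: fraction of samples within h/2 of each alpha, scaled by 1/h.
    dist = np.abs(alphas[:, None] - data[None, :])
    return (dist <= h / 2).sum(axis=1) / (len(data) * h)

D = sample_mixture(800)
true_pdf = np.mean([norm.pdf(alphas, mu, 1) for mu in means], axis=0)
for h in (0.1, 1, 7):
    plt.plot(alphas, kde_parzen(D, h), label=f'h = {h}')
plt.plot(alphas, true_pdf, 'k--', label='true PDF')
plt.xlabel('alpha')
plt.legend()
plt.show()
```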
(b) Bias and Variance: Now we will study the effect of the parameter $h$ on the bias and variance of the KDE
estimates. Repeat the trial of generating 800 data points 150 times in the same way as above. Let
each row represent a trial (so you should have a matrix with 150 rows, and each row (trial) should
have 800 columns). Let $\hat{p}^{\,i}_{KDE}(\alpha)$ be the KDE estimate at $\alpha$ for the $i$-th trial, and let $p(\alpha)$ be the true pdf
at $\alpha$. Then the expectation, bias, Var, and MSE are given by:
$$E[\hat{p}_{KDE}(\alpha)] = \frac{1}{150} \sum_{i=1}^{150} \hat{p}^{\,i}_{KDE}(\alpha),$$

$$\mathrm{Var}\big(\hat{p}_{KDE}(\alpha)\big) = \frac{1}{150} \sum_{i=1}^{150} \big(\hat{p}^{\,i}_{KDE}(\alpha) - E[\hat{p}_{KDE}(\alpha)]\big)^2,$$

$$\mathrm{Bias}\big(\hat{p}_{KDE}(\alpha)\big) = \Big(\frac{1}{150} \sum_{i=1}^{150} \hat{p}^{\,i}_{KDE}(\alpha)\Big) - p(\alpha),$$

$$\mathrm{MSE}\big(\hat{p}_{KDE}(\alpha)\big) = \mathrm{Var}\big(\hat{p}_{KDE}(\alpha)\big) + \mathrm{Bias}^2\big(\hat{p}_{KDE}(\alpha)\big).$$
To observe the effect of $h$ on the bias and variance, first calculate the total bias and variance (averaged
across all points) as

$$\mathrm{Bias}^2_{tot}(h) = \frac{1}{|S|} \sum_{\alpha \in S} \mathrm{Bias}^2\big(\hat{p}_{KDE}(\alpha)\big)
\qquad \text{and} \qquad
\mathrm{Var}_{tot}(h) = \frac{1}{|S|} \sum_{\alpha \in S} \mathrm{Var}\big(\hat{p}_{KDE}(\alpha)\big),$$

where $S = \{-5, -4.9, -4.8, \dots, 10\}$ is the set of points at which you are estimating the density.
Write Python code to answer the following questions:
(i) For each value of $h \in \{0.01, 0.1, 0.3, 0.6, 1, 3, 7\}$, calculate the bias and variance as defined above
and generate two plots, one for $\mathrm{Bias}^2_{tot}(h)$ vs. $h$ and another for $\mathrm{Var}_{tot}(h)$ vs. $h$. What do you
observe from these plots? (5 points)
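A sketch of the trial loop for (i), reusing the sample_mixture, kde_parzen, alphas, and true_pdf names from the sketch above:

```python
hs = [0.01, 0.1, 0.3, 0.6, 1, 3, 7]
trials = np.array([sample_mixture(800) for _ in range(150)])   # 150 x 800 matrix
bias2_tot, var_tot = [], []
for h in hs:
    est = np.array([kde_parzen(row, h) for row in trials])     # 150 x |S| estimates
    bias2_tot.append(np.mean((est.mean(axis=0) - true_pdf) ** 2))
    var_tot.append(np.mean(est.var(axis=0)))   # ddof=0 matches the 1/150 definition

for vals, name in ((bias2_tot, 'Bias^2_tot(h)'), (var_tot, 'Var_tot(h)')):
    plt.figure()
    plt.plot(hs, vals, marker='o')
    plt.xlabel('h')
    plt.ylabel(name)
plt.show()
```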
(ii) If we use MSE as a measure to select the optimal $h$, i.e., $h^* = \mathrm{argmin}_h\,\big(\mathrm{Var}_{tot}(h) + \mathrm{Bias}^2_{tot}(h)\big)$,
what is the optimal value of $h$ you should use? (2 points)
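Given the arrays from the sketch above, the selection reduces to one line:

```python
h_star = hs[int(np.argmin(np.array(bias2_tot) + np.array(var_tot)))]
print('optimal h by total MSE:', h_star)
```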