Description
This project consists of three simulation studies. Unlike a homework assignment, these “exercises” are not broken down into parts (e.g., a, b, c), and so your analysis will not be similarly partitioned. Instead, your document should be organized more like a true project report, and it should use the overall format:
- Simulation Study 1
- Simulation Study 2
- Simulation Study 3
Within each of the simulation studies, you should use the format:
- Introduction
- Methods
- Results
- Discussion
The introduction section should relay what you are attempting to accomplish. It should provide enough background to your work such that a reader would not need this directions document to understand what you are doing. (Basically, assume the reader is mostly familiar with the concepts from the course, but not this project.)
The methods section should contain the majority of your “work.” This section will contain the bulk of the R code that is used to generate the results. Your R code is not expected to be perfect idiomatic R, but it is expected to be understood by a reader without too much effort. Use RMarkdown and code comments to your advantage to explain your code if needed.
The results section should contain numerical or graphical summaries of your results as they pertain to the goal of each study.
The discussion section should contain discussion of your results. Potential topics for discussion are suggested at the end of each simulation study section, but they are not meant to be an exhaustive list. These simulation studies are meant to be explorations into the principles of statistical modeling, so do not limit your responses to short, closed-form answers as you do in homework assignments. Use the potential discussion questions as a starting point for your response.
- Your resulting .html file will be considered a self-contained “report,” which is the material that will determine the majority of your grade. Be sure to visibly include all R code and output that is relevant. (You should not include irrelevant code you tried that resulted in error or did not answer the question correctly.)
- Grading will be based on a combination of completing the required tasks, discussion of results, R usage, RMarkdown usage, and neatness and organization. For full details see the provided rubric.
- At the beginning of each of the three simulation studies, set a seed equal to your birthday, as is done on homework. (It should be the first code run for each study.) These should be the only three times you set a seed.
birthday = 18760613
set.seed(birthday)
Simulation Study 1: Significance of Regression
In this simulation study we will investigate the significance of regression test. We will simulate from two different models:
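- The “significant” model
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$
where $\epsilon_i \sim N(0, \sigma^2)$ and
- $\beta_0 = 3$,
- $\beta_1 = 1$,
- $\beta_2 = 1$,
- $\beta_3 = 1$.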
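- The “non-significant” model
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$
where $\epsilon_i \sim N(0, \sigma^2)$ and
- $\beta_0 = 3$,
- $\beta_1 = 0$,
- $\beta_2 = 0$,
- $\beta_3 = 0$.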
For both, we will consider a sample size of $25$ and three possible levels of noise. That is, three values of $\sigma$.
- $n = 25$
- $\sigma \in (1, 5, 10)$
Use simulation to obtain an empirical distribution for each of the following values, for each of the three values of $\sigma$, for both models.
- The $F$ statistic for the significance of regression test.
- The p-value for the significance of regression test.
- $R^2$
For each model and $\sigma$ combination, use $2000$ simulations. For each simulation, fit a regression model of the same form used to perform the simulation.
Use the data found in study_1.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.
Done correctly, you will have simulated the y vector $2 (\text{models}) \times 3 (\text{sigmas}) \times 2000 (\text{sims}) = 12000$ times.
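One repetition of the procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: the predictor values are random placeholders standing in for study_1.csv, the column names x1, x2, and x3 are assumed, and the helper name sim_once is hypothetical.

```r
# Placeholder predictors standing in for study_1.csv
# (assumption: the real file has columns x1, x2, x3).
set.seed(18760613)
n = 25
study_1 = data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))

# Hypothetical helper: simulate y once, fit the model, and return
# the F statistic, its p-value, and R^2.
sim_once = function(data, beta, sigma) {
  n = nrow(data)
  data$y = beta[1] + beta[2] * data$x1 + beta[3] * data$x2 +
    beta[4] * data$x3 + rnorm(n, mean = 0, sd = sigma)
  fit = summary(lm(y ~ x1 + x2 + x3, data = data))
  f = fit$fstatistic  # named vector: value, numdf, dendf
  c(f_stat = unname(f["value"]),
    p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    r_squared = fit$r.squared)
}

# A small number of repetitions for the "significant" model with sigma = 1.
results = replicate(100, sim_once(study_1, beta = c(3, 1, 1, 1), sigma = 1))
dim(results)  # one row per quantity, one column per simulation
```

Storing the three quantities as rows of one matrix keeps each empirical distribution easy to extract, for example results["r_squared", ].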
Potential discussions:
- Do we know the true distribution of any of these values?
- How do the empirical distributions from the simulations compare to the true distributions? (You could consider adding a curve for the true distributions if you know them.)
- How is each of the $F$ statistic, the p-value, and $R^2$ related to $\sigma$? Are any of those relationships the same for the significant and non-significant models?
Additional things to consider:
- Organize the plots in a grid for easy comparison.
Simulation Study 2: Using RMSE for Selection?
In homework we saw how Test RMSE can be used to select the “best” model. In this simulation study we will investigate how well this procedure works. Since splitting the data is random, we don’t expect it to work correctly each time. We could get unlucky. But averaged over many attempts, we should expect it to select the appropriate model.
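We will simulate from the model
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \epsilon_i$
where $\epsilon_i \sim N(0, \sigma^2)$ and
- $\beta_0 = 0$,
- $\beta_1 = 3$,
- $\beta_2 = -4$,
- $\beta_3 = 1.6$,
- $\beta_4 = -1.1$,
- $\beta_5 = 0.7$,
- $\beta_6 = 0.5$.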
We will consider a sample size of $500$ and three possible levels of noise. That is, three values of $\sigma$.
- $n = 500$
- $\sigma \in (1, 2, 4)$
Use the data found in study_2.csv for the values of the predictors. These should be kept constant for the entirety of this study. The y values in this data are a blank placeholder.
Each time you simulate the data, randomly split the data into train and test sets of equal sizes (250 observations for training, 250 observations for testing).
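A random split into equal halves might be sketched as below. The data frame here is a random placeholder standing in for study_2.csv with a simulated y, and the names sim_data, trn_idx, trn_data, and tst_data are hypothetical.

```r
set.seed(18760613)
n = 500

# Placeholder data standing in for study_2.csv after y has been simulated
# (assumption: the real file has more predictor columns).
sim_data = data.frame(x1 = rnorm(n), y = rnorm(n))

# Draw 250 row indices without replacement for training;
# the remaining 250 rows form the test set.
trn_idx = sample(nrow(sim_data), size = 250)
trn_data = sim_data[trn_idx, ]
tst_data = sim_data[-trn_idx, ]
```

Because sample() draws without replacement by default, no observation appears in both sets.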
For each, fit nine models, with forms:
y ~ x1
y ~ x1 + x2
y ~ x1 + x2 + x3
y ~ x1 + x2 + x3 + x4
y ~ x1 + x2 + x3 + x4 + x5
y ~ x1 + x2 + x3 + x4 + x5 + x6, the correct form of the model as noted above
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
For each model, calculate Train and Test RMSE.
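As a rough sketch, RMSE is the square root of the mean squared difference between observed and predicted values, computed on whichever data set is supplied. The helper name rmse and the toy data below are assumptions made for illustration only.

```r
# Hypothetical helper: root mean squared error of a fitted model on a data set.
rmse = function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data)) ^ 2))
}

set.seed(18760613)
# Toy data (assumption): two predictors, of which only x1 affects y.
toy = data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y = 3 * toy$x1 + rnorm(100)
trn_idx = sample(nrow(toy), size = 50)
trn = toy[trn_idx, ]
tst = toy[-trn_idx, ]

# Two nested models fit to the training data only.
fit_1 = lm(y ~ x1, data = trn)
fit_2 = lm(y ~ x1 + x2, data = trn)

train_rmse = c(rmse(fit_1, trn), rmse(fit_2, trn))
test_rmse = c(rmse(fit_1, tst), rmse(fit_2, tst))
```

For nested least squares models, Train RMSE cannot increase when a predictor is added, whereas Test RMSE can move in either direction.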
Repeat this process with $1000$ simulations for each of the $3$ values of $\sigma$. For each value of $\sigma$, create a plot that shows how average Train RMSE and average Test RMSE change as a function of model size. Also show the number of times the model of each size was chosen for each value of $\sigma$.
Done correctly, you will have simulated the y vector $3 \times 1000 = 3000$ times. You will have fit $9 \times 3 \times 1000 = 27000$ models. A minimal result would use $3$ plots. Additional plots may also be useful.
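Counting how often each model size is chosen could look something like the sketch below. The Test RMSE values are made-up numbers for illustration only, not results from this study, and the names test_rmse, selected, and selection_counts are hypothetical.

```r
# Made-up Test RMSE values for illustration only:
# one row per simulation, one column per model size (1 through 9).
test_rmse = rbind(
  c(2.1, 1.4, 1.2, 1.1, 1.05, 1.00, 1.01, 1.02, 1.03),
  c(2.0, 1.5, 1.3, 1.1, 1.02, 1.03, 1.01, 1.04, 1.05),
  c(2.2, 1.4, 1.2, 1.2, 1.10, 0.99, 1.00, 1.01, 1.02)
)

# Size of the model with the smallest Test RMSE in each simulation.
selected = apply(test_rmse, 1, which.min)

# Count selections for every size from 1 to 9, including sizes never chosen.
selection_counts = table(factor(selected, levels = 1:9))
selection_counts
```

Converting to a factor with explicit levels keeps sizes that were never selected in the table with a count of zero, which makes the counts easier to plot side by side across noise levels.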
Potential discussions:
- Does the method always select the correct model? On average, does it select the correct model?
- How does the level of noise affect the results?
Simulation Study 3: Power
In this simulation study we will investigate the power of the significance of regression test for simple linear regression.
Recall, we had defined the significance level, $\alpha$, to be the probability of a Type I error.
Similarly, the probability of a Type II error is often denoted using $\beta$; however, this should not be confused with a regression parameter.
Power is the probability of rejecting the null hypothesis when the null is not true, that is, the alternative is true and $\beta_1$ is non-zero.
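The three definitions above can be written compactly as follows, where the notation $H_0$ for the null hypothesis and $H_1$ for the alternative is assumed here:

```latex
\alpha = P[\text{Reject } H_0 \mid H_0 \text{ True}], \qquad
\beta = P[\text{Fail to Reject } H_0 \mid H_1 \text{ True}], \qquad
\text{Power} = 1 - \beta = P[\text{Reject } H_0 \mid H_1 \text{ True}]
```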
Essentially, power is the probability that a signal of a particular strength will be detected. Many things affect the power of a test. In this case, some of those are:
- Sample Size, $n$
- Signal Strength, $\beta_1$
- Noise Level, $\sigma$
- Significance Level, $\alpha$
We’ll investigate the first three.
To do so we will simulate from the model
$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
where $\epsilon_i \sim N(0, \sigma^2)$.
For simplicity, we will let $\beta_0 = 0$, thus $\beta_1$ is essentially controlling the amount of “signal.” We will then consider different signals, noises, and sample sizes:
- $\beta_1 \in (-2, -1.9, -1.8, \ldots, -0.1, 0, 0.1, 0.2, 0.3, \ldots, 1.9, 2)$
- $\sigma \in (1, 2, 4)$
- $n \in (10, 20, 30)$
We will hold the significance level constant at $\alpha = 0.05$.
Use the following code to generate the predictor values, x, for each of the different sample sizes.
x_values = seq(0, 5, length = n)
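For each possible $\beta_1$ and $\sigma$ combination, simulate from the true model at least $1000$ times. Each time, perform the significance of the regression test. To estimate the power with these simulations, and some $\alpha$, use
$\widehat{\text{Power}} = \hat{P}[\text{Reject } H_0 \mid H_1 \text{ True}] = \frac{\text{\# Tests Rejected}}{\text{\# Simulations}}$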
It is possible to derive an expression for power mathematically, but often this is difficult, so instead, we rely on simulation.
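A simulation-based estimate for a single combination of $n$, $\beta_1$, and $\sigma$ might be sketched as below. The helper name estimate_power and its arguments are hypothetical, and this is one possible arrangement rather than a required approach.

```r
set.seed(18760613)

# Hypothetical helper: estimate power for one (n, beta_1, sigma) combination
# as the proportion of simulated tests that reject at level alpha.
estimate_power = function(n, beta_1, sigma, alpha = 0.05, num_sims = 1000) {
  x_values = seq(0, 5, length = n)
  rejections = replicate(num_sims, {
    y = 0 + beta_1 * x_values + rnorm(n, mean = 0, sd = sigma)
    fit = lm(y ~ x_values)
    # In simple linear regression, the t test for the slope is equivalent
    # to the significance of regression F test.
    p_value = summary(fit)$coefficients["x_values", "Pr(>|t|)"]
    p_value < alpha
  })
  mean(rejections)
}

power_null = estimate_power(n = 10, beta_1 = 0, sigma = 1)
power_strong = estimate_power(n = 30, beta_1 = 2, sigma = 1)
```

When the slope is truly zero, the rejection proportion estimates the Type I error rate and should land near the chosen significance level; with a strong signal and low noise it should be close to 1.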
Create three plots, one for each value of $\sigma$. Within each of these plots, add a “power curve” for each value of $n$ that shows how power is affected by signal strength, $\beta_1$.
Potential discussions:
- How do $n$, $\beta_1$, and $\sigma$ affect power? Consider additional plots to demonstrate these effects.
- Are $1000$ simulations sufficient?
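One way to start on the last question is the standard error of an estimated proportion, $\sqrt{p(1 - p) / N}$ for $N$ independent simulations with true power $p$. The sketch below evaluates it at a few power values; the helper name power_se is hypothetical.

```r
# Standard error of a proportion estimated from num_sims independent simulations.
power_se = function(power, num_sims) {
  sqrt(power * (1 - power) / num_sims)
}

# The standard error is largest when the true power is 0.5.
se_values = power_se(power = c(0.05, 0.5, 0.95), num_sims = 1000)
se_values
```

With 1000 simulations the standard error is at most about 0.016, so whether that is sufficient depends on how smooth the power curves need to look.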