Solved Problem Set 1 SOC-GA 2332 Intro to Stats

1 Functions

Recall the formulas for population mean:
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1}$$
and variance:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \tag{2}$$
where N is the population size.

1. Write a function in R that calculates the population mean according to Equation 1 without using
any R functions that directly calculate the mean. For example, you cannot use mean() from base
R, or summarise(., mean = mean()) from tidyverse.
• Name your function pop_mean.
• The function should take a numeric vector as its input.
• The function should return a numeric variable that is the population mean calculated based
on the vector input.

2. Write a function in R that calculates the population variance according to Equation 2 without
using any R functions that directly calculate the variance. For example, you cannot use var() from
base R, or summarise(., var = var()) from tidyverse.
• Name your function pop_var.
• The function should take a numeric vector as its input.
• The function should return a numeric variable that is the population variance calculated based
on the vector input.
• You can use the pop_mean() function you just created in your pop_var() function.

3. Import gapminder.csv to your R environment.
• Apply the two functions you just created to the lifeExp variable in gapminder.
• Apply R functions that directly calculate the mean and variance to the same lifeExp vector.
• Report the results of the two steps above either in text or in a table. The results for the
mean should be equal, but the results for the variance should differ. Find out and explain
why the variance results differ.
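
One possible shape for the two functions in items 1 and 2 is sketched below. It is only a sketch: it relies on sum() and length(), which do not compute a mean or variance directly, and it assumes the input is a non-empty numeric vector with no NAs.

```r
# Population mean per Equation 1: sum of the values divided by N.
pop_mean <- function(x) {
  sum(x) / length(x)
}

# Population variance per Equation 2: mean squared deviation from mu,
# with N (not N - 1) in the denominator.
pop_var <- function(x) {
  mu <- pop_mean(x)
  sum((x - mu)^2) / length(x)
}

pop_mean(c(1, 2, 3, 4))  # 2.5
pop_var(c(1, 2, 3, 4))   # 1.25
```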

Note: For this exercise, we will assume that there are no missing values (i.e. no NAs) in the vector, so you
don’t need to consider how to deal with NA values. Hint: You can format tables using kbl() from the
kableExtra package.
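
The comparison asked for in item 3 can be illustrated on a small vector. The values below are made up for illustration and are not taken from gapminder.csv. Base R's var() divides by n - 1 (the sample variance), whereas Equation 2 divides by N, so the two results differ by a factor of (n - 1)/n.

```r
# Made-up values for illustration only (not gapminder data).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# By hand, following Equations 1 and 2.
mu <- sum(x) / n
pop_variance <- sum((x - mu)^2) / n

# Built-in functions: var() uses the n - 1 denominator.
sample_variance <- var(x)

# Rescaling the built-in result recovers the population variance.
rescaled <- sample_variance * (n - 1) / n

results <- data.frame(
  statistic = c("mean", "variance"),
  by_hand   = c(mu, pop_variance),
  built_in  = c(mean(x), sample_variance)
)
print(results)
```

With a large vector the factor (n - 1)/n is close to 1, so the two variances will be close but not identical. A data frame like results could be passed to kbl() for display.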

2 Data transformation using tidyverse
Import parent_inc.csv to your R environment. The data frame looks like this:
famid  father_name  mother_name  father_income  mother_income
1      Arthur       Jess         42000          45000
2      Harry        Pam          35000          24000
3      Matt         Mary         78000          55000

Use tidyverse functions and the piping syntax to transform the data frame to the following structure:
famid  type    name    income
1      father  Arthur  42000
1      mother  Jess    45000
2      father  Harry   35000
2      mother  Pam     24000
3      father  Matt    78000
3      mother  Mary    55000

Make sure to document the steps you take in your code and display the tidied data frame in your PDF
document.

Hints:
• You can review how to use the pivoting functions here.
• You can use the str_remove() or str_extract() functions when mutating a new variable that extracts
part of the text from a string, for example extracting “father” from “father_name”.
• You can separate the original data frame into parts and then combine them if you cannot figure
out how to transform it altogether.
• You can format tables using kbl() from the kableExtra package.
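
One way the reshaping could go is sketched below. This is a sketch under assumptions: the data frame is typed in with tribble() rather than read from parent_inc.csv, and the column names are assumed to use underscores (father_name, mother_income, and so on). It uses pivot_longer() with the special ".value" name instead of the str_remove()/str_extract() route the hints mention, so it is an alternative, not necessarily the intended solution.

```r
library(tidyr)   # pivot_longer(); also re-exports the %>% pipe
library(tibble)  # tribble()

# Stand-in for read.csv("parent_inc.csv"); column names are assumed.
parent_inc <- tribble(
  ~famid, ~father_name, ~mother_name, ~father_income, ~mother_income,
  1,      "Arthur",     "Jess",       42000,          45000,
  2,      "Harry",      "Pam",        35000,          24000,
  3,      "Matt",       "Mary",       78000,          55000
)

tidy_inc <- parent_inc %>%
  # Split each column name at "_": the first piece ("father"/"mother")
  # becomes the `type` column; the second piece ("name"/"income") is
  # kept as an output column name via the ".value" sentinel.
  pivot_longer(
    cols      = -famid,
    names_to  = c("type", ".value"),
    names_sep = "_"
  )

print(tidy_inc)
```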

3 Population, sample, and sampling distribution
To make your code reproducible, use the set.seed() function whenever you are generating random
numbers or sampling randomly. Read the documentation of this function in R if you do not know how it
works.

1. Create a population data frame with 100,000 observations and one variable called “value” that
follows a normal distribution with population mean µ = 5 and population variance σ² = 1.
2. Create a histogram of the population with appropriate title and labels. Add a vertical line at the
population mean.

3. Draw a random sample from the population, with sample size n = 50.
4. Plot a histogram of the sample with appropriate title and labels. Add a vertical line at your point
estimate of the population mean. How does this histogram compare to the one you created in
question 2?

5. Based on your sample, report your point estimate of the population mean $\hat{\mu}$, the standard error of
this estimate, and its 95% confidence interval. Show the formulas you used for calculating these
statistics.

6. Simulate the sampling distribution of the sample mean (n = 50) using 1,000 draws. That is, repeat
the action you took in question 3 1,000 times and save the mean you get from each repetition to
a data object. Hint: Use a for loop.

7. Create a histogram of the sampling distribution of the sample mean you simulated in question 6
with appropriate title and labels. Add a vertical line at your point estimate of the population mean.

8. Using the sampling distribution you obtained in question 6, report your point estimate of the
population mean $\hat{\mu}$, the standard error of this estimate, and the 95% confidence interval of this
estimate. Show the definitions or formulas you used for calculating these statistics. Hint: The
standard error in this question should be worked out based on the properties of the sampling
distribution.

9. Repeat questions 3 to 8, increasing the size of your sample to n = 1,000. Plot and report your
results. Then, using the concepts that we learned in class, summarize the differences with respect
to what you obtained with a sample of 50. Hint: Which law or theorem that we learned in class is
being demonstrated here?
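
The workflow in questions 1 to 9 can be sketched roughly as follows. Several things here are assumptions rather than part of the assignment: the seed value, the helper names summarise_sample() and simulate_means(), the use of 1.96 as the normal critical value for the 95% interval, and base R hist() for plotting. For a single sample, the sketch uses SE = s / sqrt(n) and CI = estimate ± 1.96 × SE. For the simulated sampling distribution, it takes the standard deviation of the simulated means as the standard error.

```r
set.seed(2332)  # arbitrary seed, chosen only for reproducibility

# Question 1: population with mean 5 and variance 1 (so sd = 1).
population <- data.frame(value = rnorm(100000, mean = 5, sd = 1))

# Hypothetical helper: point estimate, SE, and 95% CI from one sample.
summarise_sample <- function(x) {
  n   <- length(x)
  est <- mean(x)
  se  <- sd(x) / sqrt(n)  # SE = s / sqrt(n)
  c(estimate = est, se = se,
    lower = est - 1.96 * se, upper = est + 1.96 * se)
}

# Hypothetical helper: draw `draws` samples of size n, keep each mean.
simulate_means <- function(pop, n, draws = 1000) {
  means <- numeric(draws)
  for (i in seq_len(draws)) {
    means[i] <- mean(sample(pop, size = n))
  }
  means
}

# Questions 3 and 5: one sample of size 50.
sample_50 <- sample(population$value, size = 50)
print(summarise_sample(sample_50))

# Questions 6 and 9: sampling distributions for n = 50 and n = 1,000.
means_50   <- simulate_means(population$value, n = 50)
means_1000 <- simulate_means(population$value, n = 1000)

# Question 8: the SE is the spread of the simulated means; compare it
# with the theoretical value sigma / sqrt(n), where sigma = 1.
print(c(simulated_50 = sd(means_50),     theory_50 = 1 / sqrt(50)))
print(c(simulated_1000 = sd(means_1000), theory_1000 = 1 / sqrt(1000)))

# Question 7: histogram with a vertical line at the point estimate.
hist(means_50, main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean")
abline(v = mean(means_50), lty = 2)
```

The simulated means should cluster more tightly around 5 for n = 1,000 than for n = 50, which is consistent with the standard error shrinking as sigma / sqrt(n).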