Description
Exercise 1 (Subsetting and Statistics)
For this exercise, we will use the msleep dataset from the ggplot2 package.
(a) Install and load the ggplot2 package. Do not include the installation command in your .Rmd file. (If
you do it will install the package every time you knit your file.) Do include the command to load the package
into your environment.
(b) Note that this dataset is technically a tibble, not a data frame. How many observations are in this
dataset? How many variables? What are the observations in this dataset?
(c) What is the mean hours of REM sleep of individuals in this dataset?
(d) What is the standard deviation of brain weight of individuals in this dataset?
(e) Which observation (provide the name) in this dataset gets the most REM sleep?
(f) What is the average bodyweight of carnivores in this dataset?
Exercise 2 (Plotting)
For this exercise, we will use the birthwt dataset from the MASS package.
(a) Note that this dataset is a data frame and all of the variables are numeric. How many observations are
in this dataset? How many variables? What are the observations in this dataset?
(b) Create a scatter plot of birth weight (y-axis) vs mother’s weight before pregnancy (x-axis). Use a
non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.)
Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
(c) Create a scatter plot of birth weight (y-axis) vs mother’s age (x-axis). Use a non-default color for the
points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot,
does there seem to be a relationship between the two variables? Briefly explain.
(d) Create side-by-side boxplots for birth weight grouped by smoking status. Use non-default colors for the
plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does
there seem to be a difference in birth weight for mothers who smoked? Briefly explain.
1
Exercise 3 (Importing Data, More Plotting)
For this exercise we will use the data stored in nutrition-2018.csv. It contains the nutritional values per
serving size for a large variety of foods as calculated by the USDA in 2018. It is a cleaned version totaling
5956 observations and is current as of April 2018.
The variables in the dataset are:
• ID
• Desc – short description of food
• Water – in grams
• Calories – in kcal
• Protein – in grams
• Fat – in grams
• Carbs – carbohydrates, in grams
• Fiber – in grams
• Sugar – in grams
• Calcium – in milligrams
• Potassium – in milligrams
• Sodium – in milligrams
• VitaminC – vitamin C, in milligrams
• Chol – cholesterol, in milligrams
• Portion – description of standard serving size used in analysis
(a) Create a histogram of Calories. Do not modify R’s default bin selection. Make the plot presentable.
Describe the shape of the histogram. Do you notice anything unusual?
(b) Create a scatter plot of calories (y-axis) vs protein (x-axis). Make the plot presentable. Do you notice
any trends? Do you think that knowing only the protein content of a food, you could make a good prediction
of the calories in the food?
(c) Create a scatter plot of Calories (y-axis) vs 4 * Protein + 4 * Carbs + 9 * Fat (x-axis). Make the
plot presentable. You will either need to add a new variable to the data frame, or use the I() function in
your formula in the call to plot(). If you are at all familiar with nutrition, you may realize that this formula
calculates the calorie count based on the protein, carbohydrate, and fat values. You’d expect then that the
result here is a straight line. Is it? If not, can you think of any reasons why it is not?
Exercise 4 (Writing and Using Functions)
For each of the following parts, use the following vectors:
a = 1:10
b = 10:1
c = rep(1, times = 10)
d = 2 ^ (1:10)
(a) Write a function called sum_of_squares.
• Arguments:
– A vector of numeric data x
• Output:
– The sum of the squares of the elements of the vector Pn
i=1 x
2
i
Provide your function, as well as the result of running the following code:
2
sum_of_squares(x = a)
sum_of_squares(x = c(c, d))
(b) Using only your function sum_of_squares(), mean(), sqrt(), and basic math operations such as + and
-, calculate
vuut
1
n
Xn
i=1
(xi − 0)2
where the x vector is d.
(c) Using only your function sum_of_squares(), mean(), sqrt(), and basic math operations such as + and
-, calculate
vuut
1
n
Xn
i=1
(xi − yi)
2
where the x vector is a and the y vector is b.
Exercise 5 (More Writing and Using Functions)
For each of the following parts, use the following vectors:
set.seed(42)
x = 1:100
y = rnorm(1000)
z = runif(150, min = 0, max = 1)
(a) Write a function called list_extreme_values.
• Arguments:
– A vector of numeric data x
– A positive constant, k, with a default value of 2
• Output:
– A list with two elements:
∗ small, a vector of elements of x that are k sample standard deviations less than the sample
mean. That is, the observations that are smaller than x¯ − k · s.
∗ large, a vector of elements of x that are k sample standard deviations greater than the sample
mean. That is, the observations that are larger than x¯ + k · s.
Provide your function, as well as the result of running the following code:
list_extreme_values(x = x, k = 1)
list_extreme_values(x = y, k = 3)
list_extreme_values(x = y, k = 2)
list_extreme_values(x = z, k = 1.5)
(b) Using only your function list_extreme_values(), mean(), and basic list operations, calculate the mean
of observations that are greater than 1.5 standard deviation above the mean in the vector y.
3