Homework 2: SDGB 7840

1. Read the posted article, “Bordeaux wine vintage quality and weather,” by Ashenfelter,
Ashmore, and LaLonde (CHANCE, 1995). Three regression models are considered in
this article. Answer the following questions:
(a) What is a wine “vintage”?
(b) What is the response variable for the three models described in this paper?
Now, download the data in “wine.txt”. This is some of the data the authors used
to fit their models. The columns are: vintage (VINT), log of average vintage price
relative to 1961 (LPRICE2), rainfall in the months preceding the vintage in mL
(WRAIN), average temperature over the growing season in °C (DEGREES), rainfall in September and August in mL (HRAIN), and age of wine in years (TIME_SV).
Note: the average temperature in September is not available in our data set so we
cannot fit the third regression model from the paper.
(c) Which values of LPRICE2 are missing and, according to the article, why have they
been omitted?
(d) Make a scatterplot matrix of the variables (explanatory and response) included in
the models. Describe what you see.
(e) Fit the two regression models from the paper. Which is the best regression model?
Justify your answer and include relevant output (let α = 0.05). Did you choose the
same model as the authors?
(f) What is the sample size for your models?
(g) Write out the regression equation of the model you chose in part (e). Remember to
include the units of measurement. Interpret the partial slopes and the y-intercept.
Does the y-intercept have a practical interpretation?
(h) Make a table with the following statistics for both models: SSE, RMSE, PRESS,
and RMSE_jackknife. Compare the relevant statistics. Based on this information,
would you change your answer to part (e)? Justify your answers.
(i) Could we use these regression models to predict quality for wines produced in 2005?
Justify your answer.
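Note on part (h): PRESS sums the squared prediction errors obtained when each observation is left out of the fit in turn. The Python sketch below illustrates this for a one-predictor least-squares line, computing PRESS both by explicit refitting and by the leverage shortcut e_i / (1 - h_ii). It rests on several assumptions: the data are made up for illustration and are not from “wine.txt”, the function names are hypothetical, and RMSE_jackknife is taken to be sqrt(PRESS / n), which may differ from the divisor used in class.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def press_by_refitting(x, y):
    """PRESS by brute force: drop observation i, refit, predict observation i."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


def press_by_leverage(x, y):
    """PRESS from a single fit, using deleted residuals e_i / (1 - h_ii)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage in simple regression
        e = yi - (b0 + b1 * xi)               # ordinary residual
        total += (e / (1.0 - h)) ** 2
    return total


# Made-up illustrative data (NOT from wine.txt).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

b0, b1 = fit_line(x, y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
press = press_by_refitting(x, y)
rmse_jackknife = math.sqrt(press / len(x))  # assumed divisor: n

print("SSE:", sse)
print("PRESS:", press)
print("RMSE_jackknife (assumed sqrt(PRESS / n)):", rmse_jackknife)
```

Because every leverage lies strictly between 0 and 1 here, each deleted residual is at least as large in magnitude as the ordinary residual, so PRESS is never smaller than SSE. A large gap between the two suggests the fit depends heavily on individual observations.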
2. We will model the prestige level of occupations using variables such as education and
income levels. This data was collected in 1971 by Statistics Canada (the Canadian
equivalent of the U.S. Census Bureau or the National Bureau of Statistics of China).[1]
The data is in the file “prestige.dat” and the variables are described below:
variable       description
prestige (y)   Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s
education      average education of occupational incumbents, in years, in 1971
income         average income of incumbents, in dollars, in 1971
women          percentage of incumbents who are women
census         Canadian Census occupational code
type           type of occupation: “bc” = blue collar, “prof” = professional/managerial/technical, “wc” = white collar
(a) Do some internet research and write a short paragraph in your own words about
how the Pineo-Porter prestige score is computed. Include the reference(s) you used.
Do you think this score is a reliable measure? Justify your answer.
(b) Create a scatterplot matrix of all the quantitative variables. Use a different symbol
for each profession type: no type (pch=3), “bc” (pch=6), “prof” (pch=8), and “wc”
(pch=0) when making your plot. For the remainder of this question, we will use the
explanatory variables: income, education, and type. Does restricting our regression
to only these variables make sense given your exploratory analysis? Justify your
answer.
[1] Source: Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada; 19-1–19-21.
(c) Which professions are missing “type”? Since the other variables for these observations are available, we could group them together as a fourth professional category
to include them in the analysis. Is this advisable or should we remove them from
our data set? Justify your answer.
(d) Visually, does there seem to be an interaction between type and education and/or
type and income? Justify your answer.
(e) Fit a model to predict prestige using: income, education, type, and any interaction
terms based on your answer to part (d). Evaluate the model and include relevant
output. Use your answer to part (c) to determine which observations to use in your
analysis.
(f) Create a histogram of income and a second histogram of log(income) (i.e., natural
logarithm). How does the distribution change?
(g) Fit the model in (e) but this time use log(income) (i.e., natural logarithm) instead
of income. Evaluate the model and provide the relevant output.
(h) Is the model in (e) or (g) better? Justify your answer. Why can’t we use a partial
F-test here?
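Note on parts (f) and (g): the natural logarithm compresses large values more than small ones, which tends to pull in a long right tail. The Python sketch below illustrates this by comparing a moment-based skewness coefficient before and after the transformation. The income values are hypothetical and are not taken from “prestige.dat”, and the skewness helper is an illustrative function rather than anything prescribed by the assignment.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed incomes in dollars (NOT from prestige.dat).
incomes = [3000, 4000, 4500, 5000, 6000, 7000, 8000, 12000, 25000]
log_incomes = [math.log(v) for v in incomes]  # natural logarithm

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

print("skewness of income:", raw_skew)
print("skewness of log(income):", log_skew)
```

For these made-up values the skewness coefficient is smaller after taking logs, the numerical counterpart of a histogram that looks more symmetric.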