DATA 303 Assignment 2



Assignment Questions

Q1. (35 marks) In a 2015 article comparing the technological advancement of hybrid electric vehicles (HEVs)
in different market segments, Lim et al. collected data on prices and other features for 154 HEV
models (Lim et al. 2015. Technological Forecasting and Social Change, vol. 97, pages 140-153). We will use
regression analysis to explore the factors that influence price.

The dataset is in the file hybrid_reg.csv and
contains the following variables:
• carid: Vehicle ID
• vehicle: Make of vehicle
• year: Model year
• msrp: Manufacturer’s suggested retail price in 2013 (US dollars).
• accelrate: Acceleration rate in km/hour/second
• mpg: Fuel economy in miles/gallon
• mpgmpge: Max of mpg and mpge (mpge is miles per gallon equivalent for plug-in HEVs, which takes into
account the all-electric range, with $\text{mpge} = \frac{33.7 \times \text{driverange}}{\text{batterycapacity}}$).

• carclass: Model class. C = Compact, M = Midsize, TS = 2 Seater, L = Large, PT = Pickup Truck,
MV = Minivan, SUV = Sport Utility Vehicle
• carclass_id: Index representing model class
The variables carid and vehicle are vehicle identifiers and will not be used in the analysis. Likewise,
carclass_id will not be used, as it is a numerical form of the variable carclass and does not provide any
additional information.

a. (3 marks) Read the dataset into R and prepare the data for analysis by adding the two new variables
below to the dataset. Give the number of observations in each group of the new variable yr_group.
• yr_group: group year as follows “1997-2004”, “2005-2008”, “2009-2011”, “2012-2013”.
• msrp.1000: convert msrp from US$ to US$1000 by dividing msrp by 1000.
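
As a rough illustration of the two derived variables in part (a), the sketch below uses base R's cut() on a small made-up data frame. The object name hev_demo and every value in it are hypothetical; they are not taken from hybrid_reg.csv, which would normally be loaded with read.csv().

```r
# Illustrative data only: these rows are invented, not taken from hybrid_reg.csv.
hev_demo <- data.frame(
  year = c(1997, 2004, 2005, 2008, 2009, 2011, 2012, 2013),
  msrp = c(24500, 35300, 19000, 58500, 33100, 42600, 28000, 64000)
)

# cut() uses right-closed intervals by default:
# (1996, 2004], (2004, 2008], (2008, 2011], (2011, 2013].
hev_demo$yr_group <- cut(
  hev_demo$year,
  breaks = c(1996, 2004, 2008, 2011, 2013),
  labels = c("1997-2004", "2005-2008", "2009-2011", "2012-2013")
)

# Rescale price from US$ to US$1000.
hev_demo$msrp.1000 <- hev_demo$msrp / 1000

# Number of observations in each year group.
table(hev_demo$yr_group)
```

Setting the lowest break just below the first model year keeps that year inside the first interval.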

b. (3 marks) Use the ggplot2 package to plot msrp.1000 against each of the predictor variables, yr_group,
accelrate, mpg, mpgmpge and carclass. Are there strong indications of non-linear relationships with
any of the numerical predictors? If so, which ones?

c. (3 marks) Create pairwise scatterplots of the numerical predictors. Is there any indication of potential
multicollinearity among these predictors?
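
One possible way to produce pairwise scatterplots, sketched with base R's pairs() and the built-in mtcars data as a stand-in. The columns wt, hp and disp are placeholders for the assignment's numerical predictors, and the correlation matrix is an optional numerical companion to the plots.

```r
# Stand-in data: mtcars ships with R; its columns are NOT the HEV variables.
num_vars <- mtcars[, c("wt", "hp", "disp")]

# Scatterplot matrix of every pair of numerical predictors.
pairs(num_vars)

# Pairwise correlations, rounded for readability.
cor_mat <- cor(num_vars)
round(cor_mat, 2)
```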

d. (4 marks) Fit a linear model with all predictors (yr_group, accelrate, mpg, mpgmpge and carclass)
included in the model. Calculate the VIF statistic for the predictors. To check for evidence of
multicollinearity we will use a different threshold defined by
$$\mathrm{VIF}_{\text{model}} = \frac{1}{1 - R^2_{\text{model}}}$$
where $R^2_{\text{model}}$ is the $R^2$ value for the model that includes all predictors. Using this threshold identifies
predictors that have stronger relationships with other predictors than the response variable has.
This is a more stringent way of identifying multicollinearity. If $\mathrm{GVIF}^{1/(2 \times \mathrm{Df})} > \mathrm{VIF}_{\text{model}}$, then this is
evidence of severe multicollinearity. Calculate $\mathrm{VIF}_{\text{model}}$ for your fitted model. Is there evidence of
severe multicollinearity? Are you surprised by the result?
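
The threshold in part (d) can be computed directly from the fitted model's R-squared. The sketch below does this in base R on the built-in mtcars data, used purely as a stand-in, and computes an ordinary VIF for each numeric predictor by regressing it on the others. With factor predictors such as yr_group and carclass, a function like vif() from the car package is the usual route to the GVIF^(1/(2*Df)) column; that package is an assumption here and is not used in the sketch.

```r
# Stand-in model on mtcars; the predictors are placeholders, not the HEV variables.
predictors <- c("wt", "hp", "disp")
fit <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# Model-based threshold: 1 / (1 - R^2 of the full model).
r2_model <- summary(fit)$r.squared
vif_model <- 1 / (1 - r2_model)

# VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
vif <- sapply(predictors, function(p) {
  aux <- lm(reformulate(setdiff(predictors, p), response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

# Compare each VIF with the model-based threshold.
data.frame(VIF = vif, exceeds_threshold = vif > vif_model)
```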

e. (3 marks) Fit a generalised additive model to the data including all predictors, using a smooth spline
for each numerical predictor. Present the RSE, $R^2$ and adjusted $R^2$ values in a table.

f. (3 marks) Print the results for the significance of smooth terms in a table. Which of the numerical
predictors have a significant non-linear effect on msrp.1000? Justify your answer briefly.
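
A minimal sketch of the kind of fit and summary tables described in parts (e) and (f), assuming the mgcv package is the intended GAM implementation (the assignment does not name one). The model uses the built-in mtcars data as a stand-in, so the formula and variable names are placeholders. In mgcv's summary, r.sq is the adjusted R-squared; for a Gaussian model the deviance explained corresponds to the unadjusted R-squared.

```r
library(mgcv)  # assumed GAM package; not named in the assignment

# Stand-in model: smooth terms for numeric predictors, a factor entered as is.
gam_fit <- gam(mpg ~ s(wt) + s(hp) + factor(cyl), data = mtcars)
gam_sum <- summary(gam_fit)

# Fit statistics collected in one table.
fit_stats <- data.frame(
  RSE    = sqrt(gam_sum$scale),   # residual standard error
  R2     = gam_sum$dev.expl,      # deviance explained (Gaussian case)
  adj_R2 = gam_sum$r.sq           # adjusted R-squared as reported by mgcv
)
fit_stats

# Approximate significance of the smooth terms (edf, Ref.df, F, p-value).
gam_sum$s.table
```

An effective degrees of freedom (edf) close to 1 suggests the smooth is nearly linear, which is one piece of evidence to weigh alongside the p-values.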

g. (4 marks) Perform a diagnostic check of regression assumptions and adequacy of basis functions for
the model you fitted in part (e). What conclusions do you draw from your results? (Note: ensure your
diagnostic plots fit on a single page).

h. (4 marks) Calculate and print a table of AIC values for the model in part (e) (Model 1) and each of
the following models:
• Model 2: excludes mpg only from Model 1
• Model 3: excludes mpgmpge only from Model 1
• Model 4: excludes mpg and mpgmpge from Model 1
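
One way to tabulate an information criterion across nested models is sketched below. It uses lm() fits on the built-in mtcars data as stand-ins; AIC() is a generic function in R that also accepts other fitted model objects, and BIC() is called the same way. The model names mirror the list above, but the formulas are placeholders.

```r
# Stand-in models on mtcars: hp and disp play the role of the two
# predictors that are dropped in turn.
models <- list(
  "Model 1" = lm(mpg ~ wt + hp + disp, data = mtcars),
  "Model 2" = lm(mpg ~ wt + disp, data = mtcars),
  "Model 3" = lm(mpg ~ wt + hp, data = mtcars),
  "Model 4" = lm(mpg ~ wt, data = mtcars)
)

# One row per model; lower AIC indicates a better fit/complexity trade-off.
aic_table <- data.frame(
  model = names(models),
  AIC   = unname(sapply(models, AIC))
)
aic_table
```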

i. (3 marks) What do your results in part (h) indicate about whether both mpg and mpgmpge should be
included in the model? Explain your answer briefly. What regression pitfall does this point to?

j. (2 marks) Are you surprised by your conclusions in part (i) given your findings in part (d)? Explain
your answer briefly.

k. (3 marks) Calculate and print a table of BIC values for Models 1 to 4. Based on these results, which
model would you choose as your preferred model? Explain your answer briefly.

Q2. (5 marks) Suppose we have a data set with five predictors:
• X1 = GPA
• X2 = IQ
• X3 = Gender (0 = female, 1 = male)
• X4 = Interaction between GPA and IQ
• X5 = Interaction between GPA and Gender

The response variable, Y, is starting salary after graduation (in thousands of dollars). Suppose we get the
following regression coefficient estimates:
$$\hat{\beta}_0 = 5, \quad \hat{\beta}_1 = 8, \quad \hat{\beta}_2 = 0.2, \quad \hat{\beta}_3 = 10, \quad \hat{\beta}_4 = 0.05, \quad \hat{\beta}_5 = 2$$
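
For orientation, the coefficient estimates above refer to a linear model of the following general form. This is a sketch of the population model only, assuming the interaction terms are the usual products of their component predictors and that $\varepsilon$ denotes the error term (a symbol not used in the original).

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon,
\qquad X_4 = X_1 X_2, \quad X_5 = X_1 X_3
```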

a. (1 mark) Write down the estimated model equation in terms of $\hat{Y}$, X1, X2 and X3.
b. (3 marks) Which one of the following statements is correct and why? Show any working you do.
i. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA
is high enough.
ii. For a fixed value of IQ and GPA, females earn more on average than males.
iii. The difference in expected salary between males and females increases as GPA increases.
iv. An increase in IQ by one point is associated with a reduction in expected salary, provided GPA is
high enough.

c. (1 mark) True or False: Since the coefficient for the GPA:IQ interaction term is very small, there is
very little evidence of an interaction effect. Justify your answer.

Assignment total: 40 marks