STA4102 Assignment 2

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (4 votes)

Question 1 (10 points)
Data: “Ames Iowa Housing Prices Dataset” (https://www.kaggle.com/datasets/emurphy/ames-iowahousing-prices-dataset)
a) Use multiple linear regression to fit the ‘SalePrice’ column using data in the file ‘train1.csv’. Select
up to 10 variables for your model. The choice is not key at this stage but a reasonable choice of a subset
it needed with a brief selection exploration. Report the RMSE upon using the ‘test1.csv’ dataset.
b) Fit a decision tree to the set of independent variables chosen above using ‘train1.csv’. Report the
RMSE from using ‘test1.csv’. ( library(Metrics); rmseDt = rmse(actual=testpricestarget,
predicted=predictedprices )
c) Fit a Random Forest to the set of independent variables chosen above using ‘train1.csv’. Report the
RMSE from using ‘test1.csv’.
d) Fit an XGBoost model to the set of independent variables chosen above using ‘train1.csv’. Report
the RMSE from using ‘test1.csv’.
e) Which parameters of XGBoost can you tune which change the RMSE from your experience?
f) Comment on the quality of the fits produced in each case? What can you conclude from this
predictive task about the nature of the sales price prediction?
Question 2 (10 points)
Data: “Pokemon for Data Mining and Machine Learning”
(https://www.kaggle.com/datasets/alopez247/pokemon)
a) Randomly select 70% of the rows as training data and the remaining rows as testing data.
b) Fit to the training data a decision tree, random forest and XGBoost model where the dependent
variable is the ‘Type_1’ column.
c) Report on the accuracy for predicting ‘Type_1’ on the testing rows for each model.
d) Produce a confusion matrix for each model and then comment on the model performances. Discuss
in terms of the confusion matrix and accuracy.