CptS 475/575 Assignment 2: R basics and Exploratory Data Analysis

1. This exercise relates to the College data set, which can be found in the file College.csv on the
course’s public webpage (https://scads.eecs.wsu.edu/index.php/datasets/). The dataset contains a
number of variables for 777 different universities and colleges in the US. The variables are:
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
Before reading the data into R or Python, you can view it in Excel or a text editor. For each of
the following questions, include the code you used to complete the task as your response, along
with any plots or numeric outputs produced. You may omit outputs that are not relevant (such as
dataframe contents), but still include all of your code.
(a) Use the read.csv() function to read the data into R, or the csv library to read in the
data with Python. In R you will load the data into a dataframe. In Python you may store it as a list
of lists or use a pandas DataFrame. Call the loaded data college. Ensure that your column
headers are not treated as a row of data.
(b) Find the median cost of books for all schools in this dataset.
(c) Produce a scatterplot that shows a relationship between two features of your choice in
the dataset. Ensure it has appropriate axis labels and a title.
(d) Produce a histogram showing the overall enrollment numbers (P.Undergrad plus
F.Undergrad) for both public and private (Private) schools. Ensure it has appropriate axis labels
and a title.
(e) Create a new qualitative variable, called Top, by binning the Top25perc variable into
two categories. Specifically, divide the schools into two groups based on whether or not the
proportion of students coming from the top 25% of their high school classes exceeds 50%.
Now produce side-by-side boxplots of acceptance rate (based on Accept and Apps) with respect
to the two Top categories (Yes and No). How many top universities are there?
(f) Continue exploring the data, producing two or more new plots of any type, and
provide a brief summary of your hypotheses and what you discover. You may use additional
plots or numerical descriptors as needed. Feel free to think outside the box on this one, but if you
want something to point you in the right direction, look at the summary statistics for various
features, and think about what they tell you. Perhaps try plotting various features from the
dataset against each other and see if any patterns emerge.
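The non-plotting steps in parts (a), (b), (d) and (e) can be sketched in Python with only the standard library. This is a minimal illustration, not a reference solution. The three rows in SAMPLE_CSV are made up and merely stand in for College.csv, the Name header for the first column is an assumption, and the helper names load_college and add_derived_columns are hypothetical.

```python
import csv
import io
import statistics

# Made-up rows standing in for College.csv; the values are illustrative only,
# and the "Name" header for the first column is an assumption.
SAMPLE_CSV = """Name,Private,Apps,Accept,Top25perc,F.Undergrad,P.Undergrad,Books
College A,Yes,1000,800,60,2000,100,450
College B,No,5000,3000,40,9000,1500,500
College C,Yes,2000,1000,75,3000,200,600
"""


def load_college(handle):
    """Read rows as dicts; DictReader uses the first line as the header,
    so the column names never appear as a row of data."""
    return list(csv.DictReader(handle))


def add_derived_columns(rows):
    """Add total enrollment, acceptance rate, and the two-level Top variable."""
    for row in rows:
        # Overall enrollment: full-time plus part-time undergraduates.
        row["TotalUndergrad"] = int(row["F.Undergrad"]) + int(row["P.Undergrad"])
        # Acceptance rate: accepted applicants divided by applications received.
        row["AcceptRate"] = int(row["Accept"]) / int(row["Apps"])
        # Top is "Yes" when more than 50% of students come from the top 25%.
        row["Top"] = "Yes" if float(row["Top25perc"]) > 50 else "No"
    return rows


college = add_derived_columns(load_college(io.StringIO(SAMPLE_CSV)))

# Median book cost across all schools in the (sample) data.
median_books = statistics.median(float(r["Books"]) for r in college)

# Number of schools in the "Yes" category of Top.
top_count = sum(1 for r in college if r["Top"] == "Yes")

print(median_books, top_count)
```

With the real file, io.StringIO(SAMPLE_CSV) would be replaced by an opened file handle, for example open("College.csv", newline=""). The derived columns can then be passed to whichever plotting library you use for the histogram and boxplots.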
2. This exercise involves the Auto.csv data set found on the course website. Make sure that the
missing values have been removed from the data.
(a) Specify which of the predictors are quantitative and which are qualitative. Keep in
mind that a qualitative variable may be represented as a quantitative type in the dataset, or the
reverse. You may wish to adjust the types of your variables based on your findings.
(b) What are the range, mean, and standard deviation of each quantitative predictor?
(c) Now remove the 45th through 85th (inclusive) observations from the dataset. What are
the range, mean, and standard deviation of each predictor in the subset of the data that remains?
(d) Using the full data set, investigate the predictors graphically, using scatterplots,
correlation scores or other tools of your choice. Create some plots highlighting the relationships
you find among the predictors. Explain briefly what the relationships between variables are, and
what they mean.
(e) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables.
Which, if any, of the other variables might be useful in predicting mpg? Justify your answer.
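The summary statistics in parts (b) and (c), and the row removal in part (c), can be sketched in Python as follows. This is a hedged illustration on made-up numbers, not a worked answer. The helper names drop_missing, summarize and drop_observations are hypothetical, and the "?" marker for missing values is an assumption about how Auto.csv encodes them, so check the file before relying on it.

```python
import statistics


def drop_missing(rows, marker="?"):
    """Keep only rows in which no field equals the missing-value marker.
    The "?" marker is an assumption; inspect the file to confirm it."""
    return [row for row in rows if marker not in row.values()]


def summarize(values):
    """Return the range (min, max), mean, and sample standard deviation."""
    values = list(values)
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
    }


def drop_observations(rows, first, last):
    """Remove the first-th through last-th rows, 1-indexed and inclusive."""
    return rows[: first - 1] + rows[last:]


# Made-up records: one has a missing horsepower value.
raw_rows = [
    {"mpg": "18.0", "horsepower": "130"},
    {"mpg": "25.0", "horsepower": "?"},
    {"mpg": "22.0", "horsepower": "95"},
]
clean_rows = drop_missing(raw_rows)

# Made-up column of 100 values, used only to show the index arithmetic.
values = [float(i) for i in range(1, 101)]
remaining = drop_observations(values, 45, 85)

print(len(clean_rows), len(remaining), summarize([1.0, 2.0, 3.0, 4.0]))
```

Removing observations 45 through 85 inclusive drops 41 rows, which is the usual off-by-one check for this step. Note that statistics.stdev uses the n - 1 denominator, which matches the default of sd() in R.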