Description
1 (50 points). This exercise relates to the Red Wine Quality data set (winequality-red.csv), which
can be found under the Datasets modules in Canvas. The dataset contains a number of
physicochemical test variables for 1599 different red wine variants of the Portuguese “Vinho
Verde” wine.
The variables are
• fixed_acidity
• volatile_acidity
• citric_acid
• residual_sugar
• chlorides
• free_sulfur_dioxide
• total_sulfur_dioxide
• density
• pH
• sulphates
• alcohol
• quality (output variable based on sensory data; score between 0 and 10)
Before reading the data into R or Python, you can view it in Excel or a text editor. For each of
the following questions, include the code you used to complete the task as your response, along
with any plots or numeric outputs produced. You may omit outputs that are not relevant (such as
dataframe contents), but still include all of your code.
(a, 6 points) Use the read.csv() function to read the data into R, or the csv library to read
the data into Python. In R, load the data into a data frame. In Python, you may store it
as a list of lists or in a pandas DataFrame. Call the loaded data redwine.
Ensure that the column headers are not treated as a row of data.
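For the Python route, a minimal sketch of header-aware loading with the standard csv library is given here. The helper name read_table, the inline sample values, and the comma delimiter are assumptions for illustration; if your copy of the file is semicolon-delimited, pass delimiter=";" instead.

```python
import csv
import io


def read_table(handle, delimiter=","):
    """Read CSV text from an open file handle into a list of dicts.

    csv.DictReader consumes the first line as column names, so the
    header never shows up as a row of data.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for record in reader:
        # Assumes every column is numeric, as in the variable list above.
        rows.append({name: float(value) for name, value in record.items()})
    return rows


# Tiny made-up sample standing in for winequality-red.csv (illustrative values only).
sample = io.StringIO(
    "fixed_acidity,alcohol,quality\n"
    "7.4,9.4,5\n"
    "7.8,9.8,5\n"
    "11.2,9.8,6\n"
)
redwine = read_table(sample)
print(len(redwine))   # number of data rows, header excluded
print(redwine[0])     # first data row as a dict keyed by column name

# With the real file (assumed to be in the working directory):
# with open("winequality-red.csv", newline="") as handle:
#     redwine = read_table(handle)
```

A quick check that the header was handled correctly is that the row count matches the number of wine samples rather than exceeding it by one.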
(b, 8 points) Find the mean quality of all the wine samples. Then find the median alcohol
level for all the wine samples.
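As a rough illustration of this step in Python, the statistics module covers both summaries. The rows below are made up and only mimic the shape of the loaded data; they are not values from the actual file.

```python
from statistics import mean, median

# Made-up rows in the same shape as the loaded data (illustrative values only).
redwine = [
    {"alcohol": 9.4, "quality": 5},
    {"alcohol": 9.8, "quality": 5},
    {"alcohol": 9.8, "quality": 6},
    {"alcohol": 10.0, "quality": 7},
]

# Pull one column at a time out of the list of row dicts.
mean_quality = mean(row["quality"] for row in redwine)
median_alcohol = median(row["alcohol"] for row in redwine)

print("mean quality:", mean_quality)
print("median alcohol:", median_alcohol)
```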
(c, 8 points) Produce a scatterplot that shows the relationship between wine density and
residual_sugar. Ensure it has appropriate axis labels and a title. Briefly state whether you see any effect
of residual_sugar on density.
(d, 10 points) Create a new qualitative variable, called ALevel, by binning the alcohol
variable into two categories (High and Medium). Specifically, divide the data into two groups
based on whether the alcohol level exceeds 11: alcohol greater than 11 is considered High;
otherwise it is considered Medium.
Now produce side-by-side boxplots of the ratio of sulphates to chlorides (hint: create a new
variable that calculates sulphates / chlorides) for each of the two ALevel categories. There
should be two boxes on your figure, one for High and one for Medium. How many samples are
in the High category?
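The binning and ratio steps might be sketched in plain Python as follows. The sample rows and the column name sulph_chlor_ratio are hypothetical; the per-category lists at the end are the shape of input a side-by-side boxplot routine typically expects.

```python
from collections import Counter

# Made-up rows (illustrative values only, not taken from the real file).
redwine = [
    {"alcohol": 9.4, "sulphates": 0.56, "chlorides": 0.08},
    {"alcohol": 11.0, "sulphates": 0.60, "chlorides": 0.10},
    {"alcohol": 11.2, "sulphates": 0.75, "chlorides": 0.05},
    {"alcohol": 12.5, "sulphates": 0.80, "chlorides": 0.04},
]

for row in redwine:
    # Strictly greater than 11 is High; exactly 11 falls in Medium.
    row["ALevel"] = "High" if row["alcohol"] > 11 else "Medium"
    # New derived variable (hypothetical name): sulphates / chlorides.
    row["sulph_chlor_ratio"] = row["sulphates"] / row["chlorides"]

# Number of samples in each category.
level_counts = Counter(row["ALevel"] for row in redwine)

# One list of ratios per category, ready to hand to a boxplot function.
ratios_by_level = {"High": [], "Medium": []}
for row in redwine:
    ratios_by_level[row["ALevel"]].append(row["sulph_chlor_ratio"])

print(level_counts)
```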
(e, 8 points) Produce a histogram showing the fixed_acidity numbers for both High and
Medium (ALevel) wine samples. You may choose to show both on a single plot (using side by
side bars) or produce one plot for High samples and one for Medium samples. Ensure whatever
figures you produce have appropriate axis labels and a title.
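If you show both groups on one plot, the bars are only comparable when both groups share the same bin edges. A small sketch of that idea, assuming a hypothetical bin width of 1.0 and made-up rows that already carry an ALevel label:

```python
import math
from collections import Counter

# Made-up rows that already carry an ALevel label (illustrative values only).
rows = [
    {"fixed_acidity": 7.4, "ALevel": "Medium"},
    {"fixed_acidity": 7.8, "ALevel": "Medium"},
    {"fixed_acidity": 8.1, "ALevel": "Medium"},
    {"fixed_acidity": 7.3, "ALevel": "High"},
    {"fixed_acidity": 11.2, "ALevel": "High"},
]

BIN_WIDTH = 1.0  # hypothetical choice; pick whatever suits the real data


def bin_start(value, width=BIN_WIDTH):
    """Return the left edge of the bin that contains value."""
    return math.floor(value / width) * width


# Count samples per bin separately for each category, using the same edges.
hist = {"High": Counter(), "Medium": Counter()}
for row in rows:
    hist[row["ALevel"]][bin_start(row["fixed_acidity"])] += 1

for level, counts in hist.items():
    print(level, sorted(counts.items()))
```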
(f, 10 points) Continue exploring the data, producing two new plots of any type, and
provide a brief (one to two sentence) summary of your hypotheses and what you discover. Feel
free to think outside the box on this one but if you want something to point you in the right
direction, look at the summary statistics for various features, and think about what they tell you.
Perhaps try plotting various features from the dataset against each other and see if any patterns
emerge.
2 (50 points). This exercise involves the forestfires.csv dataset which can be found under the
Datasets modules in Canvas. The features of the dataset are:
• X: x-axis spatial coordinate
• Y: y-axis spatial coordinate
• month: month of the year (‘jan’ to ‘dec’)
• day: day of the week (‘mon’ to ‘sun’)
• FFMC: Fine Fuel Moisture Code index
• DMC: Duff Moisture Code index
• DC: Drought Code index
• ISI: Initial Spread Index
• temp: Temperature in degrees Celsius
• RH: Relative Humidity in %
• wind: Wind speed (km/h)
• rain: Amount of rainfall (mm/m2)
• area: area burned in the forest fire (in hectares)
(a, 6 points) Specify which of the predictors are quantitative (measuring numeric
properties such as size or quantity) and which, if any, are qualitative (measuring non-numeric
properties such as color, appearance, or type). Keep in mind that a qualitative variable may be
represented as a quantitative type in the dataset, or the reverse. You may wish to adjust the types
of your variables based on your findings.
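One way to get a first pass at this in Python is to check which columns parse as numbers. The sketch below uses a made-up sample and a hypothetical helper is_number; note that it only reports how a column is stored, so a numerically stored column may still be better treated as qualitative after you inspect it.

```python
import csv
import io

# Made-up sample in the shape of forestfires.csv (illustrative values only).
sample = io.StringIO(
    "X,Y,month,day,temp,RH\n"
    "7,5,mar,fri,8.2,51\n"
    "7,4,oct,tue,18.0,33\n"
    "8,6,aug,sun,22.8,40\n"
)
rows = list(csv.DictReader(sample))  # every value is still a string here


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Label each column by whether all of its values parse as numbers.
column_kind = {}
for name in rows[0]:
    all_numeric = all(is_number(row[name]) for row in rows)
    column_kind[name] = "quantitative" if all_numeric else "qualitative"

print(column_kind)
```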
(b, 8 points) What are the range, mean, and standard deviation of each quantitative
predictor? Which month has the highest number of fires?
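A possible shape for these summaries in Python, using only the standard library. The helper summarize and the sample rows are assumptions for illustration; statistics.stdev is the sample standard deviation (n - 1 denominator), which matches R's sd().

```python
from collections import Counter
from statistics import mean, stdev

# Made-up rows (illustrative values only).
fires = [
    {"month": "aug", "temp": 20.0, "wind": 2.0},
    {"month": "aug", "temp": 24.0, "wind": 4.0},
    {"month": "sep", "temp": 22.0, "wind": 6.0},
]


def summarize(rows, column):
    """Return range, mean, and sample standard deviation of one column."""
    values = [row[column] for row in rows]
    return {
        "range": (min(values), max(values)),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
    }


temp_summary = summarize(fires, "temp")

# Each row is one fire, so counting rows per month counts fires per month.
busiest_month, n_fires = Counter(row["month"] for row in fires).most_common(1)[0]

print(temp_summary)
print(busiest_month, n_fires)
```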
(c, 8 points) Produce boxplots of relative humidity (RH) by month. Your figure will have
a boxplot for every month. Which month has the highest median RH value?
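To back up what the boxplots show, the per-month medians can also be computed directly. This is a sketch on made-up rows; MONTH_ORDER is a helper list introduced here so that months appear in calendar order rather than alphabetical order.

```python
from collections import defaultdict
from statistics import median

# Made-up rows (illustrative values only).
fires = [
    {"month": "mar", "RH": 51}, {"month": "mar", "RH": 45},
    {"month": "aug", "RH": 30}, {"month": "aug", "RH": 40}, {"month": "aug", "RH": 35},
    {"month": "dec", "RH": 60},
]

MONTH_ORDER = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Group RH values by month: one list per month, as a boxplot would need.
rh_by_month = defaultdict(list)
for row in fires:
    rh_by_month[row["month"]].append(row["RH"])

# Keep calendar order and skip months with no observations.
months_present = [m for m in MONTH_ORDER if m in rh_by_month]
median_rh = {m: median(rh_by_month[m]) for m in months_present}
top_month = max(median_rh, key=median_rh.get)

print(median_rh)
print("highest median RH:", top_month)
```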
(d, 10 points) Produce a bar plot to show the count of forest fires in each month for
which wind is greater than 4.9. During which months are high wind forest fires most common?
(Hint: filter data by wind, group data by month and calculate count.)
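The hint's filter, group, and count steps could look like this in plain Python. The rows are made up, and MONTH_ORDER is a helper list introduced here; the resulting (month, count) pairs are what a bar plot would take as input.

```python
from collections import Counter

# Made-up rows (illustrative values only).
fires = [
    {"month": "aug", "wind": 5.4},
    {"month": "aug", "wind": 4.9},
    {"month": "sep", "wind": 6.3},
    {"month": "sep", "wind": 5.8},
    {"month": "mar", "wind": 2.2},
]

WIND_THRESHOLD = 4.9
MONTH_ORDER = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Step 1: filter. Strictly greater than 4.9, so 4.9 itself is excluded.
high_wind = [row for row in fires if row["wind"] > WIND_THRESHOLD]

# Step 2 and 3: group by month and count.
counts = Counter(row["month"] for row in high_wind)

# Bar heights in calendar order, skipping months with no high-wind fires.
bar_data = [(m, counts[m]) for m in MONTH_ORDER if counts[m] > 0]
print(bar_data)
```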
(e, 10 points) Using the full data set, investigate the predictors graphically, using
scatterplots, correlation scores or other tools of your choice. Create a correlation matrix for the
relevant variables.
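For reference, a correlation matrix is just the Pearson correlation computed for every pair of quantitative columns. The sketch below writes the formula out by hand on made-up columns; in practice a library routine (for example cor() in R or DataFrame.corr() in pandas) does the same job.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up columns (illustrative values only).
data = {
    "temp": [10.0, 15.0, 20.0, 25.0],
    "RH": [80.0, 60.0, 40.0, 20.0],   # falls linearly as temp rises
    "wind": [2.0, 5.0, 3.0, 6.0],
}

names = list(data)
# Nested dict: corr[a][b] is the correlation between columns a and b.
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

Reading along one row of the matrix and sorting by absolute value gives a quick ranking of which variables move most closely with the chosen one.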
(f, 8 points) Suppose that we wish to predict the Initial Spread Index (ISI) based on the
other variables. Which, if any, of the other variables might be useful in predicting ISI? Justify
your answer based on the correlations you computed in part (e).