COMP 333 — Lab Assignment 2

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment || To Order Original Work Click Custom Order?

Description

5/5 - (4 votes)

Motivation
The purpose of this assignment is develop Python code for Descriptive Data Analysis.
It builds on the lectures of Week 3–4 and handles quantitative and visual descriptions.
You are to extend the capabilities of the describe() function of pandas and scipy, and
to produce a grid of plots with univariate plots down the diagonal of the grid and bivariate
plots off the diagonal.
Assignment
Create a Jupyter notebook using Python code and any of its libraries, but especially pandas,
to write and test code to carry out the Descriptive Data Analysis tasks below in (1)–(3).
I Write your own Python functions quantDDA() and vizDDA() within the notebook, and
illustrate their use within the notebook.
I Structure your code and document your work.
I Test your code on the three examples of Week 2.
Organize your notebook to clearly separate and identify your work on parts (1), (2), and (3).
(1) Quantitative Descriptions (6 marks) Write your own Python function quantDDA()
to extend the capabilities of the describe() function of pandas and scipy. Your function
should take a dataframe as input, and for each feature, should report
I number of observations
I number of entries
I number of unique entries
I number of missing entries
I number of outliers
I number of extreme values
I mode, or modes
I mean
I standard deviation
I max
I min
I Q3
I Q2 (median)
I Q1
I skewness
I kurtosis
Show your code working on the three examples of Week 2.
(2) Visual Descriptions (3 marks) Write your own Python function vizDDA() to produce
a grid of plots. The grid is a square grid indexed by the features of the dataframe in both
dimensions. Down the diagonal is a univariate plot for each feature; and each off-diagonal
grid entry is a bivariate plot for the pair of features.
Remember to choose the plot that is appropriate to the type of data in the feature, or the
type of data in each of the two features.
You may begin by assuming that you have only continuous data, so that the histogram is an
appropriate univariate plot, and that the scatter plot is an appropriate bivariate plot. And
then work from there to improve your code to consider other types of data.
As well as the grid of plots include a heatmap of the missing values in the dataset.
Show your code working on the three examples of Week 2.
(3) Missing Values per Observation (1 marks) In preparation for Data Wrangling you
want to know how many missing values there are in each observation, and you wish for the
DDA to show the distribution of the number of missing values across the observations.
For your dataframe, engineer a new feature MissingValueCount which records the number
of missing values in each observation as a new column in the dataframe.
Run your function quantDDA() on the modified dataframe. It should show a description of
the distribution of missing values of the observations in the column for MissingValueCount.
Also run your function vizDDA() on the modified dataframe.
Show your code working on the three examples of Week 2.
Deliverable
Your deliverable is the completed ipynb notebook showing all computation, output, and
plots.
Remember that your notebook should clearly identify your work on parts (1), (2), and (3).