DATA 201: Thinking with Data Assignment 3: Data Cleaning


Goal
This assignment helps you learn about data cleaning.
Technology
Excel, OpenRefine
Submit
Submit as a PDF, with images included and all questions answered, in D2L->Assessments->Dropbox.
Description [50 marks]
You should submit one PDF document that clearly answers the following questions.
Use the dataset (DATA201W21A3-Dataset.csv) from D2L for this assignment. The dataset
contains inconsistently formatted data entries, missing values, etc.
Your task is to clean this dataset. Show us the process of cleaning 5 types of
“dirty” entities (e.g., all leading spaces in a given column count as one type, and
duplicated rows count as another). The processes
must be substantially different from each other (e.g., removing excess spaces from column 1
and removing excess spaces from column 2 count as one single process). For each
cleaning process, you need to clearly show us:
1. A before cleaning screenshot of the dataset — highlighting the “dirty” parts (1 mark)
2. An after screenshot of the cleaned data — highlighting the parts that were “cleaned” (1
mark)
3. What data quality issue is being cleaned (2 marks)
4. An explanation of why you thought these entities needed to be cleaned and how cleaning
them will help during analysis (6 marks). Note: Answers that are too general, e.g., “easier to
read” and “looks better”, are not acceptable.
The dataset does not need to be completely cleaned. Make sure all 5 processes are clearly
explained and all screenshots are legible — no marks will be given for illegible screenshots.
You must use Excel in at least two of the processes and OpenRefine in at least two of the
processes.
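For illustration only (the assignment itself must be done in Excel and OpenRefine), the two example issues named above, leading spaces in a column and duplicated rows, can be sketched in Python. The sample rows and the helper names strip_cells, drop_duplicates, and clean are hypothetical and are not taken from DATA201W21A3-Dataset.csv.

```python
import csv
import io

# Hypothetical sample data; NOT taken from DATA201W21A3-Dataset.csv.
SAMPLE = """name,city
  Alice,Calgary
Bob,  Edmonton
Bob,Edmonton
Alice,Calgary
"""


def strip_cells(rows):
    """Remove leading and trailing whitespace from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean(text):
    """Parse CSV text, trim whitespace, then remove duplicate data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Trim first: "  Alice" and "Alice" only match as duplicates once trimmed.
    return [header] + drop_duplicates(strip_cells(body))


if __name__ == "__main__":
    for row in clean(SAMPLE):
        print(row)
```

The order of the two steps matters in this sketch: rows that differ only by stray spaces are not detected as duplicates until the spaces are removed, which is one concrete way a whitespace issue can distort counts during analysis.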
Note: The way you structure and organize your answers in assignments is a form of
presentation (an important part of the whole data analysis pipeline). It shows how well you can
convey your ideas to your audience (in this case, your TAs). You should make sure your answers
are easy to follow and understand. Up to 10 marks can be deducted for poor organization and
formatting of your document.
Rubrics
For Each Written Question
80% – 100%: Answers are excellent or only have minor mistakes. Detailed explanations are
provided. Analysis (if applicable or required) must be clear and well-expressed. Include
adequate visual aids if applicable.
60% – 79%: Answers are clear but with obvious mistakes. A decent job overall.
40% – 59%: Missing some important parts of the answers.
1% – 39%: Sloppy or incomplete.
0%: No answer.
You can find the mark breakdown on D2L (DATA201W21A2-Feedback.xlsx).
Submit the following using the Assignment 3 Dropbox in D2L:
1. PDF report named DATA201W21A3-Name.pdf, e.g., DATA201W21A3-jwhudson.pdf