Description
1 Objectives
1. Practice efficiently manipulating data with Python
2. Use the matplotlib, pandas libraries
3. Gain familiarity with data import and plotting
2 Tasks
1. Download the “Adult” data set from the UCI Machine Learning data repository: https://archive.
ics.uci.edu/ml/datasets/Adult. This dataset is record of adults, along with various occupational
and lifestyle attributes. Each adult is “labeled” as to whether or not they make more or less than $50k
per year. Using this as a driving label, one would typically want to design a process to determine what
combinations of factors enable a person to make more than $50k per year.
2. Read the data into a pandas DataFrame object.
3. Pivot the data so that you can:
(a) Rank each occupation in terms of most likely to earn more than $50k.
(b) Calculate the median age of people who make more and less than $50k. Calculate the medians
for each again, but now split apart by the sex of the adults.
(c) Calculate the mean and median number of years of education held by adults who make more and
less than $50k.
(d) Plot a histogram to see the distribution of years of school held by adults who make more and less
than $50k.
(e) Plot a histogram to see the distribution of ages of adults who make more and less than $50k.
4. Practice some other types of visualization with your choice of variables:
(a) Plot at least one bar plot that shows a trend within a variable that does not have inter-relationship.
(b) Plot at least one line plot that shows a trend, where the x-axis variable manifests fully across the
plot domain.
(c) Plot at least one scatter plot of the data that suggests a relationship between two discrete variables, where the x-axis has an inter-related progression (amount, cost, process, etc.) but is not
fully/evenly observed across the plot domain.
(d) Plot a histogram of a quantity that varies across the x-axis domain, but where we would like to
consider a general distribution, rather than individual observations.
1
5. Using pivoting, plotting, sorting, etc., attempt to answer the following questions with the data. Record
your responses, with visual examples, in your notebook.
(a) What is the relationship between between gender and whether a person makes more than $50k?
(b) When a person makes more than $50, what is the relationship between gender and occupation?
What about for less?
(c) When are the “richest” professions in each possible native country?
(d) What is the relationship between race and level of education? Does it further seem to relate to
whether a person makes more than $50k?
3 Submission
Zip your source files and upload them to the assignment page on MyClasses. Be sure to include all source
files, properly documented, a README file to describe the program and how it works, along with answers to
any above discussion questions.
2