Principles of Urban Informatics Assignment 1

Data description
In this assignment we are going to build a set of Python functions to process 311 complaints data. All problems use a subset of the 311 data obtained from https://nycopendata.socrata.com/.
The data is given in the following format:
Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Incident Zip,Status
27241168,01/24/2014 10:43:00 AM,01/24/2014 01:00:00 PM,DEP,Department of Environmental Protection,Asbestos,Asbestos Complaint (B1),11222,Closed
27039760,01/02/2014 12:00:00 AM,01/05/2014 12:00:00 AM,HPD,Department of Housing Preservation and Development,PLUMBING,BASIN/SINK,11211,Closed
27289992,01/28/2014 11:57:27 PM,,DOB,Department of Buildings,Building/Use,No Certificate Of Occupancy/Illegal/Contrary To CO,11219,Open
27228583,01/23/2014 09:09:00 AM,01/23/2014 12:00:00 PM,DSNY,A - Staten Island,Snow,E9 Snow / Icy Sidewalk,10314,Closed
...
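As a rough illustration of the format above, the sketch below parses two of the sample lines with the standard csv module. The helper name read_complaints is hypothetical, and the sketch assumes every row has exactly the columns named in the header. It sticks to constructs that behave the same under Python 2.7 and Python 3.

```python
import csv

# Two lines copied from the sample above, preceded by the header line.
SAMPLE = [
    "Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,"
    "Descriptor,Incident Zip,Status",
    "27241168,01/24/2014 10:43:00 AM,01/24/2014 01:00:00 PM,DEP,"
    "Department of Environmental Protection,Asbestos,Asbestos Complaint (B1),"
    "11222,Closed",
    "27289992,01/28/2014 11:57:27 PM,,DOB,Department of Buildings,Building/Use,"
    "No Certificate Of Occupancy/Illegal/Contrary To CO,11219,Open",
]


def read_complaints(lines):
    """Parse CSV lines (header first) into a list of dicts keyed by column name.

    Assumption: rows are well formed, i.e. no missing or extra fields.
    """
    return [dict((key.strip(), value.strip()) for key, value in row.items())
            for row in csv.DictReader(lines)]


records = read_complaints(SAMPLE)
print(records[1]["Agency"])             # DOB
print(repr(records[1]["Closed Date"]))  # '' because this complaint is still open
```

Note that an open complaint has an empty Closed Date field, so code that parses dates should not assume that column is always filled in.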
Problem 1
Given an input data file, your script should find the range of dates in which complaints
were created. The code should get the name of the dataset as a command line parameter
and output the result as follows:
> python problem1.py sample_data_problem_1.csv
10 complaints between 01/02/2014 00:00:00 and 01/28/2014 23:57:27
Note: dates should be formatted exactly as shown, i.e., month/day/year hour:minutes:seconds, with the hour on a 24-hour clock.
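One possible way to approach Problem 1 is sketched below; it is not the reference solution. It assumes the range is taken over the Created Date column, that the file contains at least one complaint, and that dates always use the 12-hour AM/PM style shown in the data description. The function name date_range is hypothetical.

```python
import csv
import sys
from datetime import datetime

IN_FMT = "%m/%d/%Y %I:%M:%S %p"   # input style:  01/24/2014 10:43:00 AM
OUT_FMT = "%m/%d/%Y %H:%M:%S"     # output style: 01/24/2014 10:43:00


def date_range(lines):
    """Return the Problem 1 summary line for an iterable of CSV lines."""
    created = [datetime.strptime(row["Created Date"].strip(), IN_FMT)
               for row in csv.DictReader(lines)]
    # Assumption: at least one complaint; min()/max() raise ValueError otherwise.
    return "%d complaints between %s and %s" % (
        len(created),
        min(created).strftime(OUT_FMT),
        max(created).strftime(OUT_FMT))


# Only runs when a single file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as data_file:
        print(date_range(data_file))
```

Comparing parsed datetime values rather than the raw strings matters here: as text, "01/24/2014 10:43:00 AM" and "01/28/2014 11:57:27 PM" do not sort chronologically once AM and PM are involved.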
Problem 2
Given an input data file, display the number of complaints for each complaint type (type names are case
sensitive), formatted as in the example below:
> python problem2.py sample_data_problem_2.csv
Building/Use with 1 complaints
Asbestos with 1 complaints
APPLIANCE with 1 complaints
Non-Residential Heat with 1 complaints
Street Light Condition with 1 complaints
Snow with 1 complaints
HEATING with 2 complaints
GENERAL CONSTRUCTION with 1 complaints
PLUMBING with 1 complaints
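A minimal sketch of the counting step, assuming collections.Counter is acceptable for this problem. The function names count_by_type and format_counts are hypothetical. The problem statement does not specify an output order, so the sketch prints the types in whatever order the counter iterates.

```python
import csv
import sys
from collections import Counter


def count_by_type(lines):
    """Count complaints per 'Complaint Type'; names are compared case-sensitively."""
    return Counter(row["Complaint Type"] for row in csv.DictReader(lines))


def format_counts(counts):
    """Turn a {type: count} mapping into the output lines of Problem 2."""
    return ["%s with %d complaints" % (name, n) for name, n in counts.items()]


# Only runs when a single file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as data_file:
        for line in format_counts(count_by_type(data_file)):
            print(line)
```

Because the type names are used as dictionary keys without any normalization, "HEATING" and "Heating" would be counted as two different types, which is what the case-sensitivity requirement asks for.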
Problem 3
As in Problem 2, process the given input data file to output the number of complaints for each complaint type, but this time in descending order of number of complaints, i.e., from the types with the most complaints to those with the fewest, as in the example:
> python problem3.py sample_data_problem_3.csv
HEATING with 2 complaints
APPLIANCE with 1 complaints
Asbestos with 1 complaints
Building/Use with 1 complaints
GENERAL CONSTRUCTION with 1 complaints
Non-Residential Heat with 1 complaints
PLUMBING with 1 complaints
Snow with 1 complaints
Street Light Condition with 1 complaints
Note: when several complaint types have the same number of complaints, sort those types alphabetically, as in the example.
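The ordering rule can be expressed as a single sort key: the negated count first, then the type name. The sketch below uses a hypothetical helper rank_counts and a hand-written dictionary of counts in place of a real data file.

```python
def rank_counts(counts):
    """Order (type, count) pairs by count descending, then by type name ascending."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts that mimic part of the example output.
counts = {"Snow": 1, "PLUMBING": 1, "HEATING": 2, "Asbestos": 1, "APPLIANCE": 1}
for name, n in rank_counts(counts):
    print("%s with %d complaints" % (name, n))
# HEATING with 2 complaints
# APPLIANCE with 1 complaints
# Asbestos with 1 complaints
# PLUMBING with 1 complaints
# Snow with 1 complaints
```

The name comparison here is Python's default string ordering, which is case-sensitive. It reproduces the example output, but the assignment does not say whether "alphabetically" is meant to ignore case; using item[0].lower() in the key would be the case-insensitive variant.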
Problem 4
Given an input data file and an integer k, your script should compute and output the
top-k complaint types in terms of number of complaints. The code should get the
name of the dataset and k as command line parameters, and output the result ordered
in descending number of complaints as follows:
> python problem4.py sample_data_problem_4.csv 3
HEATING with 2 complaints
APPLIANCE with 1 complaints
Asbestos with 1 complaints
Example of how to handle draws: if there are 5 top complaint types (say B, A, C, E, D)
with the same number of complaints, a top-3 query should show only types A, B, C as
below:
> python problem4.py sample_data_problem_4.csv 3
A with 221 complaints
B with 221 complaints
C with 221 complaints
Note: when several complaint types have the same number of complaints, sort those types alphabetically, as in the example.
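The draw example above can be reproduced by sorting on (count descending, name ascending) and keeping the first k entries. The helper top_k below is a hypothetical name, and the counts are the made-up values from the example. In a full script, k would be read from the second command line parameter, for instance with int(sys.argv[2]).

```python
def top_k(counts, k):
    """Return the k highest (type, count) pairs; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Five types tied at 221 complaints, as in the draw example.
draw = {"B": 221, "A": 221, "C": 221, "E": 221, "D": 221}
for name, n in top_k(draw, 3):
    print("%s with %d complaints" % (name, n))
# A with 221 complaints
# B with 221 complaints
# C with 221 complaints
```

Slicing never fails when k exceeds the number of types; in that case every type is returned.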
Problem 5
Given an input data file, your script should compute the number of complaints per day
of week. The code should get the name of the dataset as a command line parameter
and output the result as follows (should start on Monday and end on Sunday):
> python problem5.py sample_data_problem_5.csv
Monday == 0
Tuesday == 2
Wednesday == 1
Thursday == 2
Friday == 4
Saturday == 0
Sunday == 1
Note: we recommend using strptime from the time module to extract the day of
the week from a date.
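Following the recommendation above, time.strptime returns a struct_time whose tm_wday field is 0 for Monday through 6 for Sunday, which matches the required output order. The sketch below assumes the day of week is taken from the Created Date column (the assignment does not say which date column to use); the function name complaints_per_weekday is hypothetical.

```python
import csv
import sys
import time

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
DATE_FMT = "%m/%d/%Y %I:%M:%S %p"   # e.g. 01/24/2014 10:43:00 AM


def complaints_per_weekday(lines):
    """Return the seven output lines of Problem 5, Monday first."""
    counts = [0] * 7
    for row in csv.DictReader(lines):
        # Assumption: the weekday comes from the 'Created Date' column.
        parsed = time.strptime(row["Created Date"].strip(), DATE_FMT)
        counts[parsed.tm_wday] += 1   # tm_wday: Monday == 0 ... Sunday == 6
    return ["%s == %d" % (day, n) for day, n in zip(DAYS, counts)]


# Only runs when a single file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as data_file:
        for line in complaints_per_weekday(data_file):
            print(line)
```

Starting from a fixed list of seven zeros guarantees that days without complaints are still printed with a count of 0.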
Problem 6
Given an input data file, output for each agency the zip code that generates the largest
number of complaints for that agency. The agencies should be listed in alphabetical order (only output agencies with at least one complaint that has a valid zip code).
You should not assume all complaints contain a valid zip code and agency. The code
should get the name of the dataset as a command line parameter and output the result
as follows:
> python problem6.py sample_data_problem_6.csv
DEP 11222 1
DOB 11219 1
DOHMH 11220 1
DOT 11214 1
DSNY 10314 1
HPD 10011 10459 10460 11209 11211 1
In each line of the output, the first string is the agency name, the second one corresponds to the zip code (or zip codes, see below for handling ties) with the maximum
number of complaints for the agency, and the last number corresponds to the number of
complaints in the selected zip code for the agency in question.
Note: if two or more zip codes are tied for the most complaints, output the list of
all zip codes with the maximum number of complaints in lexicographical order, as in the
example above (the agency HPD has multiple zip codes with 1 complaint in this data
file).
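A possible structure for Problem 6 is one counter of zip codes per agency. The assignment does not define what makes a zip code valid, so the sketch below assumes "exactly five digits"; that rule, and the names is_valid_zip and top_zips_per_agency, are illustrative choices rather than part of the assignment.

```python
import csv
import sys
from collections import Counter, defaultdict


def is_valid_zip(zip_code):
    # Assumption: a valid zip code is exactly five digits.
    return len(zip_code) == 5 and zip_code.isdigit()


def top_zips_per_agency(lines):
    """Return one 'AGENCY zip [zip ...] count' line per agency, sorted by agency."""
    per_agency = defaultdict(Counter)
    for row in csv.DictReader(lines):
        agency = (row.get("Agency") or "").strip()
        zip_code = (row.get("Incident Zip") or "").strip()
        # Rows with a blank agency or an invalid zip code are ignored.
        if agency and is_valid_zip(zip_code):
            per_agency[agency][zip_code] += 1

    output = []
    for agency in sorted(per_agency):
        counts = per_agency[agency]
        best = max(counts.values())
        # All zip codes tied at the maximum, in lexicographical order.
        tied = sorted(z for z, n in counts.items() if n == best)
        output.append("%s %s %d" % (agency, " ".join(tied), best))
    return output


# Only runs when a single file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as data_file:
        for line in top_zips_per_agency(data_file):
            print(line)
```

An agency only gets an entry in per_agency once it has a complaint with a valid zip code, so agencies without any such complaint are left out of the output automatically.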
Problem 7
Now use the file zip_borough.csv, together with the complaints data file, to count the number of complaints per borough. The file zip_borough.csv contains a list of zip codes and, for
each zip code, its borough.
The code should get the names of the dataset and of the zip_borough.csv file as
command line parameters and output the result as follows:
> python problem7.py sample_data_problem_7.csv zip_borough.csv
Brooklyn with 112 complaints
Queens with 66 complaints
Bronx with 65 complaints
Manhattan with 44 complaints
Staten Island with 13 complaints
You can assume that all complaints in the dataset contain a valid zip code and
borough. Note that the sum of all complaints in the output is 300, equal to the number of complaints in the input; this means that all complaints in the input file sample_data_problem_7.csv have a valid zip code.
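The join between the two files can be sketched as a dictionary lookup from zip code to borough. The exact layout of zip_borough.csv is not described here, so the sketch assumes the zip code is in the first column and the borough in the second, and it skips any row whose first column is not numeric (such as a possible header line). The descending order and the alphabetical tie-break are inferred from the example output. All function names are hypothetical.

```python
import csv
import sys
from collections import Counter


def load_zip_to_borough(lines):
    """Build a {zip: borough} dict. Assumes columns are: zip code, borough."""
    mapping = {}
    for row in csv.reader(lines):
        # Rows whose first column is not numeric (e.g. a header) are skipped.
        if len(row) >= 2 and row[0].strip().isdigit():
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def complaints_per_borough(complaint_lines, zip_lines):
    """Return 'Borough with N complaints' lines, most complaints first."""
    zip_to_borough = load_zip_to_borough(zip_lines)
    counts = Counter()
    for row in csv.DictReader(complaint_lines):
        # Problem 7 guarantees every zip code is valid, so a direct lookup is used.
        counts[zip_to_borough[row["Incident Zip"].strip()]] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%s with %d complaints" % (borough, n) for borough, n in ranked]


# Only runs when both file names are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as data_file, open(sys.argv[2]) as zip_file:
        for line in complaints_per_borough(data_file, zip_file):
            print(line)
```

Loading the mapping once into a dictionary keeps each complaint lookup constant-time, instead of rescanning the zip file for every complaint.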
Problem 8
Repeat Problem 7, but do not assume that all complaints have valid zip codes, i.e.,
some lines contain a blank zip code and should be ignored. The input and output
should follow the same pattern as Problem 7. For example:
> python problem8.py sample_data_problem_8.csv zip_borough.csv
Brooklyn with 98 complaints
Bronx with 62 complaints
Manhattan with 51 complaints
Queens with 50 complaints
Staten Island with 16 complaints
Note that the sum of all complaints in the output is not 300. This means that some
of the complaints do not have a valid zip and have been ignored.
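The only change from the all-valid case is that a failed lookup must be skipped rather than treated as an error. The sketch below isolates that step: it takes the zip codes and a zip-to-borough dictionary directly, both made up for illustration, and also reports how many complaints were ignored. Treating zip codes that are absent from the mapping the same way as blank ones is an assumption; the function name count_boroughs is hypothetical.

```python
from collections import Counter


def count_boroughs(zip_codes, zip_to_borough):
    """Count complaints per borough, ignoring blank or unknown zip codes.

    Returns (counts, skipped), where skipped is the number of ignored complaints.
    """
    counts = Counter()
    skipped = 0
    for raw in zip_codes:
        # dict.get returns None for blank or unmapped zip codes.
        borough = zip_to_borough.get((raw or "").strip())
        if borough is None:
            skipped += 1
        else:
            counts[borough] += 1
    return counts, skipped


# Made-up mapping and zip codes for illustration only.
zip_to_borough = {"11222": "Brooklyn", "10314": "Staten Island"}
counts, skipped = count_boroughs(
    ["11222", "", "10314", " ", "11222", "99999"], zip_to_borough)
print("%d %d %d" % (counts["Brooklyn"], counts["Staten Island"], skipped))
# 2 1 3
```

Keeping a skipped count is optional, but it makes it easy to check that the printed totals plus the ignored complaints add up to the number of input lines.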
Setting up
If you do not yet have Python installed on your system, you can check the installation instructions in the Dive into Python book, available at http://www.diveintopython.net/. In this course, we will be using Python 2.* (e.g., 2.7 should be enough). If you
install any 3.* version, the syntax is going to be a little different.
The sample datasets can be obtained at http://vgc.poly.edu/projects/gx5003-fall2014/week1/lab/data/sample_data_problem_x.csv (with x replaced by the problem number). Similarly, the zip borough file can be obtained at http://vgc.poly.edu/projects/gx5003-fall2014/week1/lab/data/zip_borough.csv.
Questions
Any questions should be sent to the teaching staff (Instructor Role and Teaching Assistant Role) through the NYU Classes system.
How to submit your assignment?
Your assignment should be submitted using the NYU Classes system. You should submit all your Python code (do not submit the data files) in a zip file named NetID_assignment_1.zip,
replacing NetID with your NYU Net ID. As illustrated above, for each problem
you should create a .py file called problemx.py, where x is the problem number.
Grading
The grading is going to be done by a series of tests, with manual inspection when required. Note that grading by manual inspection is very subjective; you
should therefore try as much as possible to make your code run as specified, so that the amount of
subjective grading is minimized. Make sure that your code runs on the sample datasets
as specified.