Description
1 Logistic Regression [25 points]
A Logistic Regression classifier can be trained with historical health-care data to make
future predictions. A training set $D$ is composed of $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{0, 1\}$ is the
label and $x_i \in \mathbb{R}^d$ is the feature vector of the $i$-th patient. In logistic regression we have
$p(y_i = 1 \mid x_i) = \sigma(w^T x_i)$, where $w \in \mathbb{R}^d$ is the learned coefficient vector and
$\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function.
Suppose your system continuously collects patient data and predicts patient severity using Logistic Regression. When a patient data vector $x$ arrives at your system, the system
needs to predict whether the patient has a severe condition (predicted label $\hat{y} \in \{0, 1\}$) and
requires immediate care or not. The result of the prediction will be delivered to a physician,
who can then take a look at the patient. Finally, the physician will provide feedback (true
label $y \in \{0, 1\}$) back to your system so that the system can be upgraded, i.e. $w$ recomputed,
to make better predictions in the future.
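For concreteness, the prediction step above can be sketched in a few lines of Python. This is an illustration only; the dense list representation and the 0.5 decision threshold are assumptions, not part of the assignment:

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(w, x, threshold=0.5):
    # w and x are equal-length lists of floats (dense representation).
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))   # p(y = 1 | x)
    return 1 if p >= threshold else 0                   # predicted label y_hat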
NOTE: We will not accept hand-written work, screenshots, or other images for the derivations
in this section. Please use Microsoft Word or LaTeX and convert to PDF for your final
submission.
1.1 Batch Gradient Descent
The negative log-likelihood can be calculated according to
$$NLL(D, w) = -\sum_{i=1}^{N} \left[ (1 - y_i)\,\log\bigl(1 - \sigma(w^T x_i)\bigr) + y_i\,\log \sigma(w^T x_i) \right]$$
The maximum likelihood estimator $w_{MLE}$ can be found by solving $\arg\min_{w} NLL(D, w)$ through
an iterative gradient descent procedure.
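To make the objective concrete, here is a minimal Python sketch (toy data, illustration only) that evaluates $NLL(D, w)$ directly from the formula above; deriving its gradient in part (a) is still left to you:

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def nll(data, w):
    # data is a list of (x, y) pairs with x a list of floats and y in {0, 1}.
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= (1 - y) * math.log(1 - p) + y * math.log(p)
    return total

# Toy example: two patients, two features each.
print(nll([([1.0, 0.5], 1), ([0.2, 1.5], 0)], [0.1, -0.3]))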
a. Derive the gradient of the negative log-likelihood in terms of w for this setting. [5
points]
1.2 Stochastic Gradient Descent
If N and d are very large, it may be prohibitively expensive to consider every patient in D
before applying an update to w. One alternative is to consider stochastic gradient descent,
in which an update is applied after only considering a single patient.
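Schematically, a stochastic update touches one patient at a time, as in the Python sketch below; the gradient of the single-example loss is deliberately left as a placeholder, since deriving it is exactly what the questions that follow ask for:

def sgd_step(w, x_t, y_t, eta, single_example_gradient):
    # single_example_gradient(w, x_t, y_t) must return a list the same length
    # as w; its concrete form is what you derive in parts a and b below.
    g = single_example_gradient(w, x_t, y_t)
    return [wj - eta * gj for wj, gj in zip(w, g)]

# Streaming use: update w as each (x_t, y_t) pair arrives,
# instead of waiting for a full pass over all N patients.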
a. Show the log-likelihood, $\ell$, of a single $(x_t, y_t)$ pair. [5 points]
b. Show how to update the coefficient vector $w_t$ when you get a patient feature vector
$x_t$ and physician feedback label $y_t$ at time $t$, using $w_{t-1}$ (assume the learning rate $\eta$ is given). [5
points]
c. What is the time complexity of the update rule from b if $x_t$ is very sparse? [2 points]
d. Briefly explain the consequences of using a very large $\eta$ and a very small $\eta$. [3 points]
e. Show how to update $w_t$ under an L2-norm regularization penalty. In other words,
update $w_t$ according to $\ell - \mu \|w\|_2^2$, where $\mu$ is a constant. What is the time complexity? [5
points]
f. When you use the L2 norm, you will find that each time you get a new $(x_t, y_t)$ you need to
update every element of the vector $w_t$, even if $x_t$ has very few non-zero elements. Write the
pseudo-code for updating $w_t$ lazily. [Extra 5 points] (no partial credit!)
HINT: Update the $j$-th element of $w_t$, $w_{tj}$, only when the $j$-th element of $x_t$, $x_{tj}$, is non-zero.
You can refer to Sections 10 and 11 and the appendix of this paper.
2 Programming [75 points]
First, follow the instructions to install the environment if you haven't done that yet. You
will also need the hw2/data/ folder from Canvas.
You will then need to install Python 2 for this homework due to limitations in
Hadoop’s Streaming Map Reduce interface. To install Python 2, connect to your Docker
instance and begin by downloading the Anaconda 2 installation script and kicking it off with
the commands below:
curl https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh > conda2.sh
Kick off the installation using bash conda2.sh. Follow the prompts, and when it asks if you
want to change the installation path, enter /usr/local/conda2 instead of the default. You
may choose to add this Python path to your bash profile when prompted, or you can manually
call Python 2 using /usr/local/conda2/bin/python for the remainder of the assignment.
Please ensure you are always using Python 2 from this point on, or you may encounter problems!
2.1 Descriptive Statistics [10 points]
Computing descriptive statistics on the data helps in developing predictive models. In this
section, you need to write HIVE code that computes various metrics on the data. Skeleton
code is provided as a starting point.
The definitions of the terms used in the result table are described below:
• Event Count: Number of events recorded for a given patient. Note that every line
in the input file is an event.
• Encounter Count: Count of unique dates on which a given patient visited the ICU.
• Record Length: Duration (in number of days) between first event and last event for
a given patient.
• Common Diagnosis: the 5 most frequently occurring diseases.
• Common Laboratory Test: the 5 most frequently conducted tests.
• Common Medication: the 5 most frequently prescribed medications.
While counting common diagnoses, lab tests and medications, count all occurrences of the
codes; e.g. if one patient has the same code 3 times, the total count for that code should
include all 3 occurrences. In other words, the count is per code, not per patient.
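The counting rule can be illustrated with a short Python sketch (conceptual only; your graded solution must be written in HIVE, and the event tuples here are made up):

from collections import Counter

# Hypothetical diagnosis events: (patient_id, code)
events = [(1, 'DIAG111'), (1, 'DIAG111'), (1, 'DIAG111'),
          (2, 'DIAG111'), (2, 'DIAG222')]

# Count every occurrence of every code, regardless of which patient it belongs to.
code_counts = Counter(code for _, code in events)
print(code_counts.most_common(5))   # [('DIAG111', 4), ('DIAG222', 1)]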
a. Complete hive/event_statistics.hql to compute the statistics required in this question.
Please be aware that you are not allowed to change the filename.
b. Use events.csv and mortality.csv provided in data as input and fill Table 2 with
actual values. We only need the top 5 codes for common diagnoses, labs and medications.
Their respective counts are not required.
Metric                           Deceased patients    Alive patients
Event Count
  1. Average Event Count
  2. Max Event Count
  3. Min Event Count
Encounter Count
  1. Average Encounter Count
  2. Max Encounter Count
  3. Min Encounter Count
Record Length
  1. Average Record Length
  2. Median Record Length
  3. Max Record Length
  4. Min Record Length
Common Diagnosis
Common Laboratory Test
Common Medication

Table 2: Descriptive statistics for deceased and alive patients
Deliverable: code/hive/event_statistics.hql [10 points]
2.2 Transform data [20 points]
In this problem, we will convert the raw data to a standardized format using Pig. Diagnostic,
medication and laboratory codes for each patient should be used to construct the feature
vector, and the feature vector should be represented in SVMLight format. You will work
with the events.csv and mortality.csv files provided in the data folder.
Listed below are a few concepts you need to know before beginning feature construction
(for details please refer to the lectures); a brief conceptual sketch illustrating the index date
and filtering logic follows the list.
• Observation Window: The time interval containing events you will use to construct
your feature vectors. Only events in this window should be considered. The observation
window ends on the index date (defined below) and starts 2000 days prior to the index
date (day 2000 included).
• Prediction Window: A fixed time interval following the index date where we are
observing the patient’s mortality outcome. This is to simulate predicting some length
of time into the future. Events in this interval should not be included while constructing
feature vectors. The size of the prediction window is 30 days.
• Index date: The day on which we will predict the patient's probability of dying
during the subsequent prediction window. Events occurring on the index date should
be considered within the observation window. The index date is determined as follows:
– For deceased patients: Index date is 30 days prior to the death date (timestamp
field) in mortality.csv.
– For alive patients: Index date is the last event date in events.csv for each alive
patient.
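As a conceptual illustration only (the graded solution must be written in Pig), the index date and observation window logic could be expressed in Python roughly as follows; the helper names are hypothetical and dates are assumed to be datetime.date objects:

from datetime import timedelta

def index_date(patient_events, death_date=None):
    # patient_events: list of (event_id, date, value) tuples for one patient.
    if death_date is not None:                           # deceased patient
        return death_date - timedelta(days=30)
    return max(date for _, date, _ in patient_events)    # alive patient

def filter_events(patient_events, idx_date, window_days=2000):
    # Keep only events inside the observation window, which ends on the
    # index date and starts 2000 days before it (both endpoints included).
    start = idx_date - timedelta(days=window_days)
    return [e for e in patient_events if start <= e[1] <= idx_date]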
You will work with the following files in the code/pig folder:
• etl.pig: Complete this script based on the provided skeleton.
• utils.py: Implement any necessary User Defined Functions (UDFs) in Python in this file
(optional).
In order to convert the raw events to features, you will need a few steps (a compact, illustration-only sketch of these steps appears after the list):
1. Compute the index date: [4 points] Use the definition provided above to compute the
index date for all patients.
2. Filter events: [4 points] Consider an observation window (2000 days) and prediction
window (30 days). Remove the events that occur outside the observation window.
3. Aggregate events: [4 points] To create features suitable for machine learning, we will
need to aggregate the events for each patient as follows:
• count: count the occurrences of diagnostic, lab and medication events (i.e. event ids starting with DIAG, LAB and DRUG respectively).
Each event type will become a feature, and we will directly use the event id as the feature
name. For example, given the raw event sequence below for a patient,
1053,DIAG319049,Acute respiratory failure,2924-10-08,1.0
1053,DIAG197320,Acute renal failure syndrome,2924-10-08,1.0
1053,DRUG19122121,Insulin,2924-10-08,1.0
1053,DRUG19122121,Insulin,2924-10-11,1.0
1053,LAB3026361,Erythrocytes in Blood,2924-10-08,3.000
1053,LAB3026361,Erythrocytes in Blood,2924-10-08,3.690
1053,LAB3026361,Erythrocytes in Blood,2924-10-09,3.240
1053,LAB3026361,Erythrocytes in Blood,2924-10-10,3.470
We can get the feature value pairs (event id, value) for this patient with ID 1053 as
(DIAG319049, 1)
(DIAG197320, 1)
(DRUG19122121, 2)
(LAB3026361, 4)
4. Generate feature mapping: [4 points] In the result above, you see the feature value as well
as the feature name (the event id here). Next, you need to assign a unique identifier to
each feature. Sort all unique feature names in ascending alphabetical order and assign
continuous feature ids starting from 0. Thus the result above can be mapped to
(1, 1)
(0, 1)
(2, 2)
(3, 4)
5. Normalization: [4 points] In machine learning algorithms like logistic regression, it is
important to normalize different features onto the same scale. Implement min-max
normalization on your results. (Hint: $\min(x_i)$ maps to 0 and $\max(x_i)$ maps to 1 for feature $x_i$;
$\min(x_i)$ is zero for count-aggregated features, so each value is simply divided by that feature's maximum.)
6. Save in SVMLight format: If the dimensionality of a feature vector is large but the
feature vector is sparse (i.e. it has only a few nonzero elements), a sparse representation
should be employed. In this problem you will use the provided data for each patient to
construct a feature vector and represent the feature vector in SVMLight format, shown
below:
<line>    .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target>  .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | qid
<value>   .=. <float>
<info>    .=. <string>
For example, the feature vector in SVMLight format will look like:
1 2:0.5 3:0.12 10:0.9 2000:0.3
0 4:1.0 78:0.6 1009:0.2
1 33:0.1 34:0.98 1000:0.8 3300:0.2
1 34:0.1 389:0.32
where 1 or 0 indicates whether the patient is dead or alive (i.e. the target), followed by a
series of feature-value pairs sorted by the feature index value.
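Pulling steps 3 to 6 together, the Python sketch below shows the aggregation, feature-id mapping, min-max normalization, and SVMLight formatting for a single patient. It is conceptual only: the graded implementation must be written in Pig, and the dictionaries feature_id and feature_max are hypothetical inputs you would compute over the whole data set.

from collections import Counter

def to_svmlight(label, patient_events, feature_id, feature_max):
    # patient_events: list of event ids observed for one patient inside the
    # observation window, e.g. ['DIAG319049', 'DRUG19122121', 'DRUG19122121'].
    # feature_id:  dict mapping event id -> integer id (alphabetical order, from 0).
    # feature_max: dict mapping event id -> maximum count over all patients.
    counts = Counter(patient_events)                       # step 3: aggregate
    pairs = []
    for event_id, count in counts.items():
        fid = feature_id[event_id]                         # step 4: map to feature id
        value = float(count) / feature_max[event_id]       # step 5: min-max, min = 0
        pairs.append((fid, value))
    pairs.sort()                                           # step 6: sort by feature index
    return str(label) + ' ' + ' '.join('%d:%.6f' % (fid, v) for fid, v in pairs)

# Hypothetical usage with the patient 1053 events shown earlier
# (the label and the per-feature maxima are made up for illustration):
events_1053 = ['DIAG319049', 'DIAG197320', 'DRUG19122121', 'DRUG19122121',
               'LAB3026361', 'LAB3026361', 'LAB3026361', 'LAB3026361']
feature_id  = {'DIAG197320': 0, 'DIAG319049': 1, 'DRUG19122121': 2, 'LAB3026361': 3}
feature_max = {'DIAG197320': 2, 'DIAG319049': 3, 'DRUG19122121': 4, 'LAB3026361': 8}
print(to_svmlight(1, events_1053, feature_id, feature_max))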
To run your pig script in local mode, you will need the command:
sudo pig -x local etl.pig
Deliverable: pig/etl.pig and pig/utils.py [20 points]
2.3 SGD Logistic Regression [15 points]
In this question, you are going to implement your own Logistic Regression classifier in Python
using the equations you derived in question 1.2.e. To help you get started, we have provided
skeleton code. You will find the relevant code files in the lr folder. You will train and test a
classifier by running
1. cat path/to/train/data | python train.py -f
2. cat path/to/test/data | python test.py
The training and testing data for this problem will be the output of the previous Pig ETL problem.
To better understand the performance of your classifier, you will need to use standard
metrics like AUC. As with Homework 1, we provide code/environment.yml, which contains a list of libraries needed to set up the environment for this homework. You can use it to
create a copy of the conda environment (http://conda.pydata.org/docs/using/envs.html#use-environment-from-file). If you already have your own Python development environment (it
should be Python 2.7), please refer to this file to find the necessary libraries. It will help you
install the necessary modules for drawing an ROC curve. You may need to modify it if you want
to install it somewhere else. Remember to restart the terminal after installation.
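For example, once your test script has produced predicted probabilities and true labels, the ROC curve and AUC can be obtained roughly as follows (this assumes scikit-learn and matplotlib are available through environment.yml; y_true and y_score are placeholder data, to be replaced by your classifier's output):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholder data; replace with the labels and scores from your classifier.
y_true  = [1, 0, 1, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.8, 0.4]

fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = %0.3f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.savefig('roc.png')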
a. Update the lrsgd.py file. You are allowed to add extra methods, but please make
sure the existing method names and parameters remain unchanged. Use only standard modules
of Python 2.7, as we will not guarantee the availability of any third-party modules while
testing your code. [10 points]
b. Show the ROC curves generated by test.py in this written report for different combinations of learning
rate $\eta$ and regularization parameter $\mu$, and briefly explain the results. [5 points]
c. [Extra 10 points (no partial credit!)] Implement the update using the result of question 1.2.f,
and show the speed-up. Test the efficiency of your approach using the larger data set training.data,
which has 5675 possible distinct features. Save the code in a new file lrsgd_fast.py. We will
test whether your code can finish within a reasonable amount of time as well as the correctness of the trained
model. The training and testing data sets can be downloaded from:
https://s3.amazonaws.com/6250bdh/hw2/training.data
https://s3.amazonaws.com/6250bdh/hw2/testing.data
Deliverable: lr/lrsgd.py and optionally lr/lrsgd_fast.py [15 points]
2.4 Hadoop [15 points]
In this problem, you are going to train multiple logistic regression classifiers in parallel with
Hadoop, using your implementation from the previous problem. The pseudo code of the Mapper
and Reducer is listed as Algorithm 1 and Algorithm 2 respectively. We have already written
the code for the reducer for you. Find the related files in the lr folder.
input : Ratio of sampling r, number of models M, input pair (k, v)
output: key-value pairs
for i ← 1 to M do
    m ← Random();   // draw a uniform random number in [0, 1)
    if m < r then
        Emit(i, v)
    end
end
Algorithm 1: Map function
input : (k, v)
output: Trained model
Fit model on v;
Algorithm 2: Reduce function
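As a reminder of the Hadoop Streaming contract that mapper.py must follow, a streaming mapper reads raw lines from standard input and emits tab-separated key/value pairs on standard output. The minimal Python skeleton below illustrates only that contract; the sampling loop of Algorithm 1 (and the -n and -r parameters) is left for you to implement:

#!/usr/bin/env python
# Skeleton illustrating the Hadoop Streaming input/output convention only.
import sys

def main():
    for line in sys.stdin:                 # one input record per line
        value = line.strip()
        if not value:
            continue
        # TODO: for i in 1..M, draw m uniformly in [0, 1) and, if m < r,
        # emit the pair as a tab-separated line:
        #     print('%d\t%s' % (i, value))
        pass

if __name__ == '__main__':
    main()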
You need to copy the training data (the output of the Pig ETL step) into HDFS from the command line:
hdfs dfs -mkdir /hw2
hdfs dfs -put pig/training /hw2   # adjust train path as needed
a. Complete mapper.py according to the pseudo code. [5 points]
You can train an ensemble of 5 models by invoking
hadoop jar \
    /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.job.reduces=5 \
    -files lr \
    -mapper "python lr/mapper.py -n 5 -r 0.4" \
    -reducer "python lr/reducer.py -f <number of features>" \
    -input /hw2/training \
    -output /hw2/models
Notice that you can pass other parameters to the reducer. To test the performance of the ensemble, copy the trained models to the local filesystem via
hdfs dfs -get /hw2/models
b. Complete testensemble.py to generate the ROC curve. [5 points]
cat pig/testing/* | python lr/testensemble.py -m models
c. Compare the performance with that of the previous problem and briefly analyze the cause
of any difference. [5 points]
2.5 Zeppelin [10 points]
Apache Zeppelin is a web-based notebook that enables interactive data analytics (like
Jupyter). Because you can execute your code piecewise interactively, you are encouraged to
use it at the initial stage of your development for fast prototyping and initial data exploration.
Check out the course lab pages for a brief introduction on how to set it up and use it. Please
fill in the TODO sections of the JSON file, zeppelin/bdh_hw2_zeppelin.json. Import this
notebook file into Zeppelin first. For an easier start, we will first read in the dataset we have been
using in the lab, events.csv and mortality.csv. Read the provided comments in the notebook
carefully and fill in the TODO sections:
• Event count: Number of events recorded for a given patient. Note that every line in
the input file is an event. [3 points]
• For the average, maximum and minimum event counts, show the breakdown by dead
or alive. Produce a chart like the one below (please note that the values and axis labels here are
for illustrative purposes only and may differ from the actual data). [2 points]
• Encounter count: Count of unique dates on which a given patient visited the ICU.
All event types (DIAG, LAB and DRUG) should be considered ICU visiting events. [3
points]
• For the average, maximum and minimum encounter counts, show the breakdown by
dead or alive. Produce a chart like the one below (please note that the values and axis labels here
are for illustrative purposes only and may differ from the actual data). [2 points]
Fill in the indicated TODOs in the notebook using ONLY Scala.
Please be aware that you are NOT allowed to change the filename or any existing function declarations.
The submitted bdh_hw2_zeppelin.json file should have all cells run and their charts visible.
Please do not clear the cell output before submission.
Deliverable: zeppelin/bdh_hw2_zeppelin.json [10 points]
2.6 Submission [5 points]
The folder structure of your submission should be as below. You can display the folder structure
using the tree command. All other unrelated files will be discarded during testing.
<your gtid>-<your gt account>-hw2
|-- code
|   |-- hive
|   |   \-- event_statistics.hql
|   |-- lr
|   |   |-- lrsgd.py
|   |   |-- lrsgd_fast.py (Optional)
|   |   |-- mapper.py
|   |   \-- testensemble.py
|   |-- pig
|   |   |-- etl.pig
|   |   \-- utils.py
|   \-- zeppelin
|       \-- bdh_hw2_zeppelin.json
\-- homework2_answer.pdf
Please create a tar archive of the folder above with the following command and submit
the tar file.
tar -czvf <your gtid>-<your gt account>-hw2.tar.gz \
    <your gtid>-<your gt account>-hw2
Example submission: 901234567-gburdell3-hw2.tar.gz