## Description

In many data science applications, you want to identify patterns, labels or classes based on available data. In this assignment we will focus on discovering patterns in the past behavior of your stock.

To each trading day $i$ you will assign a "trading" label "+" or "−", depending on whether the daily return for that day satisfies $r_i \ge 0$ or $r_i < 0$. We will call these "true" labels and we compute them for all days in all 5 years.

We will use years 1, 2 and 3 as training years and years 4 and 5 as testing years. For each day in years 4 and 5 we will predict a label based on patterns that we observe in the training years. We will call these "predicted" labels. Since we know the "true" labels for years 4 and 5 and we compute "predicted" labels for the same years, we can analyze how good our predictions are for all labels, for "+" labels only and for "−" labels only.

BU MET CS-677: Data Science With Python, v.2.0 CS-677: Predicting Daily Trading Labels

Question 1: You have a csv table of daily returns for your stock and for S&P-500 ("spy" ticker).

1. Read each file into a pandas dataframe and add a column "True Label". In that column, for each day (row) $i$ with daily return $r_i \ge 0$ you assign a "+" label ("up day"). For each day $i$ with daily return $r_i < 0$ you assign a "−" label ("down day"). You do this for every day in all 5 years for both tickers.

For example, if your initial dataframe were

| Date | · · · | Return |
|---|---|---|
| 1/2/2015 | · · · | 0.015 |
| 1/3/2015 | · · · | -0.01 |
| 1/6/2015 | · · · | 0.02 |
| · · · | · · · | · · · |
| · · · | · · · | · · · |
| 12/30/2019 | · · · | 0 |
| 12/31/2019 | · · · | -0.03 |

Table 1: Initial data

you will add an additional column "True Label" and have data as shown in Table 2.

| Date | · · · | Return | True Label |
|---|---|---|---|
| 1/2/2015 | · · · | 0.015 | + |
| 1/3/2015 | · · · | -0.01 | − |
| 1/6/2015 | · · · | 0.02 | + |
| · · · | · · · | · · · | · · · |
| · · · | · · · | · · · | · · · |
| 12/30/2019 | · · · | 0 | + |
| 12/31/2019 | · · · | -0.03 | − |

Table 2: Adding True Labels

Your daily "true labels" sequence is +, −, +, · · ·, +, −.

2. Take years 1, 2 and 3. Let $L$ be the set of trading days in these years. Assuming 250 trading days per year, $L$ will contain about 750 days. Let $L^-$ be all trading days with "−" labels and let $L^+$ be all trading days with "+" labels. Assuming that all days are independent of each other and that the ratio of "up" and "down" days remains the same in the future, compute the default probability $p^*$ that the next day is an "up" day.

3. Take years 1, 2 and 3. What is the probability that after seeing $k$ consecutive "down days", the next day is an "up day"? For example, if $k = 3$, what is the probability of seeing "−, −, −, +" as opposed to seeing "−, −, −, −"? Compute this for $k = 1, 2, 3$.

4. Take years 1, 2 and 3. What is the probability that after seeing $k$ consecutive "up days", the next day is still an "up day"? For example, if $k = 3$, what is the probability of seeing "+, +, +, +" as opposed to seeing "+, +, +, −"? Compute this for $k = 1, 2, 3$.
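The steps in Question 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names and the sample returns are made up, the code writes the "−" label as an ASCII `"-"`, and it counts every window of `k` identical labels (including windows inside a longer run), which is one possible reading of "k consecutive days".

```python
def true_labels(returns):
    """Label each daily return: '+' if r >= 0, otherwise '-'."""
    return ["+" if r >= 0 else "-" for r in returns]


def default_probability(labels):
    """Fraction of '+' days, used as the default probability p* of an up day."""
    return labels.count("+") / len(labels)


def prob_up_after_run(labels, k, run_label):
    """Estimate P(next day is '+' | the previous k days all equal run_label).

    Every window of k consecutive run_label days that is followed by
    another day is counted. Returns None if no such window exists.
    """
    followed_by_up = 0
    total = 0
    for i in range(len(labels) - k):
        if all(lab == run_label for lab in labels[i:i + k]):
            total += 1
            if labels[i + k] == "+":
                followed_by_up += 1
    return followed_by_up / total if total else None


# Made-up returns for illustration only (not real market data).
returns = [0.015, -0.01, 0.02, -0.005, -0.02, 0.0, 0.01, -0.03]
labels = true_labels(returns)        # ['+', '-', '+', '-', '-', '+', '+', '-']
p_star = default_probability(labels) # 0.5

up_after_1_down = prob_up_after_run(labels, 1, "-")  # 2 of 3 windows
up_after_2_up = prob_up_after_run(labels, 2, "+")    # 0 of 1 window
```

On real data, `labels` would be restricted to the training years before calling the two probability functions.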


Predicting labels: We will now describe a procedure to predict labels for each day in years 4 and 5 from "true" labels in training years 1, 2 and 3.

For each day $d$ in years 4 and 5, we look at the pattern of the last $W$ true labels (including this day $d$). By looking at the frequency of this pattern and the true label for the next day in the training set, we will predict the label for day $d + 1$. Here $W$ is a hyperparameter that we will choose based on our prediction accuracy.

Suppose $W = 3$. You look at a particular day $d$ and suppose that the sequence of the last $W$ labels is $s$ = "−, +, −". We want to predict the label for the next day $d + 1$. To do this, we count the number of sequences of length $W + 1$ in the training set where the first $W$ labels coincide with $s$. In other words, we count the number $N^-(s)$ of sequences "$s$, −" and the number $N^+(s)$ of sequences "$s$, +". If $N^+(s) \ge N^-(s)$ then the next day is assigned "+". If $N^+(s) < N^-(s)$ then the next day is assigned "−". In the unlikely event that $N^+(s) = N^-(s) = 0$ we will assign a label based on the default probability $p^*$ that we computed in the previous question.
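A minimal sketch of the counting rule for a single pattern might look like this. The helper names and the training sequence are hypothetical, "−" is written as ASCII `"-"`, and the fallback for an unseen pattern (predict "+" when $p^* \ge 0.5$) is an assumption, since the text only says the label is "based on" $p^*$.

```python
from collections import Counter


def count_patterns(train_labels, w):
    """Count every run of w + 1 consecutive labels in the training sequence."""
    counts = Counter()
    for i in range(len(train_labels) - w):
        counts[tuple(train_labels[i:i + w + 1])] += 1
    return counts


def predict_next(s, counts, p_star):
    """Predict the label that follows the length-W pattern s."""
    n_plus = counts[tuple(s) + ("+",)]    # N+(s): occurrences of "s, +"
    n_minus = counts[tuple(s) + ("-",)]   # N-(s): occurrences of "s, -"
    if n_plus == 0 and n_minus == 0:
        # Pattern never seen in training. Assumed rule: use p* as a threshold.
        return "+" if p_star >= 0.5 else "-"
    return "+" if n_plus >= n_minus else "-"


# Made-up training labels for illustration only.
train = list("-+--+-+-+-")
counts = count_patterns(train, w=3)

s = ("-", "+", "-")
n_plus = counts[s + ("+",)]    # 2
n_minus = counts[s + ("-",)]   # 1
label = predict_next(s, counts, p_star=0.4)                   # '+'
unseen = predict_next(("+", "+", "+"), counts, p_star=0.4)    # '-'
```

A `Counter` returns 0 for patterns that never occur, so no special handling is needed when looking up $N^+(s)$ and $N^-(s)$.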

Question 2:

1. For $W = 2, 3, 4$, compute predicted labels for each day in years 4 and 5 based on true labels in years 1, 2 and 3 only. Perform this for your ticker and for "spy".

2. For each $W = 2, 3, 4$, compute the accuracy: what percentage of true labels (both positive and negative) have you predicted correctly for the last two years?

3. Which $W^*$ value gave you the highest accuracy for your stock and which $W^*$ value gave you the highest accuracy for S&P-500?
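One way to apply the procedure to the whole test period and measure accuracy is sketched below. The names are hypothetical and the labels are made up. Two assumptions are worth flagging: the first test days use the last $W$ labels of year 3 as their pattern, and an unseen pattern is resolved by predicting "+" when $p^* \ge 0.5$.

```python
from collections import Counter


def predict_test_labels(all_labels, n_train, w, p_star):
    """Predict a label for every test day from the W true labels before it.

    all_labels holds the true labels of the training days followed by the
    test days; the first n_train entries form the training period.
    Patterns are counted in the training period only.
    """
    train = all_labels[:n_train]
    counts = Counter(tuple(train[i:i + w + 1]) for i in range(n_train - w))
    default = "+" if p_star >= 0.5 else "-"  # assumed rule for unseen patterns
    predictions = []
    for d in range(n_train, len(all_labels)):
        s = tuple(all_labels[d - w:d])  # the W true labels preceding day d
        n_plus = counts[s + ("+",)]
        n_minus = counts[s + ("-",)]
        if n_plus == 0 and n_minus == 0:
            predictions.append(default)
        else:
            predictions.append("+" if n_plus >= n_minus else "-")
    return predictions


def accuracy(true, predicted):
    """Fraction of days on which the predicted label equals the true label."""
    return sum(t == p for t, p in zip(true, predicted)) / len(true)


# Made-up labels: 10 "training" days followed by 4 "test" days.
all_labels = list("-+--+-+-+-") + list("+--+")
n_train = 10

predicted = predict_test_labels(all_labels, n_train, w=2, p_star=0.4)
acc = accuracy(all_labels[n_train:], predicted)
# predicted == ['+', '-', '+', '+'] and acc == 0.75
```

Running this for each $W$ and comparing the resulting accuracies gives the candidate for $W^*$.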

Question 3: One of the most powerful methods to (potentially) improve predictions is to combine predictions by some "averaging". This is called ensemble learning. Let us consider the following procedure: for every day $d$, you have 3 predicted labels: for $W = 2$, $W = 3$ and $W = 4$. Let us compute an "ensemble" label for day $d$ by taking the majority of your labels for that day. For example, if your predicted labels were "−", "−" and "+", then we would take "−" as the ensemble label for day $d$ (the majority of the three labels is "−"). If, on the other hand, your predicted labels were "−", "+" and "+", then we would take "+" as the ensemble label for day $d$ (the majority of the predicted labels is "+"). Compute such ensemble labels and answer the following:

1. Compute ensemble labels for years 4 and 5 for both your stock and S&P-500.

2. For both S&P-500 and your ticker, what percentage of labels in years 4 and 5 do you predict correctly by using the ensemble?

3. Did you improve your accuracy on predicting "−" labels by using the ensemble compared to $W = 2, 3, 4$?

4. Did you improve your accuracy on predicting "+" labels by using the ensemble compared to $W = 2, 3, 4$?
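The majority vote can be sketched in a few lines. The function name and the sample predictions are illustrative, and "−" is written as ASCII `"-"`. With three voters and two possible labels there is never a tie.

```python
def ensemble_labels(pred_w2, pred_w3, pred_w4):
    """Day-by-day majority vote over three predicted label sequences."""
    ensemble = []
    for votes in zip(pred_w2, pred_w3, pred_w4):
        # votes is a 3-tuple; '+' wins when at least two of the three agree.
        ensemble.append("+" if votes.count("+") >= 2 else "-")
    return ensemble


# The two examples from the text: ("-", "-", "+") and ("-", "+", "+").
result = ensemble_labels(["-", "-"], ["-", "+"], ["+", "+"])
# result == ['-', '+']
```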

Question 4: For $W = 2, 3, 4$ and the ensemble, compute the following statistics (both for your ticker and "spy") based on years 4 and 5:

1. TP: true positives (your predicted label is "+" and the true label is "+").

2. FP: false positives (your predicted label is "+" but the true label is "−").

3. TN: true negatives (your predicted label is "−" and the true label is "−").

4. FN: false negatives (your predicted label is "−" but the true label is "+").

5. TPR = TP/(TP + FN): true positive rate. This is the fraction of positive labels that you predicted correctly. This is also called sensitivity, recall or hit rate.

6. TNR = TN/(TN + FP): true negative rate. This is the fraction of negative labels that you predicted correctly. This is also called specificity or selectivity.


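7. Summarize your findings in a table as shown below:

| W | ticker | TP | FP | TN | FN | accuracy | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| 2 | S&P-500 | | | | | | | |
| 3 | S&P-500 | | | | | | | |
| 4 | S&P-500 | | | | | | | |
| ensemble | S&P-500 | | | | | | | |
| 2 | your stock | | | | | | | |
| 3 | your stock | | | | | | | |
| 4 | your stock | | | | | | | |
| ensemble | your stock | | | | | | | |

Table 3: Prediction Results for W = 2, 3, 4 and ensemble

8. Discuss your findings.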
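The statistics in Question 4 follow directly from comparing the two label sequences. Below is a small sketch with a hypothetical function name and made-up labels ("−" written as ASCII `"-"`). Returning `None` for a rate whose denominator is zero is a choice made here, not something the assignment specifies.

```python
def confusion_stats(true, predicted):
    """Return TP, FP, TN, FN, accuracy, TPR and TNR for '+'/'-' labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(true, predicted):
        if p == "+" and t == "+":
            tp += 1
        elif p == "+" and t == "-":
            fp += 1
        elif p == "-" and t == "-":
            tn += 1
        else:  # p == "-" and t == "+"
            fn += 1
    total = tp + fp + tn + fn
    return {
        "TP": tp,
        "FP": fp,
        "TN": tn,
        "FN": fn,
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn) if tp + fn else None,  # share of '+' days caught
        "TNR": tn / (tn + fp) if tn + fp else None,  # share of '-' days caught
    }


# Made-up labels for illustration only.
true = ["+", "+", "-", "-", "+", "-"]
pred = ["+", "-", "-", "+", "+", "-"]
stats = confusion_stats(true, pred)
# TP = 2, FP = 1, TN = 2, FN = 1
```

Each row of Table 3 corresponds to one call of such a function, once per value of $W$ (or the ensemble) and per ticker.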

Question 5: At the beginning of year 4 you start with 100 dollars and trade for 2 years based on predicted labels.

1. Take your stock. Plot the growth of your amount for 2 years if you trade based on the best $W^*$ and on the ensemble. On the same graph, plot the growth of your portfolio for the "buy-and-hold" strategy.

2. Examine your chart. Are there any patterns (e.g. any differences between year 4 and year 5)?
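The text does not spell out the trading rule, so the sketch below assumes one plausible reading: on a day predicted "+" the whole amount is invested and earns that day's return, and on a day predicted "−" (ASCII `"-"` in the code) it stays in cash. The function name and the sample returns are made up. Buy-and-hold is the special case in which every day is invested.

```python
def grow(returns, predicted=None, start=100.0):
    """Daily portfolio values for the test period.

    With predicted=None the money is always invested (buy-and-hold).
    Otherwise the day's return is earned only on days predicted '+'
    (an assumed trading rule, see the note above).
    """
    values = []
    amount = start
    for i, r in enumerate(returns):
        if predicted is None or predicted[i] == "+":
            amount *= 1 + r
        values.append(amount)
    return values


# Made-up returns and predictions for illustration only.
returns = [0.10, -0.50, 0.20]
predicted = ["+", "-", "+"]

strategy_values = grow(returns, predicted)  # skips the -50% day
buy_and_hold_values = grow(returns)         # takes every day's return
```

The resulting lists can be passed to any plotting library, one line per strategy on the same axes, to produce the chart requested above.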
