STAT451 HW02: Practice with logistic regression and decision tree


1. Logistic regression

1a. Make a logistic regression model relating the probability that an iris has Species='virginica' to its 'Petal.Length', classifying irises as 'virginica' or not 'virginica' (i.e. 'versicolor').

  • Read http://www.stat.wisc.edu/~jgillett/451/data/iris.csv into a DataFrame.
  • Make a second data frame that excludes the 'setosa' rows (leaving the 'virginica' and 'versicolor' rows) and includes only the Petal.Length and Species columns.
  • Train the model using X = petal length and y = whether the Species is 'virginica'. (I used "y = (df['Species'] == 'virginica').to_numpy().astype(int)", which sets y to zeros and ones.)
  • Report its accuracy on the training data.
  • Report the estimated P(Species='virginica' | Petal.Length=5).
  • Report the predicted Species for Petal.Length=5.
  • Make a plot showing:
    • the data points,
    • the estimated logistic curve,
    • what I have called the "sample proportion" of y == 1 at each unique Petal.Length value, and
    • a legend, title, and other labels necessary to make the plot easy to read.
# ... your code here ...
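A minimal sketch of one possible solution, assuming the standard pandas/scikit-learn/matplotlib APIs and the column names in the CSV above; the plotting choices are mine, not prescribed by the assignment:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Read the iris data and keep only the two species and columns we need.
df = pd.read_csv('http://www.stat.wisc.edu/~jgillett/451/data/iris.csv')
df = df.loc[df['Species'] != 'setosa', ['Petal.Length', 'Species']]

X = df[['Petal.Length']].to_numpy()
y = (df['Species'] == 'virginica').to_numpy().astype(int)

model = LogisticRegression()
model.fit(X, y)

print(f'training accuracy: {model.score(X, y):.3f}')
print(f'P(virginica | Petal.Length=5) = {model.predict_proba([[5]])[0, 1]:.3f}')
label = 'virginica' if model.predict([[5]])[0] == 1 else 'versicolor'
print(f'predicted Species at Petal.Length=5: {label}')

# Plot the data, the fitted curve, and the sample proportion of
# y == 1 at each unique Petal.Length value.
x_grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X.ravel(), y, alpha=0.5, label='data')
plt.plot(x_grid, model.predict_proba(x_grid)[:, 1], label='logistic curve')
props = pd.Series(y, index=X.ravel()).groupby(level=0).mean()
plt.scatter(props.index, props.to_numpy(), marker='x',
            label='sample proportion of y == 1')
plt.xlabel('Petal.Length')
plt.ylabel('P(Species = virginica)')
plt.title('Logistic regression of virginica on petal length')
plt.legend()
plt.show()
```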

1b. Do some work with logistic regression by hand.

Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(wx + b)}}$.

Logistic regression is named after the log-odds of success, $\ln \frac{p}{1 - p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $wx + b$. (That is, start with $\ln \frac{p}{1 - p}$ and connect it in a series of equalities to $wx + b$.)

… your Latex math in a Markdown cell here …

$\ln \frac{p}{1 - p} = \ldots = \ldots = \ldots = \ldots = wx + b$
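For reference, one standard way to fill in the chain (a sketch; it substitutes the model's expression for $p$ and simplifies):

$$
\ln\frac{p}{1-p}
= \ln\frac{\frac{1}{1+e^{-(wx+b)}}}{1-\frac{1}{1+e^{-(wx+b)}}}
= \ln\frac{\frac{1}{1+e^{-(wx+b)}}}{\frac{e^{-(wx+b)}}{1+e^{-(wx+b)}}}
= \ln\frac{1}{e^{-(wx+b)}}
= \ln e^{wx+b}
= wx + b.
$$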

1c. Do some more work with logistic regression by hand.

I ran some Python/scikit-learn code to make the model pictured here: [figure not included]

From the image and without the help of running code, match each code line from the top list with its output from the bottom list.

  1. model.intercept_
  2. model.coef_
  3. model.predict(X)
  4. model.predict_proba(X)[:, 1]

A. array([0, 0, 0, 1])
B. array([0.003, 0.5, 0.5, 0.997])
C. array([5.832])
D. array([0.])

# ... Your answer here in a Markdown cell ...
# For example, "1: A, 2: B, 3: C, 4: D" is wrong but has the right format.
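If a reminder of the return types helps with the matching, here is a hypothetical one-feature example (made-up data, not the pictured model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D data, invented only to illustrate the return types.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

print(model.intercept_)              # 1-D array holding the bias b
print(model.coef_)                   # 2-D array of weights w, shape (1, 1) here
print(model.predict(X))              # hard 0/1 class labels, one per row of X
print(model.predict_proba(X)[:, 1])  # estimated P(y=1) per row, values in (0, 1)
```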

2. Decision tree

2a. Make a decision tree model on a Titanic data set.

Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.

These data are described at https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the “Data Dictionary”), which is where they are from.

  • Retain only the Survived, Pclass, Sex, and Age columns.
  • Display the first seven rows (passengers). Notice that the Age column includes NaN, indicating a missing value.
  • Drop rows with missing data via df.dropna(). Display your data frame’s shape before and after dropping rows. (It should be (714, 4) after dropping rows.)
  • Add a column called ‘Female’ that indicates whether a passenger is Female. You can make this column via df.Sex == 'female'. This gives bool values True and False, which are interpreted as 1 and 0 when used in an arithmetic context.
  • Train a decision tree with max_depth=None to decide whether a passenger Survived from the other three columns. Report its accuracy (with 3 decimal places) on the training data along with the tree's depth (which is available in clf.tree_.max_depth).
  • Train another tree with max_depth=2. Report its accuracy (with 3 decimal places). Use tree.plot_tree() to display it, including feature_names to make the tree easy to read.
# ... your code here ...
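A minimal sketch of one way through these steps, assuming the standard pandas/scikit-learn APIs (random_state is my addition, only to make any tie-breaking reproducible):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

url = 'http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv'
df = pd.read_csv(url)[['Survived', 'Pclass', 'Sex', 'Age']]
print(df.head(7))  # note the NaN values in the Age column

print('shape before dropping rows:', df.shape)
df = df.dropna()
print('shape after dropping rows:', df.shape)  # should be (714, 4)

# Bool column; True/False act as 1/0 in arithmetic contexts.
df['Female'] = (df.Sex == 'female')

X = df[['Pclass', 'Age', 'Female']]
y = df['Survived']

# Unlimited-depth tree: fits the training data very closely.
clf = DecisionTreeClassifier(max_depth=None, random_state=0)
clf.fit(X, y)
print(f'max_depth=None: accuracy = {clf.score(X, y):.3f}, '
      f'depth = {clf.tree_.max_depth}')

# Depth-2 tree: less accurate on training data, but easy to read.
clf2 = DecisionTreeClassifier(max_depth=2, random_state=0)
clf2.fit(X, y)
print(f'max_depth=2: accuracy = {clf2.score(X, y):.3f}')
tree.plot_tree(clf2, feature_names=['Pclass', 'Age', 'Female'])
plt.show()
```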

2b. Which features are used in the (max_depth=2) decision-making? Answer in a markdown cell.

# ... your English text in a Markdown cell here ...

2c. What proportion of females survived? What proportion of males survived?

Answer in two sentences via print(), with each proportion rounded to three decimal places.

Hint: There are many ways to do this. One quick way is to find the average of the Survived column within each subset (females and males).

# ... your code here ...
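A minimal sketch, reusing the df with the Female column from part 2a; the mean of the 0/1 Survived column within each sex group is that group's survival proportion:

```python
# Proportion surviving = mean of Survived within each sex subset.
p_female = df.loc[df['Female'], 'Survived'].mean()
p_male = df.loc[~df['Female'], 'Survived'].mean()
print(f'The proportion of females who survived is {p_female:.3f}.')
print(f'The proportion of males who survived is {p_male:.3f}.')
```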

2d. Do some decision tree calculations by hand.

Consider a decision tree node containing the following set $S$ of examples $(\mathbf{x}, y)$, where $\mathbf{x} = (x_1, x_2)$:

((4, 9), 1)

((2, 6), 0)

((5, 7), 0)

((3, 8), 1)

Find the entropy of $S$.

# ... your brief work and answer here in a markdown cell ...
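For reference, a sketch of the arithmetic: the labels are two 1s and two 0s, so each class has proportion $\frac{2}{4} = \frac{1}{2}$, and

$$
H(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = \frac{1}{2} + \frac{1}{2} = 1 \text{ bit.}
$$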

2e. Do some more decision tree calculations by hand.

Find a (feature, threshold) pair that yields the best split for this node.

# ... your brief work and answer here in a markdown cell ...
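A minimal sketch of how this search can be checked in code: brute force over the midpoints between sorted feature values, scoring each candidate split by its weighted child entropy (minimizing this is equivalent to maximizing information gain). The by-hand version tabulates the same candidate splits.

```python
import numpy as np

# Examples from part 2d: ((x1, x2), y).
X = np.array([[4, 9], [2, 6], [5, 7], [3, 8]])
y = np.array([1, 0, 0, 1])

def entropy(labels):
    """Shannon entropy in bits of a 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

best = None
for j in range(X.shape[1]):                   # each feature
    values = np.unique(X[:, j])
    for t in (values[:-1] + values[1:]) / 2:  # midpoints as thresholds
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        # Weighted child entropy; lower means a better split.
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or h < best[0]:
            best = (h, j + 1, t)

h, feature, threshold = best
print(f'best split: feature x{feature} at threshold {threshold} '
      f'(weighted child entropy {h:.3f})')
```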