STAT451 HW02: Practice with logistic regression and decision tree


1. Logistic regression

1a. Make a logistic regression model relating the probability that an iris has Species='virginica' to its 'Petal.Length', classifying irises as 'virginica' or not 'virginica' (i.e. 'versicolor').

  • Read http://www.stat.wisc.edu/~jgillett/451/data/iris.csv into a DataFrame.
  • Make a second data frame that excludes the 'setosa' rows (leaving the 'virginica' and 'versicolor' rows) and includes only the Petal.Length and Species columns.
  • Train the model using X = petal length and y = whether the Species is 'virginica'. (I used "y = (df['Species'] == 'virginica').to_numpy().astype(int)", which sets y to zeros and ones.)
  • Report its accuracy on the training data.
  • Report the estimated P(Species='virginica' | Petal.Length=5).
  • Report the predicted Species for Petal.Length=5.
  • Make a plot showing:
    • the data points,
    • the estimated logistic curve,
    • what I have called the "sample proportion" of y == 1 at each unique Petal.Length value, and
    • a legend, title, and other labels necessary to make the plot easy to read.
# ... your code here ...
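A minimal sketch of one possible solution, assuming the standard pandas/scikit-learn/matplotlib APIs and the column names in the CSV above; the plotting choices are mine, not prescribed by the assignment:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Read the iris data and keep only the two species and columns we need.
df = pd.read_csv('http://www.stat.wisc.edu/~jgillett/451/data/iris.csv')
df = df.loc[df['Species'] != 'setosa', ['Petal.Length', 'Species']]

X = df[['Petal.Length']].to_numpy()
y = (df['Species'] == 'virginica').to_numpy().astype(int)

model = LogisticRegression()
model.fit(X, y)

print(f'training accuracy: {model.score(X, y):.3f}')
print(f'P(virginica | Petal.Length=5) = {model.predict_proba([[5]])[0, 1]:.3f}')
label = 'virginica' if model.predict([[5]])[0] == 1 else 'versicolor'
print(f'predicted Species at Petal.Length=5: {label}')

# Plot the data, the fitted curve, and the sample proportion of
# y == 1 at each unique Petal.Length value.
x_grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X.ravel(), y, alpha=0.5, label='data')
plt.plot(x_grid, model.predict_proba(x_grid)[:, 1], label='logistic curve')
props = pd.Series(y, index=X.ravel()).groupby(level=0).mean()
plt.scatter(props.index, props.to_numpy(), marker='x',
            label='sample proportion of y == 1')
plt.xlabel('Petal.Length')
plt.ylabel('P(Species = virginica)')
plt.title('Logistic regression of virginica on petal length')
plt.legend()
plt.show()
```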

1b. Do some work with logistic regression by hand.

Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(wx + b)}}$.

Logistic regression is named after the log-odds of success, $\ln \frac{p}{1 - p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $wx + b$. (That is, start with $\ln \frac{p}{1 - p}$ and connect it in a series of equalities to $wx + b$.)

… your Latex math in a Markdown cell here …

$\ln \frac{p}{1 - p} = \ldots = \ldots = \ldots = \ldots = wx + b$
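For reference, one standard way to fill in the chain (a sketch; it substitutes the model's expression for $p$ and simplifies):

$$
\ln\frac{p}{1-p}
= \ln\frac{\frac{1}{1+e^{-(wx+b)}}}{1-\frac{1}{1+e^{-(wx+b)}}}
= \ln\frac{\frac{1}{1+e^{-(wx+b)}}}{\frac{e^{-(wx+b)}}{1+e^{-(wx+b)}}}
= \ln\frac{1}{e^{-(wx+b)}}
= \ln e^{wx+b}
= wx + b.
$$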

1c. Do some more work with logistic regression by hand.

I ran some Python/scikit-learn code to make the model pictured here: [figure not included]

From the image and without the help of running code, match each code line from the top list with its output from the bottom list.

  1. model.intercept_
  2. model.coef_
  3. model.predict(X)
  4. model.predict_proba(X)[:, 1]

A. array([0, 0, 0, 1])
B. array([0.003, 0.5, 0.5, 0.997])
C. array([5.832])
D. array([0.])

# ... Your answer here in a Markdown cell ...
# For example, "1: A, 2: B, 3: C, 4: D" is wrong but has the right format.
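If a reminder of the return types helps with the matching, here is a hypothetical one-feature example (made-up data, not the pictured model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D data, invented only to illustrate the return types.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

print(model.intercept_)              # 1-D array holding the bias b
print(model.coef_)                   # 2-D array of weights w, shape (1, 1) here
print(model.predict(X))              # hard 0/1 class labels, one per row of X
print(model.predict_proba(X)[:, 1])  # estimated P(y=1) per row, values in (0, 1)
```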

2. Decision tree

2a. Make a decision tree model on a Titanic data set.

Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.

These data are described at https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the “Data Dictionary”), which is where they are from.

  • Retain only the Survived, Pclass, Sex, and Age columns.
  • Display the first seven rows (passengers). Notice that the Age column includes NaN, indicating a missing value.
  • Drop rows with missing data via df.dropna(). Display your data frame’s shape before and after dropping rows. (It should be (714, 4) after dropping rows.)
  • Add a column called ‘Female’ that indicates whether a passenger is Female. You can make this column via df.Sex == 'female'. This gives bool values True and False, which are interpreted as 1 and 0 when used in an arithmetic context.
  • Train a decision tree with max_depth=None to decide whether a passenger Survived from the other three columns. Report its accuracy (with 3 decimal places) on the training data along with the tree's depth (which is available in clf.tree_.max_depth).
  • Train another tree with max_depth=2. Report its accuracy (with 3 decimal places). Use tree.plot_tree() to display it, including feature_names to make the tree easy to read.
# ... your code here ...
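A minimal sketch of one way through these steps, assuming the standard pandas/scikit-learn APIs (random_state is my addition, only to make any tie-breaking reproducible):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

url = 'http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv'
df = pd.read_csv(url)[['Survived', 'Pclass', 'Sex', 'Age']]
print(df.head(7))  # note the NaN values in the Age column

print('shape before dropping rows:', df.shape)
df = df.dropna()
print('shape after dropping rows:', df.shape)  # should be (714, 4)

# Bool column; True/False act as 1/0 in arithmetic contexts.
df['Female'] = (df.Sex == 'female')

X = df[['Pclass', 'Age', 'Female']]
y = df['Survived']

# Unlimited-depth tree: fits the training data very closely.
clf = DecisionTreeClassifier(max_depth=None, random_state=0)
clf.fit(X, y)
print(f'max_depth=None: accuracy = {clf.score(X, y):.3f}, '
      f'depth = {clf.tree_.max_depth}')

# Depth-2 tree: less accurate on training data, but easy to read.
clf2 = DecisionTreeClassifier(max_depth=2, random_state=0)
clf2.fit(X, y)
print(f'max_depth=2: accuracy = {clf2.score(X, y):.3f}')
tree.plot_tree(clf2, feature_names=['Pclass', 'Age', 'Female'])
plt.show()
```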

2b. Which features are used in the (max_depth=2) decision-making? Answer in a markdown cell.

# ... your English text in a Markdown cell here ...

2c. What proportion of females survived? What proportion of males survived?

Answer in two sentences via print(), with each proportion rounded to three decimal places.

Hint: There are many ways to do this. One quick way is to find the average of the Survived column within each subset (females and males).

# ... your code here ...
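A minimal sketch, reusing the df with the Female column from part 2a; the mean of the 0/1 Survived column within each sex group is that group's survival proportion:

```python
# Proportion surviving = mean of Survived within each sex subset.
p_female = df.loc[df['Female'], 'Survived'].mean()
p_male = df.loc[~df['Female'], 'Survived'].mean()
print(f'The proportion of females who survived is {p_female:.3f}.')
print(f'The proportion of males who survived is {p_male:.3f}.')
```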

2d. Do some decision tree calculations by hand.

Consider a decision tree node containing the following set $S$ of examples $(\mathbf{x}, y)$, where $\mathbf{x} = (x_1, x_2)$:

((4, 9), 1)

((2, 6), 0)

((5, 7), 0)

((3, 8), 1)

Find the entropy of $S$.

# ... your brief work and answer here in a markdown cell ...
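For reference, a sketch of the arithmetic: the labels are two 1s and two 0s, so each class has proportion $\frac{2}{4} = \frac{1}{2}$, and

$$
H(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = \frac{1}{2} + \frac{1}{2} = 1 \text{ bit.}
$$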

2e. Do some more decision tree calculations by hand.

Find a (feature, threshold) pair that yields the best split for this node.

# ... your brief work and answer here in a markdown cell ...
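A minimal sketch of how this search can be checked in code: brute force over the midpoints between sorted feature values, scoring each candidate split by its weighted child entropy (minimizing this is equivalent to maximizing information gain). The by-hand version tabulates the same candidate splits.

```python
import numpy as np

# Examples from part 2d: ((x1, x2), y).
X = np.array([[4, 9], [2, 6], [5, 7], [3, 8]])
y = np.array([1, 0, 0, 1])

def entropy(labels):
    """Shannon entropy in bits of a 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

best = None
for j in range(X.shape[1]):                   # each feature
    values = np.unique(X[:, j])
    for t in (values[:-1] + values[1:]) / 2:  # midpoints as thresholds
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        # Weighted child entropy; lower means a better split.
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or h < best[0]:
            best = (h, j + 1, t)

h, feature, threshold = best
print(f'best split: feature x{feature} at threshold {threshold} '
      f'(weighted child entropy {h:.3f})')
```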