Description
This assignment has two parts. In the first part, you will be creating a logistic regression
model using the data set, “SpeedDating.csv.” In the second part of the assignment, you will
be constructing a one-way ANOVA model using the data set, “kudzu.xls”.
Part 1: Logistic Regression
In speed dating, participants meet many people, each for a few minutes, and then decide
who they would like to see again. The data set you will be working with contains information on speed dating experiments conducted on graduate and professional students. Each
person in the experiment met with 10-20 randomly selected people of the opposite sex (only
heterosexual pairings) for four minutes. After each speed date, each participant filled out a
questionnaire about the other person.
Your goal is to build a model to predict which pairs of daters want to meet each other again
(i.e., have a second date). The list of variables is:
We will be using a reduced version of this experimental data with 276 unique male-female
date pairs. In the file “SpeedDating.csv”, each variable name ends in either “M” for male or “F” for
female. For example, “LikeM” refers to the “Like” variable as answered by the male participant (about the female participant). Treat the rating scale variables (such as “PartnerYes”,
“Attractive”, etc.) as numerical variables instead of categorical ones for your analysis.
1. Based on the variable “Decision”, fill out the contingency table below. What percentage
of dates ended with both people wanting a second date?
                                 Decision made by female
                                 No          Yes
Decision made by male     No
                          Yes
2. A second date is planned only if both people within the matched pair want to see each
other again. Make a new column in your data set and call it “second.date”. Values in
this column should be 0 if there will be no second date, 1 if there will be a second date.
Construct a scatterplot for each numerical variable where the male values are on the
x-axis and the female values are on the y-axis. Observations in your scatterplot should
have a different color (or pch value) based on whether or not there will be a second date.
Describe what you see. (Note: Jitter your points just for making these plots.)
3. Many of the numerical variables are on rating scales from 1 to 10. Are the responses
within these ranges? If not, what should we do with these responses? Is there any missing
data? If so, how many observations and for which variables?
4. What are the possible race categories in your data set? Is there any missing data? If so,
how many observations and what should you do with them? Make a mosaic plot with
female and male race. Describe what you see.
5. Use logistic regression to construct a model for “second.date” (i.e., “second.date” should
be your response variable). Incorporate the discoveries and decisions you made in questions 2, 3, and 4. Explain the steps you used to determine the best model, include the
summary output for your final model only, check your model assumptions, and evaluate your model by running the relevant hypothesis tests. Do not use “Decision” as an
explanatory variable.
6. Redo question (1) using only the observations used to fit your final logistic regression
model. What is your sample size? Does the number of explanatory variables in your
model follow our rule of thumb? Justify your answer.
7. Interpret the slopes in your model. Which explanatory variables increase the probability
of a second date? Which ones decrease it? Is this what you expected to find? Justify.
8. Construct an ROC curve and compute the AUC. Determine the best threshold for classifying observations (i.e., second date or no second date) based on the ROC curve. Justify
your choice of threshold. For your chosen threshold, compute (a) accuracy, (b) sensitivity, and (c) specificity.
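As a rough illustration of the bookkeeping behind questions 1, 2, and 8, here is a minimal Python sketch using only the standard library. It is not tied to any particular software for the course. The records and fitted probabilities are made up, the column names “DecisionM” and “DecisionF” and their 1/0 coding are assumptions based on the naming convention described above, and Youden's J (sensitivity + specificity - 1) is only one possible criterion for choosing a threshold.

```python
from collections import Counter

# Made-up records for illustration only; NOT taken from SpeedDating.csv.
# Assumption: decisions are stored as DecisionM / DecisionF, coded 1 = yes, 0 = no.
pairs = [
    {"DecisionM": 1, "DecisionF": 1},
    {"DecisionM": 1, "DecisionF": 0},
    {"DecisionM": 0, "DecisionF": 1},
    {"DecisionM": 0, "DecisionF": 0},
    {"DecisionM": 1, "DecisionF": 1},
    {"DecisionM": 0, "DecisionF": 0},
]

# Question 1: counts for the 2x2 table, keyed by (male decision, female decision).
table = Counter((p["DecisionM"], p["DecisionF"]) for p in pairs)

# Question 2: second.date is 1 only when both people said yes.
second_date = [int(p["DecisionM"] == 1 and p["DecisionF"] == 1) for p in pairs]
pct_both_yes = 100 * sum(second_date) / len(second_date)

# Question 8: made-up fitted probabilities standing in for logistic regression output.
probs = [0.9, 0.4, 0.35, 0.1, 0.6, 0.7]


def classify_metrics(y_true, y_prob, threshold):
    """Return (accuracy, sensitivity, specificity); assumes both classes are present."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, d in zip(y_true, preds) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(y_true, preds) if y == 0 and d == 0)
    fp = sum(1 for y, d in zip(y_true, preds) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(y_true, preds) if y == 1 and d == 0)
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def youden_j(threshold):
    """Youden's J at a threshold: sensitivity + specificity - 1."""
    _, sens, spec = classify_metrics(second_date, probs, threshold)
    return sens + spec - 1


# Try each observed probability as a cutoff and keep the one with the largest J.
best_threshold = max(sorted(set(probs)), key=youden_j)
accuracy, sensitivity, specificity = classify_metrics(second_date, probs, best_threshold)

print(dict(table), round(pct_both_yes, 1))
print(auc(second_date, probs), best_threshold, accuracy, sensitivity, specificity)
```

Other threshold criteria (for example, the ROC point closest to the top-left corner) can give a different answer, so whichever one you use should be stated and justified.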
Part 2: One-Way ANOVA
Kudzu is a plant that was imported to the United States from Japan and now covers
over seven million acres in the South. The plant contains chemicals called isoflavones
that have been shown to have beneficial effects on bones. One study used three groups
of rats to compare a control group with rats that were fed either a low dose or a high
dose of isoflavones from kudzu. One of the outcomes examined was bone mineral density
in the femur (in grams per square centimeter). Rats were randomly assigned to one of
the three groups. The data can be found in “kudzu.jmp.”
9. Identify the response variable.
10. Identify the factors (and levels) in the experiment.
11. How many treatments are included in the experiment?
12. What type of experimental design is employed?
13. Compute the mean, standard deviation, and sample size for each treatment group and
put the results into a table. Remember to include the units of measurement.
14. Construct side-by-side box plots with connected means. Describe what you see.
15. Are the one-way ANOVA model assumptions satisfied? Justify your answer.
16. Run a one-way ANOVA model and discuss your results. (Let α = 0.01; remember to
include your hypotheses, and identify the test statistic, degrees of freedom, and p-value.)
17. Use Tukey’s multiple-comparisons method to compare the three groups (include the
visual results for the Tukey method). Which groups (if any) have significantly different
means?
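For reference, the arithmetic behind the group summaries in question 13 and the F statistic in question 16 can be sketched in a few lines of Python using only the standard library. The bone density values below are made up and are not the kudzu data, and the group labels are placeholders. The sketch stops at the F statistic and its degrees of freedom; the p-value and the Tukey comparisons require the F and studentized range distributions, which a statistics package would supply.

```python
from statistics import mean, stdev

# Made-up bone mineral density values (g/cm^2) for illustration; NOT the kudzu data.
groups = {
    "control": [0.20, 0.22, 0.24],
    "low dose": [0.21, 0.22, 0.23],
    "high dose": [0.24, 0.26, 0.28],
}

# Question 13: mean, sample standard deviation, and sample size for each group.
summary = {name: (mean(v), stdev(v), len(v)) for name, v in groups.items()}

all_values = [x for v in groups.values() for x in v]
grand_mean = mean(all_values)

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-group sum of squares: squared distance of each value from its own group mean.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1                # number of groups minus 1
df_within = len(all_values) - len(groups)   # total sample size minus number of groups

# Question 16: F is the ratio of the between-group and within-group mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

for name, (m, s, n) in summary.items():
    print(f"{name}: mean={m:.3f} g/cm^2, sd={s:.3f} g/cm^2, n={n}")
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```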