Lab: Discriminant Functions
R-squared
This question requires some on-your-own reading. Linear regression models are often evaluated using metrics such as R-squared, otherwise known as the coefficient of determination. Read about it here.
The basic idea behind sums of squares is:
- Find the differences between one point and another static value (like the mean of a data set). Repeat for all points.
- Square each difference.
- Add all squared differences together.
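As a minimal illustration of those three steps in Python (the sample values below are made up, not from the lab data):

```python
import numpy as np

# Made-up sample of actual values
y = np.array([2.0, 4.0, 6.0, 8.0])

# 1. Find the difference between each point and a static value (here, the mean)
diffs = y - y.mean()

# 2. Square each difference
squared = diffs ** 2

# 3. Add all squared differences together
sum_of_squares = squared.sum()
print(sum_of_squares)  # 20.0
```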
For the following questions, use this formula for R-squared:

R-squared = 1 - (SS_res / SS_tot)

- SS_res is the sum of squares between the actual y values and the predicted y values.
- SS_tot is the sum of squares between the actual y values and the mean of the y values.
By dividing the two, you get an idea of how far the actual points are from your prediction line, compared to how far each point is from the average of all actual points (a flat horizontal line). It tells you how “good”, or how predictive, your model is. The lower the SS_res, the higher the resultant R-squared.
Assume that you have a model that makes the predictions found in the file cars.csv. Use whatever tool you like to answer the following questions.
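One way to answer the questions below is a short script like this sketch. It assumes cars.csv has one column of actual y values and one of predicted y values; the column names actual and predicted are placeholders, so adjust them to whatever the file actually uses:

```python
import pandas as pd

# Column names below ("actual", "predicted") are assumptions;
# adjust them to match the columns in cars.csv.
df = pd.read_csv("cars.csv")
actual = df["actual"]
predicted = df["predicted"]

# SS_res: squared differences between actual and predicted y values
ss_res = ((actual - predicted) ** 2).sum()

# SS_tot: squared differences between actual y values and their mean
ss_tot = ((actual - actual.mean()) ** 2).sum()

# R-squared
r_squared = 1 - ss_res / ss_tot
print(ss_res, ss_tot, r_squared)
```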
Question 1:
What is SS_res for the data above?
Question 2:
What is SS_tot for the data above?
Question 3:
What is the R-squared for the data above?
Logistic Regression
Let’s say you induce a logistic regression model to predict whether someone will default on a loan.
The main effect of dti (debt to income ratio), the main effect of grade, and the interaction of dti and grade were modeled to predict the likelihood of defaulting. An “interaction” is just like a regular feature, except that the interaction “weight” is assigned to the multiplication of two features. In the below case, it in effect gives a differential slope for dti to individuals in different grades. The “main effect” of dti is specified, as is the “main effect” for the different grades, along with an “interaction effect” for dti:grade.

If you are curious: In an R formula, y ~ dti * grade expands to y ~ dti + grade + dti:grade, where a colon specifies an interaction.
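The same formula syntax is accepted by Python’s statsmodels, if you want to see how such a model could be fit. This is only a sketch under assumed names; the file loans.csv and the column names are hypothetical, not the exact data behind the output below:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names: loan_status (0/1), dti (numeric),
# grade (categorical string such as "A" through "G").
loans = pd.read_csv("loans.csv")

# dti * grade expands to the main effects plus the dti:grade interaction,
# i.e. loan_status ~ dti + grade + dti:grade
model = smf.logit("loan_status ~ dti * grade", data=loans).fit()
print(model.summary())
```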
The model output provides the following summary:
- dti is a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
- loan_status is a binary feature coded as: 0 = Fully Paid and 1 = Charged Off (default).
- grade is the Lending Club-assigned loan grade: “A” is good, “G” is bad, etc.
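To make a direct prediction by hand, plug the relevant coefficients from the model output into the linear predictor and, if a probability is needed, pass the result through the logistic (sigmoid) function. The sketch below uses placeholder coefficient values only, not the ones in the output above:

```python
import math

# Placeholder coefficients -- these are NOT the values from the output above;
# substitute the intercept, main effects, and interaction from the summary.
intercept = -3.5        # hypothetical intercept
b_dti = 0.10            # hypothetical main effect of dti
b_grade_D = 1.20        # hypothetical main effect of grade D (vs. the reference grade)
b_dti_grade_D = 0.05    # hypothetical dti:gradeD interaction

dti = 0.94

# Direct (linear) prediction, i.e. the log-odds, for a grade-D borrower
y = intercept + b_dti * dti + b_grade_D + b_dti_grade_D * dti

# Convert log-odds to a probability with the logistic (sigmoid) function
p = 1 / (1 + math.exp(-y))
print(y, p)
```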
Question 4:
Refer to the model output above. What value would this model directly predict (y) for someone with a dti of .94 and a grade of D?
y = 2.057
y = 2.027
y = -1.490
y = 2.028
Question 5:
Refer to the model output above. What value would this model predict for someone with a dti of .3 and a grade of A?
y = -3.502
y = 0.014
y = 0.048
y = -3.469
Question 6:
Assume that the model above makes a direct prediction of y = .2. What would this be in terms of probability?
p = .200
p = .4
p = .550
p = .450