Generalized Linear Models (glm)
Table of Contents
- Fit Logistic Regression Models
- Perform the K-fold cross-validation for glm
- Make predictions with Logistic Regression Models
- Calculate the confidence interval for mean response of glm
- Perform Likelihood Ratio Test
- Deal with glm.fit: algorithm did not converge
- Use the glm model as a classifier
- Calculate Sensitivity and Specificity
- Include all 2-way interactions in a glm model in R
Fit Logistic Regression Models howto
# Simulate a binary response from a logistic model
n = 100
x = rnorm(n)
eta = x
p = 1 / (1 + exp(-eta))
y = rbinom(n, size = 1, prob = p)
# family = binomial gives logistic regression (logit link by default)
glm(y ~ x, family = binomial)
Call: glm(formula = y ~ x, family = binomial)
Coefficients:
(Intercept) x
0.2585 0.8823
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138
Residual Deviance: 119.8 AIC: 123.8
- For binomial and quasibinomial families, the response can also be specified as a factor, where the first level denotes failure (0) and all other levels denote success (1); see the sketch after this list.
- Interpret the coefficients as changes on the link scale (log-odds for the logit link), not as changes in the response itself.
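A minimal sketch of the factor-response form, reusing the simulated x and y from above; the level names no/yes are chosen here for illustration:
# The first level ("no") is treated as failure, the second ("yes") as success
y_factor = factor(ifelse(y == 1, "yes", "no"), levels = c("no", "yes"))
fit_factor = glm(y_factor ~ x, family = binomial)
# Coefficients are on the log-odds scale; exp() converts them to odds ratios
exp(coef(fit_factor))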
Perform the K-fold cross-validation for glm howto
library(boot)
v = cv.glm(data, model, K = 5)
# raw cross-validation estimate of prediction error
v$delta[1]
cv.glm also has a parameter named cost, a function that accepts the actual and predicted values and returns a non-negative scalar cost. By default the cost is the average squared error.
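For a binary response, a misclassification-rate cost can be supplied instead of the default squared-error cost; a sketch, assuming the data and model objects from the snippet above:
# First argument: observed responses; second: cross-validated fitted probabilities
cost_misclass = function(actual, predicted) mean(abs(actual - predicted) > 0.5)
v = cv.glm(data, model, cost = cost_misclass, K = 5)
v$delta[1]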
When data contains a factor variable, a randomly sampled fold may not cover all the factor levels, and cv.glm fails with an error. You can simply retry until it succeeds:
crossvalidation_kfold = function(model, data, K = 5) {
  ret = NULL
  attempt = 0
  MAX_ATTEMPT = 100
  while (is.null(ret) && attempt < MAX_ATTEMPT) {
    attempt = attempt + 1
    try({
      # NOTE: cv.glm() can fail because some split might not contain
      # all the levels of factor variables.
      # So, just retry until it succeeds.
      # SEE: https://stackoverflow.com/questions/19946930/r-cross-validation-on-a-dataset-with-factors
      ret = cv.glm(data, model, K = K)
    }, silent = TRUE)
  }
  stopifnot(!is.null(ret))
  ret$delta[1]
}
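Hypothetical usage, assuming a data frame dat with a factor predictor x_factor and a binary response y (all names are placeholders):
# dat and its columns are hypothetical placeholders
fit = glm(y ~ x_factor, data = dat, family = binomial)
crossvalidation_kfold(fit, dat, K = 5)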
Make predictions with Logistic Regression Models howto
# type = "link" to get the predicted eta value
predict(fit_glm, newdata = somedata, type = "link")
# type = "response" to get the predicted probability of y = 1
predict(fit_glm, newdata = somedata, type = "response")
Calculate the confidence interval for mean response of glm howto
# se.fit = TRUE to get the standard error
eta_hat = predict(model, newdata, se.fit = TRUE, type = "link")
# level = 0.95
critical_value = qnorm(0.975)
margin = critical_value * eta_hat$se.fit
# Calculate the link interval
lower_eta = eta_hat$fit - margin
upper_eta = eta_hat$fit + margin
# Transform the endpoints back to the response scale; this is valid because
# the inverse link (the logistic function here) is monotone increasing
p = function(eta) 1 / (1 + exp(-eta))
c(p(lower_eta), p(upper_eta))
Perform Likelihood Ratio Test howto
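A minimal sketch, assuming two nested models fitted on the same data (the formulas below are illustrative, reusing x and y from the first section); anova() with test = "LRT" compares them with a chi-squared likelihood ratio test:
fit_reduced = glm(y ~ x, family = binomial)
fit_full = glm(y ~ x + I(x^2), family = binomial)
# test = "LRT" is equivalent to test = "Chisq"
anova(fit_reduced, fit_full, test = "LRT")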
Deal with glm.fit: algorithm did not converge howto
- Increase the maxit control argument when fitting; the default is 25. For example, see the sketch below.
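A sketch, reusing the y ~ x model from above, that raises the IRLS iteration limit via glm.control:
glm(y ~ x, family = binomial, control = glm.control(maxit = 100))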
Use the glm model as a classifier howto
# Assume the actual response y has the labels "no" (failure) and "yes" (success)
pred = ifelse(predict(model, data, type = "response") > 0.5, "yes", "no")
# Misclassification Rate
mean(pred != y)
Calculate Sensitivity and Specificity howto
calc_sensitivity = function(predicted, actual) {
  # rows of the table are predicted labels, columns are actual labels
  tbl = table(predicted, actual)
  TP = tbl["yes", "yes"]
  FN = tbl["no", "yes"]
  TP / (TP + FN)
}
calc_specificity = function(predicted, actual) {
  tbl = table(predicted, actual)
  TN = tbl["no", "no"]
  FP = tbl["yes", "no"]
  TN / (TN + FP)
}
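Hypothetical usage, assuming pred from the classifier section above and an actual response y with the same no/yes labels:
calc_sensitivity(pred, y)
calc_specificity(pred, y)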