Generalized Linear Models (glm)

Table of Contents

Fit Logistic Regression Models howto

n = 100
x = rnorm(n)
eta = x
p = 1 / (1 + exp(-eta))
y = rbinom(n , size = 1, prob = p)
glm(y ~ x, family = binomial)

Call:  glm(formula = y ~ x, family = binomial)

Coefficients:
(Intercept)            x  
     0.2585       0.8823  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      138 
Residual Deviance: 119.8    AIC: 123.8

Perform the K-fold cross-validation for glm howto

library(boot)
v = cv.glm(data, model, K = 5)
# raw cross-validation estimate of prediction error
v$delta[1]

cv.glm has also a parameter named cost, which is a function accepts the actual and predcited value and returns cost, a non-negative scalar value.

When data has factor variable, there might be a case that a random sampled group observation doesn't cover all the factor levels. In this case, an error occurs. You can just try it again until it succeeds:

crossvalidation_kfold = function(model, data, K = 5) {
  ret = NULL
  attempt = 0
  MAX_ATTEMPT = 100
  while (is.null(ret) & attempt < MAX_ATTEMPT) {
    attempt = attempt + 1
    try({
      # NOTE: cv.glm() can fail because some split might not contain
      # all the levels of factor variables.
      # So, just retry  until it succeeds.
      # SEE: https://stackoverflow.com/questions/19946930/r-cross-validation-on-a-dataset-with-factors
      ret = cv.glm(data, model, K = K)
    }, silent = TRUE)
  }
  stopifnot(!is.null(ret))
  ret$delta[1]
}

Make predictions with Logistic Regression Models howto

# type = "link" to get the predicted eta value
predict(fit_glm, newdata = somedata, type = "link")

# type = "response" to get the predicted probability of y = 1
predict(fit_glm, newdata = somedata, type = "response")

Calculate the confidence interval for mean response of glm howto

# se.fit = TRUE to get the standard error
eta_hat = predict(model, newdata, se.fit = TRUE, type = "link")

# level = 0.95
critical_value = qnorm(0.975)
margin = critical_value * eta_hat$se.fit

# Calculate the link interval
lower_eta = eta_hat$fit - margin
upper_eta = eta_hat$fit + margin

# Calculate the response interval
p = function(eta) 1 / (1 + exp(-eta))
c(p(lower_eta), p(upper_eta))

Perform Likelihood Ratio Test howto

anova(small_model, larger_model, test = "LRT")

Deal with glm.fit: algorithm did not converge howto

fit = glm(..., maxit = 50)

Use the glm model as a classifier howto

# levels: yes no
pred = ifelse(predict(model, data, type = "response") > 0.5, "yes", "no")

# Misclassification Rate
mean(pred != y)

Calculate Sensitivity and Specificity howto

calc_sensitivity = function(predicted, actual) {
  tbl = table(predicted, actual)
  TP = tbl["no", "yes"]
  FN = tbl["no", "yes"]
  TP / (TP + FN)
}

calc_specificity = function(predicted, actual) {
  tbl = table(predicted, actual)
  TN = tbl["no", "no"]
  FP = tbl["yes", "no"]
  TN / (TN + FP)
}

Include all 2-way interactions in a glm model in R howto

glm(y ~ . * ., data)