MSSC 6250 Statistical Machine Learning
In many applications, the response or the error term is not normally distributed.
Besides normal, the response in a generalized linear model (GLM) can be Bernoulli, binomial, Poisson, etc.
There is no assumption that \(\mathrm{Var}(y_i)\) is homogeneous: Both \(E(y_i)\) and \(\mathrm{Var}(y_i)\) may vary with the regressors from data point to data point.
Ordinary least squares does not apply when the response is not normal.
Examples of classification problems: normal vs. COVID vs. smoking; fake news vs. true news.
Two popular approaches when modeling binary data
Given the training data \(\{(x_i, y_i)\}_{i=1}^n\), we build a classifier to predict whether people will default on their credit card payment \((Y)\), yes or no, based on monthly credit card balance \((X)\).
\[Y =\begin{cases} 0 & \quad \text{if not default}\\ 1 & \quad \text{if default} \end{cases}\]
What are potential issues with this dummy variable approach?
\[Y =\begin{cases} 1 & \quad \text{stroke}\\ 2 & \quad \text{drug overdose} \\ 3 & \quad \text{epileptic seizure} \end{cases}\]
This coding implies an ordering on the outcomes, epileptic seizure \(>\) drug overdose \(>\) stroke, and insists that the gaps between them are equal: stroke \(-\) drug overdose \(=\) drug overdose \(-\) epileptic seizure.
Instead, we model the probability of default using an S-shaped curve. We use the normal distribution for a numerical \(y\). What distribution can we use for a binary \(y\) that takes values 0 or 1?
Whether a person defaults \((y = 1)\) or does not default \((y = 0)\) is a Bernoulli variable: \[ y_i \mid x_i \stackrel{indep}{\sim} \text{Bernoulli}(\pi_i = \pi(x_i)) = \text{binomial}(m=1,\pi = \pi_i) \] But the success probability \(\pi_i\) depends on the balance \(x_i\): \(x_1 = 2000\) has a larger \(\pi_1 = \pi(2000)\) than \(\pi_2 = \pi(500)\) with \(x_2 = 500\).
Instead of predicting \(y_i\) directly, we use the predictors to model its probability of success, \(\pi_i\).
But how?
\[\eta = \text{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right)\]
For \(i = 1, \dots, n\) and with \(p\) predictors: \[Y_i \mid \pi_i({\bf x}_i) \stackrel{indep}{\sim} \text{Bernoulli}(\pi_i), \quad {\bf x}_i' = (x_{i1}, \dots, x_{ip})\] \[\text{logit}(\pi_i) = \ln \left( \frac{\pi_i}{1 - \pi_i} \right) = \eta_i = \beta_0+\beta_1 x_{i1} + \cdots + \beta_p x_{ip} = {\bf x}_i'\boldsymbol \beta\]
\[\small E(Y_i) = \pi_i = \frac{1}{1 + \exp(-{\bf x}_i'\boldsymbol \beta)}\] \[\small \hat{\pi}_i = \frac{1}{1 + \exp(-{\bf x}_i'\hat{\boldsymbol \beta} )}\]
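In base R, qlogis() and plogis() implement the logit and its inverse, so the two formulas above can be checked numerically; a quick sketch with an arbitrary probability:

p <- 0.8
eta <- qlogis(p)   ## logit: log(p / (1 - p))
plogis(eta)        ## inverse logit: 1 / (1 + exp(-eta)); returns 0.8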
Default data in the ISLR2 package
default student balance income
1 No No 730 44362
2 No Yes 817 12106
3 No No 1074 31767
4 No No 529 35704
5 No No 786 38463
6 No Yes 920 7492
7 No No 826 24905
8 No Yes 809 17600
9 No No 1161 37469
10 No No 0 29275
str(Default)
'data.frame': 10000 obs. of 4 variables:
$ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
$ balance: num 730 817 1074 529 786 ...
$ income : num 44362 12106 31767 35704 38463 ...
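The coefficient table below comes from a binomial GLM fit of default on balance; a minimal sketch of the fit (the object name logit_fit matches the predict() calls used later):

library(ISLR2)  ## provides the Default data
logit_fit <- glm(default ~ balance, data = Default, family = binomial)
summary(logit_fit)$coefficients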
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.6513 0.36116 -29 3.6e-191
balance 0.0055 0.00022 25 2.0e-137
\(\hat{\eta} = \text{logit}(\hat{\pi}) = \ln \left( \frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -10.65 + 0.0055 \times \text{balance}\)
The ratio \(\frac{\pi}{1-\pi} \in (0, \infty)\) is called the odds of some event.
\[\ln \left( \frac{\pi(x)}{1 - \pi(x)} \right)= \beta_0 + \beta_1x\]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.651 0.36 -29 0
balance 0.005 0.00 25 0
A one-unit increase in balance increases the log odds of default by 0.005 units; equivalently, the odds of default increase by about 0.5% \((e^{0.005} \approx 1.005)\) with one additional unit of credit card balance. For example, when the balance is 2000, \[\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -10.65+0.0055\times 2000\]
\[ \hat{\pi} = \frac{1}{1+\exp(-(-10.65+0.0055 \times 2000))} = 0.586\]
pi_hat <- predict(logit_fit, type = "response")
eta_hat <- predict(logit_fit, type = "link") ## type = "link" is the default; gives b0 + b1*x
predict(logit_fit, newdata = data.frame(balance = 2000), type = "response")
1
0.59
What is the probability of default when the balance is 500? What about balance 2500?
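One way to check, reusing logit_fit (a sketch):

predict(logit_fit, newdata = data.frame(balance = c(500, 2500)), type = "response")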
glm() uses maximum likelihood to estimate the parameters \(\boldsymbol \beta\). The log-likelihood is given by
\[\ell(\boldsymbol \beta) = \sum_{i=1}^n \log \, p(y_i \mid x_i, \boldsymbol \beta).\]
Using Bernoulli probabilities, we have
\[\begin{align} \ell(\boldsymbol \beta) =& \sum_{i=1}^n \log \left\{ \pi(\mathbf{x}_i)^{y_i} [1-\pi(\mathbf{x}_i)]^{1-y_i} \right\}\\ =& \sum_{i=1}^n y_i \log \frac{\pi(\mathbf{x}_i)}{1-\pi(\mathbf{x}_i)} + \log [1-\pi(\mathbf{x}_i)] \\ =& \sum_{i=1}^n y_i \mathbf{x}_i' \boldsymbol \beta- \log [ 1 + \exp(\mathbf{x}_i' \boldsymbol \beta)] \end{align}\]
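There is no closed-form solution for \(\hat{\boldsymbol \beta}\). The Newton-Raphson update below uses the score vector and Hessian of this log-likelihood, which (by a standard derivation not shown here) are \[\frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta} = \sum_{i=1}^n \mathbf{x}_i\left[y_i - \pi(\mathbf{x}_i)\right], \qquad \frac{\partial^2 \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta \partial \boldsymbol \beta'} = -\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \, \pi(\mathbf{x}_i)\left[1-\pi(\mathbf{x}_i)\right].\]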
\[\boldsymbol \beta^{\text{new}} = \boldsymbol \beta^{\text{old}} - \left[ \frac{\partial ^2 \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta\partial \boldsymbol \beta'}\right] ^{-1}\frac{\partial \ell(\boldsymbol \beta)}{\partial \boldsymbol \beta}\]
Alternatively, we can maximize the log-likelihood numerically, for example with optim().
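The helper functions my_loglik and my_gradient are not shown here; a minimal sketch consistent with the log-likelihood and score above:

## log-likelihood: sum of y_i * x_i'beta - log(1 + exp(x_i'beta))
my_loglik <- function(beta, x, y) {
  eta <- x %*% beta
  sum(y * eta - log(1 + exp(eta)))
}
## score vector: X'(y - pi)
my_gradient <- function(beta, x, y) {
  pi_hat <- 1 / (1 + exp(-(x %*% beta)))
  drop(t(x) %*% (y - pi_hat))
}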
x <- as.matrix(cbind("intercept" = 1, Default$balance))
y <- as.matrix(as.numeric(Default$default) - 1)
beta <- rep(0, ncol(x)) ## start with 0
optim(beta, fn = my_loglik, gr = my_gradient, method = "BFGS",
x = x, y = y,
control = list("fnscale" = -1))$par # "fnscale" = -1 for maximization
[1] -10.6454 0.0055
|  | True 0 | True 1 |
|---|---|---|
| Predict 0 | True Negative (TN) | False Negative (FN) |
| Predict 1 | False Positive (FP) | True Positive (TP) |
Sensitivity (True Positive Rate) \(= P( \text{predict 1} \mid \text{true 1}) = \frac{TP}{TP+FN}\)
Specificity (True Negative Rate) \(= P( \text{predict 0} \mid \text{true 0}) = \frac{TN}{FP+TN}\)
Accuracy \(= \frac{TP + TN}{TP+FN+FP+TN} = \frac{1}{m}\sum_{j=1}^mI(y_j = \hat{y}_j)\), where \(y_j\)s are true test labels and \(\hat{y}_j\)s are their corresponding predicted label.
A good classifier is one whose test accuracy rate is highest, or equivalently whose test error rate \(\frac{1}{m}\sum_{j=1}^mI(y_j \ne \hat{y}_j)\) is smallest.
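A minimal sketch of these metrics for the Default fit on the training data, using a 0.5 cutoff on the fitted probabilities (the vector pred_prob is also used by the ROC code below):

pred_prob <- predict(logit_fit, type = "response")
pred_label <- ifelse(pred_prob > 0.5, "Yes", "No")
tab <- table(Predict = pred_label, Truth = Default$default)
tab
sum(diag(tab)) / sum(tab)              ## accuracy
tab["Yes", "Yes"] / sum(tab[, "Yes"])  ## sensitivity (TPR)
tab["No", "No"] / sum(tab[, "No"])     ## specificity (TNR)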
library(ROCR)
# create an object of class prediction
pred <- ROCR::prediction(
predictions = pred_prob,
labels = Default$default)
# calculates the ROC curve
roc <- ROCR::performance(
prediction.obj = pred,
measure = "tpr",
x.measure = "fpr")
plot(roc, colorize = TRUE, lwd = 3)
Find the area under the curve:
## object of class 'performance'
auc <- ROCR::performance(
prediction.obj = pred,
measure = "auc")
auc@y.values
[[1]]
[1] 0.95
From the training set, choose the cut-off that maximizes
G-Mean \(= \sqrt{\text{TPR} * \text{TNR}}\) (geometric mean of TPR and TNR)
Youden’s J index \(= \text{TPR} + \text{TNR} - 1\) (the vertical distance from the ROC curve to the 45-degree chance line)
or that minimizes the distance from the ROC curve to the perfect-classifier corner \((\text{FPR}, \text{TPR}) = (0, 1)\).
With Youden’s J index, the optimal threshold is \(3.18\%\).
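A sketch of how such a cut-off can be located from the roc performance object computed above:

tpr <- roc@y.values[[1]]
fpr <- roc@x.values[[1]]
cutoffs <- roc@alpha.values[[1]]
cutoffs[which.max(tpr - fpr)]  ## Youden's J = TPR - FPR = TPR + TNR - 1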
When classifying \(K > 2\) categories, we can consider multinomial logistic regression.
The response should be nominal. If \(Y_i\) is ordinal, we should consider ordinal regression instead.
For \(k = 1, 2, \dots, K-1,\)
\[Pr(Y = k \mid {\bf x}) = \dfrac{e^{\beta_{k0}+\beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{ 1 + \sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1 + \cdots + \beta_{lp}x_p}},\] and \[Pr(Y = K \mid {\bf x}) = \dfrac{1}{ 1 + \sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}.\]
For \(k = 1, \dots, K-1\), \[\log\left( \frac{Pr(Y = k \mid \mathbf{x})}{Pr(Y = K \mid \mathbf{x})} \right) = \beta_{k0}+\beta_{k1}x_1 + \cdots + \beta_{kp}x_p.\]
First select a single class to serve as the baseline, the \(K\)th class for example.
\(\pi_{ik} = Pr(Y_i = k \mid {\bf x}_i)\): the probability that the \(i\)-th response falls in the \(k\)-th category.
\(\sum_{k = 1}^K \pi_{ik} = 1\) for each observation \(i\).
What is the value of \(\eta_{iK}\)?
For \(k = 1, \dots, K\),
\[\pi_{ik} = \dfrac{e^{\eta_{ik}}}{\sum_{l=1}^{K}e^{\eta_{il}}}; \left(\pi_{iK} = \dfrac{1}{\sum_{l=1}^{K}e^{\eta_{il}}} \right)\]
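A tiny numerical illustration with hypothetical linear predictors (recall \(\eta_{iK} = 0\) for the baseline class):

eta <- c(class1 = 0.4, class2 = -0.2, baseline = 0)  ## hypothetical values for one observation
exp(eta) / sum(exp(eta))                             ## the K class probabilities, summing to 1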
Rows: 200
Columns: 3
$ prog <fct> vocation, general, vocation, vocation, vocation, general, vocati…
$ ses <fct> low, middle, high, low, middle, high, middle, middle, middle, mi…
$ write <dbl> 35, 33, 39, 37, 31, 36, 36, 31, 41, 37, 44, 33, 31, 44, 35, 44, …
- prog: program type (general, academic, vocation)
- ses: socioeconomic status (low, middle, high)
- write: writing score

Program type is affected by socioeconomic status
prog
ses general academic vocation
low 16 19 12
middle 20 44 31
high 9 42 7
and writing score
M SD
general 51 9.4
academic 56 7.9
vocation 47 9.3
The levels (coding order) of prog are

[1] "general" "academic" "vocation"

After releveling so that academic is the baseline (prog2), the levels are

[1] "academic" "general" "vocation"
Coding: academic = 1 (baseline), general = 2, vocation = 3; for ses, low = 1 (baseline), middle = 2, high = 3.

\[\log \left( \frac{Pr(prog = general)}{Pr(prog = academic)}\right) = b_{10} + b_{11}I(ses = 2) + b_{12}I(ses = 3) + b_{13}write\]
\[\log \left( \frac{Pr(prog = vocation)}{Pr(prog = academic)}\right) = b_{20} + b_{21}I(ses = 2) + b_{22}I(ses = 3) + b_{23}write\]
All others held constant,

\(b_{13}\): the change in the log odds of being in the general vs. academic program as write increases by one unit.

\(b_{21}\): the change in the log odds of being in the vocation vs. academic program when moving from ses = "low" to ses = "middle".
nnet::multinom()
multino_fit <- nnet::multinom(prog2 ~ ses + write, data = multino_data, trace = FALSE)
summ <- summary(multino_fit)
summ$coefficients
(Intercept) sesmiddle seshigh write
general 2.9 -0.53 -1.16 -0.058
vocation 5.2 0.29 -0.98 -0.114
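Exponentiating the coefficients gives the multiplicative change in the odds of each program relative to the academic baseline; a quick follow-up using summ from above:

exp(summ$coefficients)  ## e.g., the write column: odds multiplier per one-unit increase in write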
glmnet() uses the softmax coding that treats all \(K\) classes symmetrically, and the coefficients have a different meaning. (ISL eq. (4.13), (4.14))
The fitted values, predictions, log odds between any pair of classes, and other key model outputs remain the same.
glmnet(x = input_matrix, y = categorical_vector, family = "multinomial", lambda = 0)
academic general vocation
1 0.15 0.34 0.51
2 0.12 0.18 0.70
3 0.42 0.24 0.34
4 0.17 0.35 0.48
5 0.10 0.17 0.73
6 0.35 0.24 0.41
Hold write at its mean and examine the predicted probabilities for each level of ses.

dses <- data.frame(ses = c("low", "middle", "high"),
                   write = mean(multino_data$write))
predict(multino_fit, newdata = dses, type = "probs")
academic general vocation
1 0.44 0.36 0.20
2 0.48 0.23 0.29
3 0.70 0.18 0.12
Logistic regression uses the logit function \(\eta = g(\pi) = \ln\left( \dfrac{\pi}{1-\pi}\right)\) as the link function.
\(\pi = \dfrac{1}{1 + e^{-\eta}} \in (0, 1)\) is the CDF of the logistic distribution, whose PDF is of the form \[f(\eta) = \frac{e^{-\eta}}{(1 + e^{-\eta})^2}, \quad \eta \in(-\infty, \infty)\]
We could use a different link function, and hence a different CDF, to describe the probability curve.
Distribution | Link Function \(\eta = g(\pi)\) | CDF \(\pi = g^{-1}(\eta)\) | PDF \(f(\eta)\) |
---|---|---|---|
Logistic | logit: \(\ln\left( \dfrac{\pi}{1-\pi}\right)\) | \(\dfrac{1}{1 + e^{-\eta}}\) | \(\dfrac{e^{-\eta}}{(1 + e^{-\eta})^2}\) |
Normal | probit: \(\Phi^{-1}(\pi)\) | \(\Phi(\eta)\) | \(\frac{1}{\sqrt{2\pi}} e^{-\frac{\eta^2}{2}}\) |
Gumbel | cloglog: \(\log\left( -\log(1-\pi)\right)\) | \(1 - \exp(-\exp(\eta))\) | \(e^{\eta - e^{\eta}}\) |
Any transformation that maps probabilities onto the real line can be used to produce a GLM for binary classification, as long as the transformation is one-to-one, continuous, and differentiable.
\[\pi = F(\eta)\] \[\eta = F^{-1}(\pi)\]
Name | Link Function \(\eta = g(\pi)\) |
---|---|
logit | \(\ln\left( \dfrac{\pi}{1-\pi}\right)\) |
probit | \(\Phi^{-1}(\pi)\) |
cloglog | \(\log\left( -\log(1-\pi)\right)\) |
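In R, each of these links is available through the link argument of binomial(); a sketch refitting the Default model with each link (the probit and cloglog fits are for illustration only, not used elsewhere in these notes):

logit_fit2  <- glm(default ~ balance, data = Default, family = binomial(link = "logit"))
probit_fit  <- glm(default ~ balance, data = Default, family = binomial(link = "probit"))
cloglog_fit <- glm(default ~ balance, data = Default, family = binomial(link = "cloglog"))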
Repeated measures and binomial logistic regression
Regression diagnostics for binary data
Penalized logistic regression
Exponential family
Poisson regression