MSSC 6250 Statistical Machine Learning
When flipping a fair coin, we say that “the probability of flipping Heads is 0.5.” How do you interpret this probability?
An election is coming up and a pollster claims that candidate Yu has a 0.9 probability of winning. How do you interpret this probability?
Two claims.
(1) Ben claims he can predict the coin flip outcome. To test his claim, you flip a fair coin 8 times and he correctly predicts all.
(2) Emma claims she can distinguish natural and artificial sweeteners. To test her claim, you give her 8 samples and she correctly identifies each.
In light of these experiments, what do you conclude?
Suppose that during a doctor’s visit, you tested positive for COVID. If you only get to ask the doctor one question, which would it be?
The probability that some outcome of a process will be obtained is defined as
the relative frequency with which that outcome would be obtained if the process were repeated a large number of times independently under similar conditions.
Flipping a coin 10 times:

Outcome | Frequency | Relative Frequency |
---|---|---|
Heads | 4 | 0.4 |
Tails | 6 | 0.6 |
Total | 10 | 1.0 |

Flipping a coin 1,000 times:

Outcome | Frequency | Relative Frequency |
---|---|---|
Heads | 512 | 0.512 |
Tails | 488 | 0.488 |
Total | 1000 | 1.000 |
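A quick R simulation (a minimal sketch) of this frequentist idea: as the number of independent flips grows, the relative frequency of Heads settles near 0.5.

# Simulate 1,000 fair coin flips and tabulate relative frequencies
set.seed(2025)
flips <- sample(c("Heads", "Tails"), size = 1000, replace = TRUE)
table(flips) / length(flips)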
David Spiegelhalter (2024). Why probability probably doesn’t exist (but it is useful to act like it does). Nature, 636, 560–563.
How can we live if we don’t change? - Beyoncé. Lyric from “Satellites.”
Using data and prior beliefs to update our knowledge (posterior), and repeating.
We continuously update our knowledge about the world as we accumulate lived experiences, or collect data.
In Question 4,
Status | Test Positive | Test Negative | Total |
---|---|---|---|
COVID | 3 | 1 | 4 |
No COVID | 9 | 87 | 96 |
Total | 12 | 88 | 100 |
\(H_0\): Do not have COVID vs. \(H_1\): Have COVID
A frequentist assesses the uncertainty of the observed data in light of an assumed hypothesis: \(P(\text{Data} \mid H_0) = 9/96\).
A Bayesian assesses the uncertainty of the hypothesis in light of the observed data: \(P(H_0 \mid \text{Data}) = 9/12\).
An exclamation mark (“!”) in an article title is more consistent with fake news (Evidence).

title_has_excl | fake | real |
---|---|---|
FALSE | 44 | 88 |
TRUE | 16 | 2 |
Total | 60 | 90 |
\(F\): an article is fake.
The prior probability model
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Probability \(P(\cdot)\) | 0.4 | 0.6 | 1 |
title_has_excl | fake | real |
---|---|---|
FALSE | 44 | 88 |
TRUE | 16 | 2 |
Total | 60 | 90 |
\(D\): an article title has an exclamation mark.
Conditional probability: \(P(D \mid {\color{blue}{F}}) = 16/60 \approx 27\%\); \(P(D \mid {\color{blue}{F^c}}) = 2/90 \approx 2\%\).
The likelihood of \(F\) and \(F^c\) given the observed data \({\color{blue}{D}}\) (the title has an exclamation mark): \[L(F \mid {\color{blue}{D}}) = P({\color{blue}{D}} \mid F) \text{ and } L(F^c \mid {\color{blue}{D}}) = P({\color{blue}{D}} \mid F^c)\]
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Probability \(P(\cdot)\) | 0.4 | 0.6 | 1 |
Likelihood \(L(\cdot \mid D)\) | 0.27 | 0.02 | 0.29 |
\[\begin{align*} P(F \mid D) &= \frac{P(F \cap D)}{P(D)}\\ &= \frac{L(F \mid D)P(F)}{P(D)}\end{align*}\]
\[\text{posterior = } \frac{\text{likelihood} \cdot \text{prior }}{ \text{normalizing constant}} \]
A new article is posted with the title “The president has a funny secret!” Its title has an exclamation mark, a feature that’s more common to fake news.
Event | \(F\) | \(F^c\) | Total |
---|---|---|---|
Prior prob \(P(\cdot)\) | 0.4 | 0.6 | 1 |
Posterior prob \(P(\cdot \mid D)\) | 0.89 | 0.11 | 1 |
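A minimal R sketch of this update, applying Bayes’ rule directly to the table counts:

# Bayes' rule for the fake-news example
prior <- c(F = 0.4, Fc = 0.6)              # prior P(F), P(F^c)
like  <- c(F = 16/60, Fc = 2/90)           # likelihoods L(F | D), L(F^c | D)
post  <- like * prior / sum(like * prior)  # posterior: about 0.89 and 0.11
post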
For a parameter \(\theta\), treated as a random variable, and data \({\bf Y} = (Y_1, \dots, Y_n)\),
\(\pi(\theta)\): the prior pmf/pdf of \(\theta\)
\(L(\theta \mid y_1,\dots, y_n)\): the likelihood of \(\theta\) given observed data \(\mathbf{y}= \{y_i \}_{i = 1}^n\).
The posterior distribution of \(\theta\) given \(\mathbf{y}\) is
\[\pi(\theta \mid \mathbf{y}) = \frac{L(\theta \mid \mathbf{y})\pi(\theta)}{p(\mathbf{y})}\] where \[p(\mathbf{y}) = \begin{cases} \int_{\Theta} L(\theta \mid \mathbf{y})\pi(\theta) ~ d\theta & \text{if } \theta \text{ is continuous }\\ \sum_{\theta \in \Theta} L(\theta \mid \mathbf{y})\pi(\theta) & \text{if } \theta \text{ is discrete } \end{cases}\]
\[\pi(\theta \mid \mathbf{y}) = \frac{L(\theta \mid \mathbf{y})\pi(\theta)}{p(\mathbf{y})} \propto_{\theta} L(\theta \mid \mathbf{y})\pi(\theta)\]
\[\text{posterior } \propto \text{ likelihood } \cdot \text{ prior } \]
Key: Describe prior and data information using probabilistic models.
\[\pi(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta-1}\]
\(\theta \sim \text{beta}(\alpha, \beta)\)
In the prior model, \(\alpha\) and \(\beta\) are hyperparameters to be chosen to reflect our prior information.
Michelle’s support is centered around 45%; she polled at around 35% in the dreariest days and around 55% in the best days.
Choose \(\alpha\) and \(\beta\) so that the prior mean is about 0.45 and the range is roughly from 0.35 to 0.55 (see the sketch after the formulas below).
\(\mathrm{E}(\theta) = \frac{\alpha}{\alpha + \beta}\)
\(\mathrm{Var}(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\)
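As a quick check, here is a minimal R sketch verifying that \(\text{beta}(45, 55)\), the prior adopted later in these notes, matches these beliefs:

# Prior mean of theta ~ beta(45, 55)
45 / (45 + 55)                  # 0.45
# Central 95% prior interval
qbeta(c(0.025, 0.975), 45, 55)  # roughly 0.35 and 0.55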
You plan to conduct a new poll of \(n = 50\) Cheeseheads and record \(Y\), the number that support Michelle.
What distribution can be used to model the likelihood connecting the data \(y\) and the parameter we are interested in, \(\theta\)?
\[Y \mid \theta \sim \text{binomial}(n=50, \theta)\]
\[L(\theta \mid y = 30) = {50 \choose 30}\theta^{30}(1-\theta)^{20}, \quad \theta \in (0, 1)\]
\[\begin{align}Y \mid \theta &\sim \text{binomial}(n=50, \theta)\\ \theta &\sim \text{beta}(45, 55) \end{align}\]
Goal: Obtain the posterior distribution \(\pi(\theta \mid y)\).
\[ \begin{align} \pi(\theta \mid y) &\propto_{\theta} L(\theta \mid y)\pi(\theta) \\ &= {50 \choose 30}\theta^{30}(1-\theta)^{20} \times \frac{\Gamma(100)}{\Gamma(45)\Gamma(55)}\theta^{44}(1-\theta)^{54}\\ &\propto_{\theta} \theta^{74}(1-\theta)^{74}\end{align} \] so \(\pi(\theta \mid y) = \frac{\Gamma(150)}{\Gamma(75)\Gamma(75)} \theta^{74}(1-\theta)^{74}\), i.e., \(\theta \mid y \sim \text{beta}(75, 75)\),
using the fact that \(\int_{\mathcal{X}} f(x) dx = 1\) for any pdf \(f(x)\).
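A minimal base-R sketch of this update, plotting the beta(45, 55) prior, the (rescaled) binomial likelihood, and the beta(75, 75) posterior together:

# Prior, scaled likelihood, and posterior for Michelle's support
theta <- seq(0, 1, length.out = 500)
like  <- dbinom(30, size = 50, prob = theta)             # L(theta | y = 30)
plot(theta, dbeta(theta, 75, 75), type = "l",
     xlab = expression(theta), ylab = "density")         # posterior
lines(theta, dbeta(theta, 45, 55), lty = 2)              # prior
lines(theta, like / max(like) * max(dbeta(theta, 75, 75)),
      lty = 3)                                           # likelihood (rescaled)
legend("topright", c("posterior", "prior", "scaled likelihood"), lty = 1:3)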
Bayesian/Probabilistic
Frequentist/Classical
In simple linear regression,
\[Y_i \mid \beta_0, \beta_1, \sigma \stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i\]
This normal data model is our likelihood \(L(\boldsymbol \theta= (\beta_0, \beta_1, \sigma) \mid \mathcal{D} = \{\mathbf{y}, \mathbf{x}\})\), as it evaluates how compatible the data are with different possible values of the parameters.
To be a Bayesian, what do we do next?
\[\begin{align} \beta_0 \sim N(m_0, s_0^2) \\ \beta_1 \sim N(m_1, s_1^2) \end{align}\]
\[\begin{align} \sigma \sim \text{Exp}(\lambda) \end{align}\] with \(\pi(\sigma) = \lambda e^{-\lambda \sigma}\) for \(\sigma > 0\), where \(\lambda > 0\), \(\mathrm{E}(\sigma) = 1/\lambda\), and \(\mathrm{Var}(\sigma) = 1/\lambda^2\).
\[\begin{align} Y_i \mid \beta_0, \beta_1, \sigma &\stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \beta_0 + \beta_1 X_i \\ \beta_0 &\sim N(m_0, s_0^2) \\ \beta_1 &\sim N(m_1, s_1^2) \\ \sigma &\sim \text{Exp}(\lambda) \end{align}\]
bayesrules::bikes
Data in Washington, D.C.
Rows: 500
Columns: 2
$ rides <int> 654, 1229, 1454, 1518, 1362, 891, 1280, 1220, 1137, 1368, 13…
$ temp_feel <dbl> 64.7, 49.0, 51.1, 52.6, 50.8, 46.6, 45.6, 49.2, 46.4, 45.6, …
\(\beta_{0c} \sim N(5000, 1000^2)\) (prior on the centered intercept)
\(\beta_{1} \sim N(100, 40^2)\)
\(\sigma \sim \text{Exp}(0.0008)\) because \(\mathrm{E}(\sigma) = 1/\lambda = 1/0.0008 = 1250\)
Update our prior understanding of the relationship between ridership and temperature using the data information provided by the likelihood.
The joint posterior distribution is
\[\pi(\beta_0, \beta_1, \sigma \mid \mathbf{y}) = \frac{L(\beta_0, \beta_1, \sigma \mid \mathbf{y})\pi(\beta_0, \beta_1, \sigma)}{p(\mathbf{y})}\] where
\(L(\beta_0, \beta_1, \sigma \mid \mathbf{y}) = p(\mathbf{y}\mid \beta_0, \beta_1, \sigma) = \prod_{i=1}^np(y_i \mid \beta_0, \beta_1, \sigma)\)
\(\pi(\beta_0, \beta_1, \sigma) = \pi(\beta_0)\pi(\beta_1)\pi(\sigma)\)
\(p(\mathbf{y}) = \int \int \int \left[\prod_{i=1}^np(y_i \mid \beta_0, \beta_1, \sigma)\right]\pi(\beta_0)\pi(\beta_1)\pi(\sigma) ~ d\beta_0d\beta_1d\sigma\)
There are lots of ways to generate/approximate the posterior distribution of parameters. One method is Markov chain Monte Carlo (MCMC).
rstanarm::stan_glm()

The rstanarm package uses RStan syntax to do Bayesian inference for applied regression models (arm).

bike_model <- rstanarm::stan_glm(rides ~ temp_feel, data = bikes,
                                 family = gaussian,
                                 prior_intercept = normal(5000, 1000),
                                 prior = normal(100, 40),
                                 prior_aux = exponential(0.0008),
                                 chains = 4, iter = 5000*2, seed = 2025)
Generate 4 parallel Markov chains (chains = 4), each having 10000 iterations (iter = 5000*2).
Each iteration draws a posterior sample of \(\beta_0\), \(\beta_1\), and \(\sigma\).
After tossing out the first half of Markov chain values from the warm-up or burn-in phase, the stan_glm() simulation produces four parallel chains of length 5000 for each model parameter:
\(\{ \beta_0^{(1)}, \beta_0^{(2)}, \dots, \beta_0^{(5000)} \}\), \(\{ \beta_1^{(1)}, \beta_1^{(2)}, \dots, \beta_1^{(5000)} \}\), \(\{ \sigma^{(1)}, \sigma^{(2)}, \dots, \sigma^{(5000)} \}\)
# Trace plots of parallel chains
bayesplot::mcmc_trace(bike_model, size = 0.1)
# Density plots of parallel chains
bayesplot::mcmc_dens_overlay(bike_model)
# Effective sample size ratio
bayesplot::neff_ratio(bike_model)
(Intercept) temp_feel sigma
0.995 0.992 0.950
# Rhat
bayesplot::rhat(bike_model)
(Intercept) temp_feel sigma
1 1 1
# Posterior summary statistics (tidy() from the broom.mixed package)
broom.mixed::tidy(bike_model, effects = c("fixed", "aux"),
                  conf.int = TRUE, conf.level = 0.80)
# A tibble: 4 × 5
term estimate std.error conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2191. 355. -2653. -1735.
2 temp_feel 82.1 5.07 75.7 88.7
3 sigma 1282. 41.1 1231. 1336.
4 mean_PPD 3487. 80.3 3385. 3590.
The posterior median model of the regression line is \[-2191 + 82.1 X\]
The 80% credible interval for \(\beta_1\) is \((75.7, 88.7)\).
Given the data, the probability that \(\beta_1\) is between 75.7 and 88.7 is 80%, i.e., \(P(\beta_1 \in (75.7, 88.7) \mid \mathbf{y}, \mathbf{x}) = 80\%\).
# Store the 4 chains for each parameter in 1 data frame
bike_model_df <- as.data.frame(bike_model)
nrow(bike_model_df)
[1] 20000
head(bike_model_df)
(Intercept) temp_feel sigma
1 -2318 83.0 1304
2 -2107 81.9 1261
3 -2460 85.9 1246
4 -2419 84.7 1394
5 -2067 79.5 1202
6 -1806 75.6 1224
How to obtain \(P(\beta_1 > 0 \mid \mathbf{y}, \mathbf{x})\)?
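A minimal sketch using the stored posterior draws: approximate the probability by the fraction of draws with \(\beta_1^{(t)} > 0\); the same draws also reproduce the 80% credible interval above.

# Approximate P(beta_1 > 0 | data) by the fraction of positive draws
mean(bike_model_df$temp_feel > 0)
# 80% posterior credible interval for beta_1 from the draws
quantile(bike_model_df$temp_feel, probs = c(0.10, 0.90))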
For each posterior draw \(\{\beta_0^{(t)}, \beta_1^{(t)}, \sigma^{(t)} \}_{t = 1}^{20000}\), we have the posterior predictive distribution \[Y_i^{(t)} \sim N\left(\beta_0^{(t)} + \beta_1^{(t)}X_i, \, (\sigma^{(t)})^2\right)\]
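For instance, a minimal sketch of the posterior predictive distribution of ridership on a hypothetical 75-degree day (the value 75 is chosen only for illustration), built directly from the 20000 draws:

# Posterior predictive draws of rides when temp_feel = 75 (hypothetical value)
b0    <- bike_model_df$`(Intercept)`
b1    <- bike_model_df$temp_feel
sigma <- bike_model_df$sigma
y_new <- rnorm(nrow(bike_model_df), mean = b0 + b1 * 75, sd = sigma)
quantile(y_new, probs = c(0.10, 0.90))  # 80% posterior predictive interval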
Lasso and Ridge regression can be interpreted as Bayesian regression models.
The same data-level likelihood \[Y_i \mid \boldsymbol \beta, \sigma \stackrel{ind}{\sim} N\left(\mu_i, \sigma^2 \right) \quad \text{with} \quad \mu_i = \mathbf{x}_i'\boldsymbol \beta\]
We use prior distributions to regularize how parameters behave.
Lasso: \(\beta_j \stackrel{iid}{\sim} \text{Laplace}\left(0, \tau(\lambda)\right)\)
Ridge: \(\beta_j \stackrel{iid}{\sim} N\left(0, \tau(\lambda)\right)\)
Lasso
\[\beta_j \stackrel{iid}{\sim} \text{Laplace}\left(0, \tau(\lambda)\right)\]
Lasso solution is the posterior mode of \(\boldsymbol \beta\)
\[\boldsymbol \beta^{(l)} = \mathop{\mathrm{arg\,max}}_{\boldsymbol \beta} \,\,\, \pi(\boldsymbol \beta\mid \mathbf{y}, \mathbf{x})\]
Ridge
\[\beta_j \stackrel{iid}{\sim} N\left(0, \tau(\lambda)\right)\]
Ridge solution is the posterior mode/mean of \(\boldsymbol \beta\)
\[\boldsymbol \beta^{(r)} = \mathop{\mathrm{arg\,max}}_{\boldsymbol \beta} \,\, \, \pi(\boldsymbol \beta\mid \mathbf{y}, \mathbf{x})\]
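To see the connection, here is a sketch treating \(\sigma\) and the prior scale \(\tau\) as fixed: maximizing the posterior is the same as minimizing its negative logarithm, and under a Laplace prior \[-\log \pi(\boldsymbol \beta \mid \mathbf{y}, \mathbf{x}) = \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i'\boldsymbol \beta)^2 + \frac{1}{\tau}\sum_{j} |\beta_j| + \text{constant},\] which is the lasso criterion with \(\lambda\) determined by \(\sigma^2/\tau\) (up to the parameterization of \(\tau(\lambda)\)); a normal prior replaces \(\frac{1}{\tau}\sum_j |\beta_j|\) with \(\frac{1}{2\tau}\sum_j \beta_j^2\), giving the ridge criterion.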