MSSC 6250 Statistical Machine Learning
Tree-based methods can be used for both regression and classification.
IDEA: Segmenting the predictor space into many simple regions.
Trees are simple, useful for interpretation, and have a nice graphical representation.
Not competitive with the best supervised learning approaches in terms of prediction accuracy. (Large bias)
Combining a large number of trees (ensembles) often results in improvements in prediction accuracy, at the expense of some loss of interpretability.
CART is a nonparametric method that recursively partitions the feature space into hyper-rectangular subsets (boxes), and makes a prediction on each subset.
Divide the predictor space — the set of possible values for \(X_1, X_2, \dots, X_p\) — into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \dots, R_J\).
KNN requires K and a distance measure.
SVM requires kernels.
A tree solves this by recursively partitioning the feature space using a binary splitting rule \(\mathbf{1}\{x \le c \}\).
0: Red; 1: Blue
If \(x_2 < -0.64\), \(y = 0\).
If \(x_2 \ge -0.64\) and \(x_1 \ge 0.69\), \(y = 0\).
If \(x_2 \ge -0.64\), \(x_1 < 0.69\), and \(x_2 \ge 0.75\), \(y = 0\).
If \(x_2 \ge -0.64\), \(x_1 < 0.69\), \(x_2 < 0.75\), and \(x_1 < -0.69\), \(y = 0\).
Step 5 may not be beneficial.
Step 6 may not be beneficial. (Could overfit)
Step 7 may not be beneficial. (Could overfit)
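A tree like the one above can be fit with rpart; a minimal sketch, assuming numeric predictors x1, x2 and a 0/1 response y (rpart.plot is used only to draw the fitted tree):
library(rpart)
library(rpart.plot)
dat <- data.frame(x1, x2, y = factor(y))               # assumed training data
tree_fit <- rpart(y ~ x1 + x2, data = dat, method = "class")
rpart.plot(tree_fit)                                   # visualize the splits
head(predict(tree_fit, dat, type = "class"))           # fitted classes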
The classification error rate is the fraction of the training observations in the region that do not belong to the most common class: \[1 - \max_{k} (\hat{p}_{mk})\] where \(\hat{p}_{mk}\) is the proportion of training observations in the \(m\)th region that are from the \(k\)th class.
However, the classification error rate is not sufficiently sensitive for tree-growing.
We hope to have nodes (regions) whose training points belong to only one class.
The Gini index is defined by
\[\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})\] which is a measure of total variance across the \(K\) classes.
Gini is small if all of the \(\hat{p}_{mk}\)s are close to zero or one.
Node purity: a small value indicates that a node contains predominantly observations from a single class.
The Shannon entropy is defined as
\[- \sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk}).\]
The entropy is near zero if the \(\hat{p}_{mk}\)s are all near zero or one.
The Gini index and the entropy are numerically quite similar.
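A small numerical sketch comparing the three node impurity measures, using hypothetical class proportions \(\hat{p}_{mk}\):
p <- c(0.7, 0.2, 0.1)                   # hypothetical p_mk, k = 1, ..., K
class_error <- 1 - max(p)               # classification error rate
gini        <- sum(p * (1 - p))         # Gini index
entropy     <- -sum(p * log(p))         # Shannon entropy
c(error = class_error, gini = gini, entropy = entropy)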
The goal is to find boxes \(R_1, \dots ,R_J\) that minimize the \(SS_{res}\), given by \[\sum_{j=1}^J\sum_{i \in R_j}\left( y_i - \hat{y}_{R_j}\right)^2\] where \(\hat{y}_{R_j}\) is the mean response for the training observations within \(R_j\).
Given the largest tree \(T_{max}\),
\[\begin{align} \min_{T \subset T_{max}} \sum_{m=1}^{|T|}\sum_{i:x_i\in R_m} \left( y_i - \hat{y}_{R_m}\right)^2 + \alpha|T| \end{align}\] where \(|T|\) indicates the number of terminal nodes of the tree \(T\).
Large \(\alpha\) results in small trees
Choose \(\alpha\) using CV
Algorithm 8.1 in ISL for building a regression tree.
For classification, replace \(SS_{res}\) with a classification performance measure.
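A minimal cost-complexity pruning sketch with rpart, assuming a data frame dat with response y; rpart's cp parameter plays the role of \(\alpha\) (rescaled by the root-node error).
library(rpart)
big_tree <- rpart(y ~ ., data = dat, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 2))  # grow T_max
printcp(big_tree)                                   # CV error for each subtree
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)            # subtree chosen by CV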
Linear regression
\[f(X) = \beta_0 + \sum_{j=1}^pX_j\beta_j\]
Regression tree
\[f(X) = \sum_{j=1}^J \hat{y}_{R_j}\mathbf{1}(\mathbf{X}\in R_j)\]
A regression tree performs better when there is a highly nonlinear and complex relationship between \(y\) and \(x\), as the simulated sketch below illustrates.
Preferred for interpretability and visualization.
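A quick simulated comparison, assuming a step-like (hence highly nonlinear) relationship between \(y\) and \(x\); here the tree's training MSE should beat that of linear regression.
library(rpart)
set.seed(6250)
x <- runif(200, -1, 1)
y <- ifelse(x < 0, 2, -1) + rnorm(200, sd = 0.3)    # step-like truth
dat <- data.frame(x, y)
lin_fit  <- lm(y ~ x, data = dat)
tree_fit <- rpart(y ~ x, data = dat)
mean(residuals(lin_fit)^2)                  # training MSE: linear regression
mean((dat$y - predict(tree_fit, dat))^2)    # training MSE: regression tree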
Two heads are better than one, not because either is infallible, but because they are unlikely to go wrong in the same direction. – C.S. Lewis, British Writer (1898 - 1963)
"Three cobblers with their wits combined surpass Zhuge Liang." – Chinese proverb
An ensemble method combines many weak learners (unstable, less accurate) to obtain a single and powerful model.
CARTs suffer from high variance.
If independent \(Z_1, \dots, Z_n\) have variance \(\sigma^2\), then \(\bar{Z}\) has variance \(\sigma^2/n\).
Averaging a set of observations reduces variance!
With \(B\) separate training sets,
\[\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}_{b}(x)\]
In practice we generally do not have \(B\) separate training sets; bagging instead draws \(B\) bootstrap samples from the single training set, fits a tree \(\hat{f}^*_{b}\) to each, and averages:
\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^*_{b}(x)\]
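A bagging sketch in R, assuming a training data frame dat with numeric response y and a test data frame newdat with the same predictors:
library(rpart)
B <- 200
preds <- replicate(B, {
  idx  <- sample(nrow(dat), replace = TRUE)             # bootstrap sample
  tree <- rpart(y ~ ., data = dat[idx, ],
                control = rpart.control(cp = 0))        # grow a deep tree
  predict(tree, newdata = newdat)
})
f_bag <- rowMeans(preds)        # average the B trees' predictions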
For CART, the decision boundary has to be aligned with the axes.
For bagging with \(B = 200\) bootstrapped trees, each built on 400 training points, the decision boundaries are smoother.
Using a large \(B\) will not lead to overfitting.
Use \(B\) sufficiently large that the error has settled down.
Bagging improves prediction accuracy at the expense of interpretability.
When different trees are highly correlated, simply averaging is not very effective.
The predictions from the bagged trees will be highly correlated, and hence averaging does not lead to as large a reduction in variance.
Random forests improve bagged trees by decorrelating the trees.
At each split, \(m\) predictors are randomly sampled as split candidates from the full set of \(p\) predictors.
\(m \approx \sqrt{p}\) for classification; \(m \approx p/3\) for regression.
Decorrelating: on average \((p - m)/p\) of the splits will not even consider the strong predictor, and so other predictors will have more of a chance.
If \(m = p\), random forests = bagging.
The improvement is significant when \(p\) is large.
randomForest::randomForest(x, y, mtry, ntree, nodesize, sampsize)
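A usage sketch for a classification forest with \(m \approx \sqrt{p}\), assuming a numeric predictor matrix x and a factor response y (500 trees is an illustrative choice):
library(randomForest)
rf_fit <- randomForest(x = x, y = y,
                       mtry = floor(sqrt(ncol(x))),   # m for classification
                       ntree = 500, nodesize = 1)
rf_fit$err.rate[rf_fit$ntree, "OOB"]     # out-of-bag error with all trees
importance(rf_fit)                       # variable importance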
Bagging
Trees are built on independent bootstrap data sets.
Trees are grown deep.
Large number of trees (\(B\)) won’t overfit.
Boosting
Trees are grown sequentially: each tree is grown using information from previously grown trees.
Each tree is fit to a modified version of the original data set: the residuals (for regression) or the misclassified observations (for classification)!
Trees are rather small (weak learner).
Large \(B\) can overfit.
distribution = "bernoulli"
: LogitBoost
gbm.fit = gbm::gbm(y ~ ., data = data.frame(x1, x2, y),
distribution = "bernoulli",
n.trees = 10000, shrinkage = 0.01, bag.fraction = 0.6,
interaction.depth = 2, cv.folds = 10)
gbm.perf(gbm.fit, method = "cv")
[1] 1181
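The CV-selected number of trees should then be used for prediction; newdata below is a hypothetical data frame with columns x1 and x2:
best_iter <- gbm::gbm.perf(gbm.fit, method = "cv", plot.it = FALSE)
predict(gbm.fit, newdata = newdata, n.trees = best_iter, type = "response")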
gbm.fit <- gbm::gbm(y ~ x, data = data.frame(x, y),
distribution = "gaussian", n.trees = 300,
shrinkage = 0.5, bag.fraction = 0.8, cv.folds = 10)
AdaBoost (Adaptive Boosting) gbm(y ~ ., distribution = "adaboost")
Gradient Boosting/Extreme Gradient Boosting (XGBoost) xgboost
Bayesian Additive Regression Trees (BART) (ISL Sec. 8.2.4)
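A minimal xgboost sketch for binary classification, assuming a numeric predictor matrix x and a 0/1 response y (the tuning values are illustrative only):
library(xgboost)
dtrain  <- xgb.DMatrix(data = as.matrix(x), label = y)
params  <- list(objective = "binary:logistic", eta = 0.1, max_depth = 2)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 200)
head(predict(xgb_fit, dtrain))           # predicted probabilities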