In machine learning we are given a set of data points and the goal is to create a function that draws a nice curve through them. We call that function a model; it maps $X$ to $y$ and can therefore make predictions for some unknown $x$. Adding up a bunch of subfunctions to create a composite function is called additive modeling. Gradient boosting machines use additive modeling to gradually nudge an approximate model towards a really good model by adding simple submodels to a composite model.
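As a toy illustration of additive modeling (the subfunctions below are hand-picked for the example, not learned from data), a composite function is just the sum of simpler pieces:

```python
import numpy as np

# Toy additive model: the composite prediction is the sum of simple subfunctions.
# These particular subfunctions are illustrative only.
def f0(x): return 30.0            # a constant baseline
def delta1(x): return x           # a linear nudge
def delta2(x): return np.sin(x)   # a wiggly nudge

def composite(x):
    return f0(x) + delta1(x) + delta2(x)   # = 30 + x + sin(x)

x = np.linspace(0, 10, 5)
print(composite(x))
```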
Boosting
Boosting combines multiple weak models sequentially into a single composite model. The idea is that, as we introduce more weak models, the overall model becomes a stronger and stronger predictor.
Gradient boosting machines (GBM)
Let's think of the boosting approach as a game of golf. The golfer initially hits the ball towards the hole at $y$ but only gets as far as $f_0(X)$. The golfer then repeatedly taps the ball more softly, working it towards the hole, reassessing direction and distance to the hole at each stage.
GBM for regression pseudocode
- Train the first model, $f_0(X)$, to predict target $y$ given $X$. As a regression tree, it predicts the average of the $y$ values in each leaf node.
- Compute the difference between the target $y$ and the first model's prediction, $y - f_0(X)$; this is called the residual or direction vector.
- Train the next weak model, $\Delta_1(X)$, on the residual of the previous model, i.e. $y - f_0(X)$.
- The new prediction model, $F_1(X)$, is the previous model plus a nudge, $\Delta_1(X)$, scaled by learning rate $\eta$: $$F_1(X) = f_0(X) + \eta \Delta_1(X)$$
- Likewise, train every subsequent weak model $\Delta_m(X)$ on the residual of the composite model so far, i.e. $y - F_{m-1}(X)$.
- Adding all of the weak models to the initial $f_0(X)$ gives the strong composite model $F_M(X)$ (see the sketch after the recurrence below).
$$\begin{align}
\hat y &= f_0(X) + \eta \Delta_1(X) + \eta \Delta_2(X) + \dots + \eta \Delta_M(X) \\
&= f_0(X) + \eta \sum_{m=1}^M \Delta_m(X) \\
&= F_M(X)
\end{align}$$
Or, written as a recurrence:
$$F_0(X) = f_0(X)$$
$$F_m(X) = F_{m-1}(X) + \eta \Delta_m(X)$$
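Here is a minimal sketch of that training loop, assuming squared-error loss, a constant initial model $f_0 = \text{mean}(y)$, and shallow scikit-learn regression trees as the weak models $\Delta_m$. The function names (`boost_fit`, `boost_predict`) and the toy data are illustrative, not part of any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, M=100, eta=0.1, max_depth=2):
    f0 = np.mean(y)                        # initial model: average of y
    F = np.full(len(y), f0)                # current composite prediction F_{m-1}(X)
    deltas = []
    for _ in range(M):
        residual = y - F                   # direction vector y - F_{m-1}(X)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)              # train Delta_m on the residual
        F += eta * tree.predict(X)         # F_m = F_{m-1} + eta * Delta_m
        deltas.append(tree)
    return f0, deltas

def boost_predict(X, f0, deltas, eta=0.1):
    # Use the same eta as in boost_fit.
    return f0 + eta * np.sum([t.predict(X) for t in deltas], axis=0)

# Usage on toy data
X = np.random.uniform(0, 10, size=(200, 1))
y = 30 + X[:, 0] + np.sin(X[:, 0]) + np.random.normal(0, 0.3, 200)
f0, deltas = boost_fit(X, y)
print(boost_predict(X[:5], f0, deltas))
```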
Two common direction vector choices
When the $\Delta_m(X)$ models are trained on the residual vector, $y - F_{m-1}(X)$, the overall model is optimized for the squared error loss function.
When the $\Delta_m(X)$ models are trained on the sign vector, $\text{sign}(y - F_{m-1}(X))$, the overall model is optimized for the absolute error loss function.
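A one-line derivation (added here for clarity) makes the connection explicit: each direction vector is the negative gradient of its loss function with respect to the current prediction $F_{m-1}(x_i)$; the factor of $\frac{1}{2}$ on the squared error is included only to keep the derivative clean.
$$-\frac{\partial}{\partial F(x_i)} \, \frac{1}{2}\sum_j \big(y_j - F(x_j)\big)^2 = y_i - F(x_i)$$
$$-\frac{\partial}{\partial F(x_i)} \, \sum_j \big|y_j - F(x_j)\big| = \text{sign}\big(y_i - F(x_i)\big)$$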
Hyper-parameters
number of stages $M$ - The more stages we use, the more accurate the model, but the more likely we are to overfit.
learning rate $\eta$ - The primary purpose of the learning rate, or “shrinkage” as some papers call it, is to reduce overfitting of the overall model.
As Chen and Guestrin say in XGBoost: A Scalable Tree Boosting System, “shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.”
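To see the two hyper-parameters in practice, one option is scikit-learn's GradientBoostingRegressor, which exposes $M$ as `n_estimators` and $\eta$ as `learning_rate`. The dataset and the specific values below are arbitrary, chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Toy regression problem (synthetic data, for illustration only)
X, y = make_regression(n_samples=500, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=300,   # M: number of boosting stages
                                learning_rate=0.1,  # eta: shrinkage per stage
                                max_depth=2)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))  # R^2 on held-out data
```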