In machine learning we are given a set of data points, and the goal is to create a function that draws a nice curve through them. We call that function a model; it maps $X$ to $y$ and so makes predictions for some unknown $x$. Adding up a bunch of subfunctions to create a composite function is called additive modeling. Gradient boosting machines use additive modeling to gradually nudge an approximate model towards a really good one by adding simple submodels to a composite model.
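For example, here is a tiny sketch (with made-up subfunctions, not taken from the source) of what an additive model looks like in code; the composite prediction is simply the sum of a few simple terms:

```python
import numpy as np

# Hypothetical illustration of additive modeling: each subfunction alone is
# a crude fit, but their sum draws a nicer curve through the data.
x = np.linspace(0, 10, 100)

f1 = lambda x: 2.0 * np.ones_like(x)   # constant baseline
f2 = lambda x: 0.5 * x                 # linear trend
f3 = lambda x: np.sin(x)               # periodic wiggle

composite = f1(x) + f2(x) + f3(x)      # additive model: sum of subfunctions
```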

Boosting

Boosting combines multiple weak models sequentially into a single composite model. The idea is that, as we introduce more weak models, the overall model becomes a stronger and stronger predictor.

Gradient boosting machines (GBM)


Let's picture the boosting approach as a game of golf. The golfer initially whacks the ball towards the hole at $y$ but only gets as far as $f_0(X)$. The golfer then repeatedly taps the ball more softly, working it towards the hole, reassessing direction and distance to the hole at each stage.

GBM for regression pseudocode

  1. Train the first model $f_0(X)$ to learn the target $y$ given $X$. It predicts the average of the $y$ values in its leaf nodes.
  2. Compute the difference between the target $y$ and the prediction of the first model, $y - f_0(X)$; this is called the residual or direction vector.
  3. Train the next weak model $\Delta_1(X)$ on the residual of the previous model, i.e. $y - f_0(X)$.
  4. The new prediction model $F_1(X)$ is the sum of the previous model and a nudge, $\Delta_1(X)$, multiplied by the learning rate $\eta$: $$F_1(X) = f_0(X) + \eta \Delta_1(X)$$
  5. Likewise, train every subsequent weak model $\Delta_m(X)$ on the residual of the composite model so far, i.e. $y - F_{m-1}(X)$.
  6. Adding all the weak models to the initial $f_0(X)$ gives a strong composite model $F_M(X)$:

$$\begin{align}
\hat y & = f_0(X) + \eta \Delta_1(X) + \eta \Delta_2(X) + \dots + \eta \Delta_M(X) \\
& = f_0(X) + \eta \sum_{m=1}^M \Delta_m(X) \\
& = F_M(X)
\end{align}$$

Or, written as a recurrence:

$$F_0(X) = f_0(X)$$

$$F_m(X) = F_{m-1}(X) + \eta \Delta_m(X)$$
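Putting the recurrence into code, here is a minimal sketch of the training loop. This is an assumed implementation, not the source's; it uses shallow scikit-learn regression trees as the weak models $\Delta_m$ and the squared-error residual as the direction vector:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=100, eta=0.1, max_depth=1):
    f0 = np.mean(y)                       # f_0(X): here simply the average of y
    Fm = np.full(len(y), f0)              # running composite prediction F_m(X)
    deltas = []
    for m in range(M):
        residual = y - Fm                 # direction vector y - F_{m-1}(X)
        delta = DecisionTreeRegressor(max_depth=max_depth)
        delta.fit(X, residual)            # train weak model on the residual
        Fm += eta * delta.predict(X)      # F_m(X) = F_{m-1}(X) + eta * Delta_m(X)
        deltas.append(delta)
    return f0, deltas

def gbm_predict(X, f0, deltas, eta=0.1):
    # y_hat = f_0(X) + eta * sum_m Delta_m(X)
    return f0 + eta * sum(d.predict(X) for d in deltas)
```

Prediction simply sums every weak model's contribution on top of $f_0(X)$, exactly as in the composite formula above.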

Two common direction vector choices

  • When the $\Delta_m(X)$ models are trained on the residual, $y - F_{m-1}(X)$, the overall model is optimized for the squared error (L2) loss function.

  • When the $\Delta_m(X)$ models are trained on the sign of the residual, $\text{sign}(y - F_{m-1}(X))$, the overall model is optimized for the absolute error (L1) loss function.
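As a quick illustration with made-up numbers, the two choices differ only in what the weak model is asked to fit:

```python
import numpy as np

# Current composite prediction F_prev versus target y (toy values).
y = np.array([10.0, 20.0, 30.0])
F_prev = np.array([12.0, 15.0, 30.0])

residual = y - F_prev               # train Delta_m on this -> squared error (L2)
sign_vector = np.sign(y - F_prev)   # train Delta_m on this -> absolute error (L1)

print(residual)      # [-2.  5.  0.]
print(sign_vector)   # [-1.  1.  0.]
```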

Hyper-parameters

  • number of stages $M$ - The more stages we use, the more accurate the model becomes, but the more likely we are to overfit.

  • learning rate $\eta$ - The primary purpose of the learning rate, or “shrinkage” as some papers call it, is to reduce overfitting of the overall model.

As Chen and Guestrin say in XGBoost: A Scalable Tree Boosting System, “shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.”
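For reference, both knobs appear directly in common implementations. A minimal sketch using scikit-learn's GradientBoostingRegressor (the parameter values here are illustrative, not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=300,   # M: number of boosting stages
    learning_rate=0.1,  # eta: shrinkage applied to each weak model
    max_depth=2,        # depth of each weak tree (another common knob)
)
# model.fit(X_train, y_train); model.predict(X_test)
```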

References