Abhishek Kumar - 10 months ago

Boosted decision trees are very popular among Kaggle competition winners and are known for high accuracy on classification problems. Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak classifiers, optimizing a differentiable loss function.
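
To make this concrete, here is a minimal sketch of gradient boosting with squared loss, where each weak learner is a shallow regression tree fit to the negative gradient (the residuals) of the loss at the current ensemble prediction. This is illustrative only: the synthetic data and hyperparameters are arbitrary placeholders, and it is not how XGBoost or LightGBM implement boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (placeholder).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100
prediction = np.zeros_like(y)  # start from a constant (zero) model
trees = []

for _ in range(n_rounds):
    # For squared loss, the negative gradient is just the residual y - F(x).
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2)  # weak learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def ensemble_predict(X_new):
    """Sum the scaled contributions of all weak learners."""
    return sum(learning_rate * t.predict(X_new) for t in trees)
```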

Recently, the Cortana Intelligence and ML Blog Team published comparative results on their blog and provided the details of their testing. The key difference between XGBoost and LightGBM lies in how the trees are built and ensembled. There are two different strategies for growing the trees: level-wise and leaf-wise. The level-wise strategy grows the tree level by level: each node splits the data, prioritizing the nodes closer to the tree root. The leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change. Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit; leaf-wise growth tends to excel on larger datasets, where it is considerably faster than level-wise growth.


Level-wise growth strategy vs. leaf-wise growth strategy. The level-wise strategy adds complexity by extending the depth of the tree level by level. In contrast, the leaf-wise strategy generates branches by optimizing a loss.
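
As a rough sketch of how the two strategies surface in practice, the growth strategy maps to different library parameters: LightGBM grows leaf-wise and is bounded mainly by the number of leaves, while XGBoost grows level-wise by default and is bounded by tree depth. The hyperparameter values below are placeholders, not tuned settings.

```python
import lightgbm as lgb
import xgboost as xgb

# LightGBM grows trees leaf-wise; complexity is capped mainly by num_leaves.
lgb_model = lgb.LGBMClassifier(num_leaves=31, max_depth=-1)  # -1 = no depth limit

# XGBoost grows trees level-wise by default; complexity is capped by max_depth.
xgb_model = xgb.XGBClassifier(max_depth=6)
```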

A key challenge in training boosted decision trees is the computational cost of finding the best split for each leaf. Conventional techniques find the exact split for each leaf, and require scanning through all the data in each iteration. A different approach approximates the split by building histograms of the features. That way, the algorithm doesn't need to evaluate every single value of the features to compute the split, but only the bins of the histogram, whose number is bounded. This approach turns out to be much more efficient for large datasets, without adversely affecting accuracy.
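
The following is a simplified sketch of histogram-based split finding for a single feature, not the actual XGBoost or LightGBM implementation: gradients and hessians are accumulated per bin in one pass, and candidate splits are evaluated only at bin boundaries using a simplified version of the usual second-order gain. The bin count, regularization value, and inputs are assumptions for illustration.

```python
import numpy as np

def best_split_from_histogram(feature, grad, hess, n_bins=256, lam=1.0):
    # 1. Bucket feature values into a bounded number of bins.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(feature, edges)

    # 2. Accumulate gradient and hessian sums per bin (one pass over the data).
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    # 3. Evaluate candidate splits only at the bin boundaries,
    #    instead of at every distinct feature value.
    g_left, h_left = np.cumsum(g_hist)[:-1], np.cumsum(h_hist)[:-1]
    g_total, h_total = g_hist.sum(), h_hist.sum()
    g_right, h_right = g_total - g_left, h_total - h_left
    gain = (g_left**2 / (h_left + lam)
            + g_right**2 / (h_right + lam)
            - g_total**2 / (h_total + lam))
    best = int(np.argmax(gain))
    return best, gain[best]  # best bin boundary index and its gain
```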

XGBoost started in 2014, and it has become popular due to its use in many winning Kaggle competition entries. XGBoost was originally based on a level-wise growth algorithm, but it has recently added an option for leaf-wise growth that implements split approximation using histograms.
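
As a sketch, enabling the histogram-based method with leaf-wise ("lossguide") growth in XGBoost looks roughly like the following; the data and parameter values are placeholders for illustration.

```python
import numpy as np
import xgboost as xgb

# Placeholder data just to make the example self-contained.
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "hist",       # histogram-based split approximation
    "grow_policy": "lossguide",  # leaf-wise growth, similar to LightGBM
    "max_leaves": 255,           # caps tree size under leaf-wise growth
    "objective": "binary:logistic",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```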

The team performed many experiments and found that XGBoost and LightGBM achieved similar accuracy metrics (F1-scores are reported in their post), so they focused on training times in their blog post. The table below shows the training times and the training-time ratios between the two libraries for both CPU and GPU implementations.


Benchmark of XGBoost vs. LightGBM: training times and training-time ratios between the libraries on CPU and GPU. Where the XGBoost GPU training time does not appear (-), the run hit an out-of-memory error. For the Airline dataset with XGBoost on CPU (-*), the computation was stopped after 5 hours. The best training time for each dataset is shown in boldface.

The results from their study show that:

  1. XGBoost and LightGBM achieve similar accuracy metrics.
  2. LightGBM has lower training time than XGBoost and its histogram-based variant, XGBoost hist, on all test datasets, for both CPU and GPU implementations. The training-time difference between the two libraries depends on the dataset and can be as large as 25x.
  3. The XGBoost GPU implementation does not scale well to large datasets and ran out of memory in half of the tests.
  4. XGBoost hist may be significantly slower than the original XGBoost when feature dimensionality is high.

Note: These are the experiment results as reported at https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/ and have yet to be analysed and peer-reviewed by the data science community.