CatBoost — A new game of Machine Learning

Gradient Boosted Decision Trees and Random Forest are among the best ML models for heterogeneous tabular datasets.

CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.

CatBoost, the new kid on the block, has been around for a little more than a year now, and it is already threatening XGBoost, LightGBM and H2O.

Why CatBoost?

Better Results

GBDT Algorithms Benchmark

Faster Predictions

Left: CPU, Right: GPU

Batteries Included

GBDT Algorithms with default parameters Benchmark

Other noteworthy CatBoost features are feature interactions, object importance and snapshot support. In addition to classification and regression, CatBoost supports ranking out of the box.

Battle Tested

The Algorithm

Classic Gradient Boosting

Gradient Boosting on Wikipedia

CatBoost Secret Sauce

Categorical Feature Handling

Ordered Target Statistic

To fight this prediction shift, CatBoost uses a more effective strategy. It relies on the ordering principle and is inspired by online learning algorithms, which receive training examples sequentially in time. In this setting, the Target Statistic (TS) value for each example relies only on the observed history.
To adapt this idea to a standard offline setting, CatBoost introduces an artificial “time”: a random permutation σ1 of the training examples.
Then, for each example, it uses all the available “history” (the examples that precede it in the permutation) to compute its Target Statistic.
Note that with only one random permutation, examples early in the permutation get a Target Statistic with higher variance than later ones, because they have less history to draw on. To counter this, CatBoost uses different permutations for different steps of gradient boosting.
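To make the ordering principle concrete, here is a minimal Python sketch. It is not CatBoost's actual implementation. It encodes a single categorical column so that each example's statistic is computed only from the targets of same-category examples that come before it in a random permutation. The function name, the smoothing constants prior and a, and the fixed seed are assumptions made for illustration.

```python
import random


def ordered_target_statistic(categories, targets, prior=0.5, a=1.0, seed=0):
    """Encode one categorical column with an ordered target statistic.

    `prior` and `a` are illustrative smoothing constants (assumptions),
    used so that examples with an empty history still get a value.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the artificial "time"

    sums = {}    # running sum of targets per category, over the history
    counts = {}  # running number of examples per category, over the history
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Only examples that precede idx in the permutation contribute here.
        encoded[idx] = (s + a * prior) / (c + a)
        # The example's own target joins the history only afterwards.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded, order


if __name__ == "__main__":
    cats = ["red", "red", "blue", "red", "blue"]
    ys = [1, 0, 1, 1, 0]
    values, order = ordered_target_statistic(cats, ys)
    print("permutation:", order)
    print("encoded:", values)
```

The first example in the permutation has no history, so it falls back to the prior. Drawing a new permutation (a different seed) at each boosting step changes which examples end up in that position.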

One Hot Encoding

CatBoost’s Secret Sauce

Ordered Boosting

CatBoost Ordered Boosting and Tree Building

To avoid prediction shift, CatBoost uses the same permutation for the Target Statistic (σ1) and for boosting (σ2), i.e. σ1 = σ2. This guarantees that the target yi is not used for training the model Mi, neither for the Target Statistic calculation nor for the gradient estimation.
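The idea that an example's own target never feeds into the model used to estimate its gradient can be shown with a toy sketch. The assumptions are strong: a squared loss, and a "model" that is just the mean of the targets seen so far, standing in for the trees. The function name is hypothetical and this is not how CatBoost builds trees.

```python
import random


def ordered_residuals(targets, seed=0):
    """Toy ordered-boosting step: residual of each example w.r.t. a model
    fitted only on the examples that precede it in a random permutation.

    The stand-in model predicts the mean of the preceding targets
    (0.0 when there is no history yet).
    """
    n = len(targets)
    order = list(range(n))
    random.Random(seed).shuffle(order)

    residuals = [0.0] * n
    running_sum, seen = 0.0, 0
    for idx in order:
        prediction = running_sum / seen if seen else 0.0
        # Negative gradient of the squared loss: y_i - prediction.
        residuals[idx] = targets[idx] - prediction
        # y_i becomes visible only to the examples that come later.
        running_sum += targets[idx]
        seen += 1
    return residuals, order


if __name__ == "__main__":
    res, order = ordered_residuals([1.0, 2.0, 3.0, 4.0])
    print("permutation:", order)
    print("residuals:", res)
```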

Tuning CatBoost

Important Parameters

one_hot_max_size — As mentioned before, CatBoost uses one-hot encoding for all features with at most one_hot_max_size unique values. In our case, the categorical features have many unique values, so we will not use one-hot encoding, but depending on the dataset it may be a good idea to adjust this parameter.

learning_rate & n_estimators — The smaller the learning_rate, the more n_estimators are needed to fit the model. The usual approach is to start with a relatively high learning_rate, tune the other parameters, and then decrease the learning_rate while increasing n_estimators.

max_depth — Depth of the base trees. This parameter has a high impact on training time.

subsample — Sample rate of rows. It cannot be used with the Bayesian bootstrap type (bootstrap_type='Bayesian').

colsample_bylevel (alias rsm) — Sample rate of columns. The similarly named colsample_bytree and colsample_bynode are XGBoost and LightGBM parameter names, not CatBoost ones.

l2_leaf_reg — L2 regularization coefficient

random_strength — Every split gets a score, and random_strength adds some randomness to that score, which helps to reduce overfitting.
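The parameters above can be collected into a single dictionary and passed to the model. The sketch below is only a starting point. The numeric values are arbitrary placeholders, not recommendations, and the helper names build_params and make_model are hypothetical. bootstrap_type is the CatBoost parameter that selects the row-sampling scheme, and the check mirrors the restriction on subsample mentioned above.

```python
def build_params(bootstrap_type="Bernoulli", subsample=0.8):
    """Assemble a CatBoost parameter dictionary (placeholder values)."""
    if bootstrap_type == "Bayesian" and subsample is not None:
        raise ValueError(
            "subsample cannot be combined with bootstrap_type='Bayesian'"
        )
    params = {
        "learning_rate": 0.1,       # start relatively high, lower it later
        "n_estimators": 500,        # raise when learning_rate goes down
        "max_depth": 6,             # deeper trees train more slowly
        "one_hot_max_size": 4,      # one-hot only for low-cardinality features
        "colsample_bylevel": 0.8,   # sample rate of columns
        "l2_leaf_reg": 3.0,         # L2 regularization coefficient
        "random_strength": 1.0,     # randomness added to split scores
        "bootstrap_type": bootstrap_type,
    }
    if subsample is not None:
        params["subsample"] = subsample  # sample rate of rows
    return params


def make_model(params):
    """Build the classifier; requires `pip install catboost`."""
    from catboost import CatBoostClassifier  # imported lazily on purpose
    return CatBoostClassifier(**params)


if __name__ == "__main__":
    print(build_params())
```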

Check out the recommended spaces for tuning here
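To get a feel for the learning_rate and n_estimators trade-off described above, here is a back-of-the-envelope calculation under a heavily idealized assumption: every tree fits the current residual perfectly, so each iteration shrinks the residual by a factor of (1 - learning_rate). Real boosting does not behave this cleanly, and the function name and tolerance are illustrative choices.

```python
import math


def iterations_needed(learning_rate, tolerance=0.01):
    """Iterations until the residual falls to `tolerance` of its start,
    assuming each step multiplies the residual by (1 - learning_rate)."""
    if not 0.0 < learning_rate < 1.0:
        raise ValueError("learning_rate must be strictly between 0 and 1")
    return math.ceil(math.log(tolerance) / math.log(1.0 - learning_rate))


if __name__ == "__main__":
    for lr in (0.5, 0.1, 0.01):
        print(lr, iterations_needed(lr))
```

Even in this toy setting, dividing the learning rate by ten multiplies the number of iterations needed by roughly ten, which is why the two parameters are tuned together.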

Model Exploration with CatBoost

CatBoost’s Feature Importance
CatBoost’s Feature Interactions
CatBoost’s Object Importance
SHAP values can be used for other ensembles as well

Not only does it build one of the most accurate models on whatever dataset you feed it, with minimal data prep required, but CatBoost also gives by far the best open-source interpretation tools available today and a way to productionize your model fast.

That is why CatBoost is revolutionising the game of Machine Learning, forever. And that is why learning to use it is a fantastic opportunity to up-skill and remain relevant as a data scientist. But more interestingly, CatBoost poses a threat to the status quo of the data scientist (like me) who enjoys a position where it is supposedly tedious to build a highly accurate model given a dataset. CatBoost is changing that. It is making highly accurate modeling accessible to everyone.

Image taken from CatBoost official documentation:

Building highly accurate models at blazing speeds

Installing CatBoost, on the other hand, is a piece of cake. Just run:

pip install catboost

Data prep needed

Unlike most Machine Learning models available today, CatBoost requires minimal data preparation. It handles:

· Missing values for Numeric variables

· Non-encoded Categorical variables.
Note that missing values have to be filled beforehand for Categorical variables. Common approaches replace NAs with a new category ‘missing’ or with the most frequent category.

· For GPU users only, it does handle Text variables as well.
Unfortunately I could not test this feature as I am working on a laptop with no GPU available. [EDIT: a new upcoming version will handle Text variables on CPU. See comments for more info from the head of CatBoost team.]
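The two common ways of filling missing categorical values mentioned above can be sketched in plain Python. The function name and the strategy labels are made up for this example. With a pandas DataFrame you would typically do the same per column.

```python
from collections import Counter


def fill_missing_categories(values, strategy="missing"):
    """Replace None in a categorical column.

    strategy="missing"       -> use a new category called 'missing'
    strategy="most_frequent" -> use the most frequent observed category
    """
    if strategy == "missing":
        fill = "missing"
    elif strategy == "most_frequent":
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("column has no observed values")
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    column = ["a", None, "b", "a"]
    print(fill_missing_categories(column))
    print(fill_missing_categories(column, strategy="most_frequent"))
```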

Building models

As with XGBoost, you have the familiar sklearn syntax with some additional features specific to CatBoost.

from catboost import CatBoostClassifier  # Or CatBoostRegressor
model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train)  # X_train, y_train: your training features and labels

Or, if you want a sleek visual of how the model learns and whether it starts overfitting, use plot=True and pass your test set in the eval_set parameter:

from catboost import CatBoostClassifier  # Or CatBoostRegressor
model_cb = CatBoostClassifier()
model_cb.fit(X_train, y_train, plot=True, eval_set=(X_test, y_test))

Note that you can display multiple metrics at the same time, even more human-friendly metrics like Accuracy or Precision. Supported metrics are listed here. See example below:

Monitoring both Logloss and AUC at training time on both training and test sets
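One way to ask for several metrics at once is the custom_metric parameter, sketched below. The helper names are hypothetical and the metric list is just an example. The catboost import is kept inside the function so the parameter part can be read and run on its own.

```python
def metric_params(loss="Logloss", extra_metrics=("AUC", "Accuracy", "Precision")):
    """Parameters asking CatBoost to report extra metrics during training."""
    return {
        "loss_function": loss,                 # the metric being optimized
        "custom_metric": list(extra_metrics),  # metrics that are only reported
    }


def fit_with_metrics(X_train, y_train, X_test, y_test, plot=False):
    """Train while monitoring the metrics on both sets.

    Requires `pip install catboost`; set plot=True inside a notebook.
    """
    from catboost import CatBoostClassifier
    model = CatBoostClassifier(**metric_params())
    model.fit(X_train, y_train, eval_set=(X_test, y_test), plot=plot)
    return model


if __name__ == "__main__":
    print(metric_params())
```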

You can even use cross-validation and observe the average & standard deviation of accuracies of your model on the different splits:
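As a rough sketch of how this can look in code: CatBoost ships a cv function that takes a Pool and a parameter dictionary and returns per-iteration means and standard deviations across folds. The wrapper names, the iteration count and the choice of the sample standard deviation in summarize_folds are assumptions made for illustration.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Mean and (sample) standard deviation of per-fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)


def run_catboost_cv(X, y, fold_count=5):
    """Cross-validate with CatBoost; requires `pip install catboost`."""
    from catboost import Pool, cv
    params = {
        "loss_function": "Logloss",
        "eval_metric": "Accuracy",
        "iterations": 200,  # placeholder value
    }
    # Returns a table with mean and std columns for each metric.
    return cv(pool=Pool(X, y), params=params, fold_count=fold_count)


if __name__ == "__main__":
    mean, std = summarize_folds([0.80, 0.90, 1.00])
    print("accuracy: %.2f +/- %.2f" % (mean, std))
```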


CatBoost is quite similar to XGBoost. To fine-tune the model appropriately, first set the early_stopping_rounds to a finite number (like 10 or 50) and start tweaking the model’s parameters.
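What early_stopping_rounds does can be illustrated with generic "patience" logic: training stops once the evaluation loss has not improved for that many consecutive rounds. The sketch below is a simplified stand-in written for this post, not CatBoost's overfitting detector, and the function name is hypothetical.

```python
def early_stopping_iteration(eval_losses, early_stopping_rounds):
    """Return (best_iteration, last_iteration_run) for a loss curve.

    Stops as soon as `early_stopping_rounds` iterations have passed
    without a new best (lowest) evaluation loss.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= early_stopping_rounds:
            return best_iter, i
    return best_iter, len(eval_losses) - 1


if __name__ == "__main__":
    losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
    print(early_stopping_iteration(losses, early_stopping_rounds=3))
    print(early_stopping_iteration(losses, early_stopping_rounds=10))
```

With a small value, training stops early and misses the later improvement at the last iteration. With a large value, it runs to the end. That is the trade-off to keep in mind when choosing between something like 10 and 50.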

This Blog is written by Anamika Jha, Business Analyst, Affine.

Affine is a provider of analytics solutions, working with global organizations to solve their strategic and day-to-day business problems.