Absent levels in categorical splits are always sent left, causing prediction biases #11659

@ldesreumaux

Description

When using categorical features with enable_categorical=True, XGBoost currently always routes unseen/absent levels to the left child in decision tree splits.

This behavior introduces prediction bias, matching the "absent levels problem" described in Au (2018), "Random Forests, Decision Trees, and Categorical Predictors: The 'Absent Levels' Problem" (JMLR).

Minimal Reproducible Example

The bias appears whenever test-time categories were not present when a split was decided.

import numpy as np
import xgboost as xgb
from sklearn.metrics import r2_score

# -------------------------------
# Create synthetic train/test data
# -------------------------------

# Training categories: 0 and 1 only
X_train = np.array([0]*50 + [1]*50).reshape(-1, 1)
y_train = np.array([5.0]*50 + [-5.0]*50)

# Test categories: 1 (seen) and 42 (unseen)
X_test = np.array([1]*50 + [42]*50).reshape(-1, 1)
# True targets: category 1 -> -5, category 42 -> +5 (unseen behaves like category 0)
y_test = np.array([-5.0]*50 + [5.0]*50)

# Wrap into DMatrix, marking the single column as categorical
dtrain = xgb.DMatrix(X_train, label=y_train, feature_types=['c'], enable_categorical=True)
dtest  = xgb.DMatrix(X_test,  label=y_test,  feature_types=['c'], enable_categorical=True)

# -------------------------------
# Train model
# -------------------------------
params = {
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "seed": 0,
}

model = xgb.train(params, dtrain)

# -------------------------------
# Evaluate
# -------------------------------
y_pred = model.predict(dtest)

print("Test R2:", r2_score(y_test, y_pred))

Observed behavior:

  • Test R2 is negative (-0.93).
  • Unseen category 42 is routed to the same side as category 1 (always left), leading to biased predictions; the quick probe below makes this routing visible.
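
As a quick check (assuming the booster trained above is still in scope), predicting a few arbitrary codes makes the routing visible; the codes 99 and 123 are hypothetical probes added here, not part of the original report.

probe = np.array([0, 1, 42, 99, 123]).reshape(-1, 1)
dprobe = xgb.DMatrix(probe, feature_types=['c'], enable_categorical=True)

# Per the observation above, the unseen codes (42, 99, 123) all come back
# with the same prediction as category 1, i.e. they were routed left.
print(model.predict(dprobe))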

Expected behavior:

  • Test R2 should be positive.
  • XGBoost should provide a better strategy for absent/unseen categorical levels instead of always sending them left.
  • In line with the conclusion of Au (2018), a good candidate is the Random heuristic: sending unseen levels left or right at random, weighted by the training sample sizes in each child node (see the sketch after this list).
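
For illustration only, here is a minimal sketch of that heuristic as standalone Python; route_absent_level is a hypothetical helper, not an existing XGBoost API, and the child sample counts are assumed to come from the recorded tree statistics.

import numpy as np

rng = np.random.default_rng(0)

def route_absent_level(n_left, n_right):
    """Random heuristic (Au, 2018): route an absent level to the left child
    with probability proportional to the left child's share of training
    samples. Hypothetical helper -- not part of XGBoost's API."""
    p_left = n_left / (n_left + n_right)
    return "left" if rng.random() < p_left else "right"

# A split whose children each saw 50 training rows routes an unseen level
# to either side with probability 0.5.
print(route_absent_level(50, 50))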

While the issue is illustrated here with a synthetic dataset, it was originally spotted on a real-world dataset with multiple high-cardinality categorical features, where XGBoost produced an extreme negative test R2 of -163 due to this problem.
