When using categorical features with `enable_categorical=True`, XGBoost currently always routes unseen/absent levels to the left child in decision tree splits.
This behavior introduces prediction bias, matching the "absent levels problem" described in Au (2018), *Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem* (JMLR).
Minimal Reproducible Example
The bias appears whenever test-time categories were not present when a split was decided.
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import r2_score

# -------------------------------
# Create synthetic train/test data
# -------------------------------
# Training categories: 0 and 1 only
X_train = np.array([0] * 50 + [1] * 50).reshape(-1, 1)
y_train = np.array([5.0] * 50 + [-5.0] * 50)

# Test categories: 1 (seen) and 42 (unseen)
X_test = np.array([1] * 50 + [42] * 50).reshape(-1, 1)
# True targets: category 1 -> -5, category 42 -> +5 (unseen behaves like category 0)
y_test = np.array([-5.0] * 50 + [5.0] * 50)

# Wrap into DMatrix, marking the single column as categorical
dtrain = xgb.DMatrix(X_train, label=y_train, feature_types=["c"], enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, feature_types=["c"], enable_categorical=True)

# -------------------------------
# Train model
# -------------------------------
params = {
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "seed": 0,
}
model = xgb.train(params, dtrain)

# -------------------------------
# Evaluate
# -------------------------------
y_pred = model.predict(dtest)
print("Test R2:", r2_score(y_test, y_pred))
```
Observed behavior:
- Test R2 is negative (-0.93).
- Unseen category `42` is routed to the same side as category `1` (always left), leading to biased predictions; the routing can be checked with the tree-dump sketch below.
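One way to confirm the routing is a short sketch continuing from the reproduction script above: dump the first tree as JSON and compare the predictions for a seen and an unseen category. The exact fields in the dump vary across XGBoost versions, so this is illustrative rather than authoritative.

```python
import json

# Continuing from the reproduction script above.
# Dump the first tree as JSON; for categorical splits the node records
# which categories are sent to one child, and unseen categories fall back
# to the node's default direction (per this report, always the left child).
tree = json.loads(model.get_dump(dump_format="json")[0])
print(json.dumps(tree, indent=2))

# Compare predictions: the first test row is category 1 (seen),
# the last test row is category 42 (unseen). Both land in the same leaf.
preds = model.predict(dtest)
print("prediction for category 1: ", preds[0])
print("prediction for category 42:", preds[-1])
```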
Expected behavior:
- Test R2 should be positive.
- XGBoost should provide a better strategy for absent/unseen categorical levels instead of always sending them left.
- In line with Au (2018)'s conclusion, a good candidate is the Random heuristic: sending unseen levels left or right at random, weighted by the training sample sizes in each child node (a rough user-level approximation is sketched below).
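The Random heuristic itself would have to live inside XGBoost's tree traversal, but as a rough illustration here is a hypothetical user-level approximation (the helper `remap_unseen_categories` is not part of any XGBoost API): before prediction, each unseen level is remapped, per row, to a training level drawn at random with probability proportional to that level's training frequency. Per-row remapping only approximates per-node random routing, since the draw happens once per row rather than at every split.

```python
import numpy as np

def remap_unseen_categories(train_col, test_col, rng=None):
    """Hypothetical workaround: replace categories unseen during training
    with a training category sampled proportionally to its frequency.

    This approximates Au (2018)'s Random heuristic at the row level; a
    faithful implementation would randomize the routing at every split
    node inside the tree, which requires changes to XGBoost itself.
    """
    rng = rng or np.random.default_rng(0)
    levels, counts = np.unique(train_col, return_counts=True)
    probs = counts / counts.sum()
    seen = set(levels.tolist())
    out = test_col.copy()
    for i, value in enumerate(out):
        if value not in seen:
            out[i] = rng.choice(levels, p=probs)
    return out

# Usage, continuing from the reproduction script above (hypothetical):
X_test_remapped = X_test.copy()
X_test_remapped[:, 0] = remap_unseen_categories(X_train[:, 0], X_test[:, 0])
dtest_remapped = xgb.DMatrix(
    X_test_remapped, label=y_test, feature_types=["c"], enable_categorical=True
)
print("Test R2 after remapping:", r2_score(y_test, model.predict(dtest_remapped)))
```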
While the issue is illustrated here with a synthetic dataset, it was originally spotted on a real-world dataset with multiple high-cardinality categorical features, where XGBoost produced an extremely negative test R2 of -163 because of this problem.