When using categorical features with `enable_categorical=True`, XGBoost currently always routes unseen/absent levels to the left child in decision tree splits.
This behavior introduces prediction bias, matching the "absent levels problem" described in Au (2018), *Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem* (JMLR).
Minimal Reproducible Example
The bias appears whenever test-time categories were not present when a split was decided.
```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import r2_score

# -------------------------------
# Create synthetic train/test data
# -------------------------------
# Training categories: 0 and 1 only
X_train = np.array([0] * 50 + [1] * 50).reshape(-1, 1)
y_train = np.array([5.0] * 50 + [-5.0] * 50)

# Test categories: 1 (seen) and 42 (unseen)
X_test = np.array([1] * 50 + [42] * 50).reshape(-1, 1)
# True targets: category 1 -> -5, category 42 -> +5 (unseen behaves like category 0)
y_test = np.array([-5.0] * 50 + [5.0] * 50)

# Wrap into DMatrix, marking the single column as categorical
dtrain = xgb.DMatrix(X_train, label=y_train, feature_types=["c"], enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, feature_types=["c"], enable_categorical=True)

# -------------------------------
# Train model
# -------------------------------
params = {
    "tree_method": "hist",
    "objective": "reg:squarederror",
    "seed": 0,
}
model = xgb.train(params, dtrain)

# -------------------------------
# Evaluate
# -------------------------------
y_pred = model.predict(dtest)
print("Test R2:", r2_score(y_test, y_pred))
```
Observed behavior:
- Test R2 is negative (-0.93).
- Unseen category `42` is routed to the same side as category `1` (always left), leading to biased predictions; the routing can be checked with the tree-dump sketch below.
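One way to confirm the routing is a short sketch continuing from the reproduction script above: dump the first tree as JSON and compare the predictions for a seen and an unseen category. The exact fields in the dump vary across XGBoost versions, so this is illustrative rather than authoritative.

```python
import json

# Continuing from the reproduction script above.
# Dump the first tree as JSON; for categorical splits the node records
# which categories are sent to one child, and unseen categories fall back
# to the node's default direction (per this report, always the left child).
tree = json.loads(model.get_dump(dump_format="json")[0])
print(json.dumps(tree, indent=2))

# Compare predictions: the first test row is category 1 (seen),
# the last test row is category 42 (unseen). Both land in the same leaf.
preds = model.predict(dtest)
print("prediction for category 1: ", preds[0])
print("prediction for category 42:", preds[-1])
```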
Expected behavior:
- Test R2 should be positive.
- XGBoost should provide a better strategy for absent/unseen categorical levels instead of always sending them left.
- In line with Au (2018)'s conclusion, a good candidate is the Random heuristic: sending unseen levels left or right at random, weighted by the training sample sizes in each child node (a rough user-level approximation is sketched below).
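The Random heuristic itself would have to live inside XGBoost's tree traversal, but as a rough illustration here is a hypothetical user-level approximation (the helper `remap_unseen_categories` is not part of any XGBoost API): before prediction, each unseen level is remapped, per row, to a training level drawn at random with probability proportional to that level's training frequency. Per-row remapping only approximates per-node random routing, since the draw happens once per row rather than at every split.

```python
import numpy as np

def remap_unseen_categories(train_col, test_col, rng=None):
    """Hypothetical workaround: replace categories unseen during training
    with a training category sampled proportionally to its frequency.

    This approximates Au (2018)'s Random heuristic at the row level; a
    faithful implementation would randomize the routing at every split
    node inside the tree, which requires changes to XGBoost itself.
    """
    rng = rng or np.random.default_rng(0)
    levels, counts = np.unique(train_col, return_counts=True)
    probs = counts / counts.sum()
    seen = set(levels.tolist())
    out = test_col.copy()
    for i, value in enumerate(out):
        if value not in seen:
            out[i] = rng.choice(levels, p=probs)
    return out

# Usage, continuing from the reproduction script above (hypothetical):
X_test_remapped = X_test.copy()
X_test_remapped[:, 0] = remap_unseen_categories(X_train[:, 0], X_test[:, 0])
dtest_remapped = xgb.DMatrix(
    X_test_remapped, label=y_test, feature_types=["c"], enable_categorical=True
)
print("Test R2 after remapping:", r2_score(y_test, model.predict(dtest_remapped)))
```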
While the issue is illustrated here with a synthetic dataset, it was originally spotted on a real-world dataset with multiple high-cardinality categorical features, where XGBoost produced an extremely negative test R2 of -163 because of this problem.