Custom Decision Trees is a Python package that lets you build Decision Tree and Random Forest models with advanced configuration:
Define your own splitting criteria in Python (documented in the following sections).
This feature is particularly useful in "cost-dependent" scenarios, as in the examples below (a minimal metric sketch follows the list):
- Trading Movements Classification: When the goal is to maximize economic profit, the metric can be set to economic profit, optimizing tree splitting accordingly.
- Churn Prediction: To minimize false negatives, metrics like F1 score or recall can guide the splitting process.
- Fraud Detection: Splitting can be optimized based on the proportion of fraudulent transactions identified relative to the total, rather than overall classification accuracy.
- Marketing Campaigns: The splitting can focus on maximizing expected revenue from customer segments identified by the tree.
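For instance, a profit-driven metric might reward splits that separate profitable from unprofitable observations. Here is a minimal sketch assuming the `MetricBase` interface described later in this README; the `ProfitGap` name and the convention that column 0 of `metric_data` holds each observation's profit are illustrative assumptions, not part of the package:

```python
import numpy as np

from custom_decision_trees.metrics import MetricBase


class ProfitGap(MetricBase):
    """Hypothetical metric: favor splits whose two sides differ
    strongly in mean profit."""

    def compute_metric(
        self,
        metric_data: np.ndarray,
        mask: np.ndarray,
    ) -> tuple[float, dict]:
        # Assumption: column 0 of metric_data holds each observation's profit.
        profit = metric_data[:, 0]
        side1, side2 = profit[mask], profit[~mask]
        # An empty side carries no information, hence a zero score.
        if len(side1) == 0 or len(side2) == 0:
            return 0.0, {}
        gap = abs(side1.mean() - side2.mean())
        return float(gap), {"profit_gap": round(gap, 3)}
```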
Allow trees to split nodes on one or more simultaneous conditions.
An example of multi-condition splitting on the Titanic dataset is shown in the `print_tree` output below.

Other features:
- Supports classification and regression
- Supports multiclass classification
- Supports standard decision tree parameters (max_depth, min_samples_split, max_features, n_estimators, etc.)
- Supports STRING type explanatory variables
- Ability to control the number of splitting options per variable when optimizing a split (`nb_max_cut_options_per_var` parameter)
- Ability to cap the number of splits tested per node, to avoid overly long computations in multi-condition mode (`nb_max_split_options_per_node` parameter)
- Ability to parallelize computations (`n_jobs` parameter)
Splitting in a decision tree is achieved by optimizing a metric. For example, Gini optimization consists in maximizing the Gini gain:

- The Gini index represents the impurity of a group of observations, based on the proportion $p_i$ of observations in each class (0 and 1):

$$Gini = 1 - \sum_{i} p_i^2$$

- The metric to be maximized is $\Delta_{Gini}$, the difference between the Gini index of the parent node and the weighted average of the Gini indices of the two child nodes ($L$ and $R$):

$$\Delta_{Gini} = Gini_{parent} - \left(\frac{n_L}{n} Gini_L + \frac{n_R}{n} Gini_R\right)$$

At each node, the tree algorithm finds the split that maximizes $\Delta_{Gini}$ (equivalently, minimizes the weighted impurity of the child nodes).
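As a quick sanity check of the formula, here is a small standalone computation (plain NumPy, independent of the package):

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    # Gini impurity for binary labels.
    p1 = np.mean(y)
    return 1 - p1**2 - (1 - p1)**2

y = np.array([0, 0, 0, 1, 1, 1])                         # parent node: Gini = 0.5
mask = np.array([True, True, True, True, False, False])  # candidate split

delta = (
    gini(y)
    - np.mean(mask) * gini(y[mask])          # left child [0,0,0,1]: Gini = 0.375
    - (1 - np.mean(mask)) * gini(y[~mask])   # right child [1,1]: Gini = 0.0
)
print(delta)  # 0.5 - (4/6)*0.375 - (2/6)*0.0 = 0.25
```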
See the `./notebooks/` folder for complete examples.
```bash
pip install custom-decision-trees
```
To integrate a custom metric, define a class implementing the `compute_metric` and `compute_delta` methods, then pass an instance of this class to the classifier.
Example of a metric class implementing the Gini index:
```python
import numpy as np

from custom_decision_trees.metrics import MetricBase


class Gini(MetricBase):

    def __init__(
        self,
        n_classes: int = 2,
    ) -> None:
        self.n_classes = n_classes
        # Impurity of a uniformly mixed node, returned for empty sides.
        self.max_impurity = 1 - 1 / n_classes

    def compute_gini(
        self,
        metric_data: np.ndarray,
    ) -> float:
        # Column 0 of metric_data holds the class labels.
        y = metric_data[:, 0]
        nb_obs = len(y)
        if nb_obs == 0:
            return self.max_impurity
        props = [np.sum(y == i) / nb_obs for i in range(self.n_classes)]
        metric = 1 - np.sum([prop**2 for prop in props])
        return float(metric)

    def compute_metric(
        self,
        metric_data: np.ndarray,
        mask: np.ndarray,
    ) -> tuple[float, dict]:
        # Delta Gini: parent impurity minus the weighted impurity
        # of the two sides of the candidate split.
        gini_parent = self.compute_gini(metric_data)
        gini_side1 = self.compute_gini(metric_data[mask])
        gini_side2 = self.compute_gini(metric_data[~mask])
        delta = (
            gini_parent
            - gini_side1 * np.mean(mask)
            - gini_side2 * (1 - np.mean(mask))
        )
        # Metadata can carry any extra values you want to track per node.
        metadata = {"gini": round(gini_side1, 3)}
        return float(delta), metadata
```
Once you have instantiated the model with your custom metric, all you have to do is use the `.fit` and `.predict_probas` methods:
```python
from custom_decision_trees import DecisionTreeClassifier

gini = Gini()

decision_tree = DecisionTreeClassifier(
    metric=gini,
    max_depth=2,
    nb_max_conditions_per_node=2,  # Set to 1 for a traditional decision tree
)

decision_tree.fit(
    X=X_train,
    y=y_train,
    metric_data=metric_data,  # Columns consumed by compute_metric (labels in column 0 here)
)

probas = decision_tree.predict_probas(
    X=X_test,
)

probas[:5]
```

```
>>> array([[0.75308642, 0.24691358],
       [0.36206897, 0.63793103],
       [0.75308642, 0.24691358],
       [0.36206897, 0.63793103],
       [0.90243902, 0.09756098]])
```
You can also display the decision tree, with the values of your metric, using the `print_tree` method:
```python
decision_tree.print_tree(
    feature_names=features,
    metric_name="MyMetric",
)
```

```
>>> [0] 712 obs -> MyMetric = 0.0
| [1] (x["Sex"] <= 0.0) AND (x["Pclass"] <= 2.0) | 157 obs -> MyMetric = 0.16
| | [3] (x["Age"] <= 2.0) AND (x["Fare"] > 26.55) | 1 obs -> MyMetric = 0.01
| | [4] (x["Age"] > 2.0) OR (x["Fare"] <= 26.55) | 156 obs -> MyMetric = 0.01
| [2] (x["Sex"] > 0.0) OR (x["Pclass"] > 2.0) | 555 obs -> MyMetric = 0.16
| | [5] (x["SibSp"] <= 2.0) AND (x["Age"] <= 8.75) | 27 obs -> MyMetric = 0.05
| | [6] (x["SibSp"] > 2.0) OR (x["Age"] > 8.75) | 528 obs -> MyMetric = 0.05
```
Or plot it with the `plot_tree` method:

```python
decision_tree.plot_tree(
    feature_names=features,
    metric_name="delta gini",
)
```
Same with the Random Forest classifier:
```python
from custom_decision_trees import RandomForestClassifier

random_forest = RandomForestClassifier(
    metric=gini,
    n_estimators=10,
    max_depth=2,
    nb_max_conditions_per_node=2,
)

random_forest.fit(
    X=X_train,
    y=y_train,
    metric_data=metric_data,
)

probas = random_forest.predict_probas(
    X=X_test,
)
```
The "regression" mode is used in exactly the same way as "classification", i.e., by specifying the metric from a Python class.
```python
from custom_decision_trees import DecisionTreeRegressor

your_metric = YourMetric()

decision_tree_regressor = DecisionTreeRegressor(
    metric=your_metric,
    max_depth=2,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features=None,
    nb_max_conditions_per_node=2,
    nb_max_cut_options_per_var=10,
    n_jobs=1,
)

decision_tree_regressor.fit(
    X=X,
    y=y,
    metric_data=metric_data,
)
```
See the `/notebooks` folder for complete examples.