@jesper-olsen

Hi Michael,

Thank you for creating this excellent educational repository for decision trees.

As I was working with the code, I made some updates to modernise it, improve its usability, and add proper documentation. I wanted to offer these changes back to the original project in case they are useful to you or to future learners who find this repo.

Here is a summary of the improvements:

Code Modernization & Refactoring:

  • Refactored uniqueCounts, entropy, and gini to use the more efficient and Pythonic collections.Counter and math.log2.
  • Added guard clauses to entropy and gini to prevent ZeroDivisionError.
  • Improved overall code formatting and readability, and added docstrings to all major functions.
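For readers who want to see what the Counter-based refactor might look like, here is a minimal sketch. It assumes each row is a list whose last element is the class label, and it uses the PEP 8 name `unique_counts`. The exact signatures in the PR may differ.

```python
import math
from collections import Counter


def unique_counts(rows):
    """Count how often each class label occurs (label assumed to be the last column)."""
    return Counter(row[-1] for row in rows)


def entropy(rows):
    """Shannon entropy, in bits, of the class labels in rows."""
    if not rows:  # guard clause: avoids dividing by zero on an empty split
        return 0.0
    total = len(rows)
    return -sum(
        (n / total) * math.log2(n / total) for n in unique_counts(rows).values()
    )


def gini(rows):
    """Gini impurity of the class labels in rows."""
    if not rows:  # same guard as in entropy
        return 0.0
    total = len(rows)
    return 1.0 - sum((n / total) ** 2 for n in unique_counts(rows).values())
```

Both functions return 0.0 for an empty list, which is one plausible convention for the guard clause. Raising an error instead would be an equally valid design.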

New Features & Usability:

  • Command-Line Interface: Integrated argparse to allow users to select the dataset and criterion (gini/entropy) from the command line.
  • Graphviz Plotting: Added a --plot option to generate a tree visualisation and save it to a file (e.g. in .png format).
  • Improved Classification Output: Created a dedicated function to print classification results in a clear, human-readable, and deterministic format.
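The command-line interface described above could be wired up roughly as follows. Only the `--plot` option name and the gini/entropy choice come from the description. The positional `dataset` argument, the `--criterion` flag name, the default criterion, and the idea that `--plot` takes an output filename are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser resembling the CLI described in this PR (names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Grow a decision tree from a CSV dataset."
    )
    # Hypothetical positional argument: path to the CSV file to learn from.
    parser.add_argument("dataset", help="path to a CSV data file")
    # Hypothetical flag name; the PR only states that gini/entropy is selectable.
    parser.add_argument(
        "--criterion",
        choices=["gini", "entropy"],
        default="entropy",
        help="impurity measure used to choose splits",
    )
    # --plot is named in the PR; taking a filename argument is an assumption.
    parser.add_argument(
        "--plot",
        metavar="FILE",
        default=None,
        help="write a Graphviz rendering of the tree to FILE",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["data/example.csv", "--criterion", "gini"])
```

Restricting `--criterion` with `choices` lets argparse reject unknown values before any tree is built.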

Project Structure & Documentation:

  • New README: Added a comprehensive README.md with instructions on prerequisites, installation, and usage.
  • Better File Naming: Renamed the main script from implementation.py to decision_tree.py to follow standard Python conventions.
  • Organized Directory Structure: Moved data files (.csv) into a data/ directory to separate code from data.

No pressure to merge this, as I know the project is not actively maintained, but I wanted to offer these improvements back. Thanks again for the great resource!

The member variable `results` in the `DecisionNode` class was ambiguous and required a comment to explain its purpose.

This commit renames it to `class_counts` to be more descriptive and self-documenting, reflecting that it holds the distribution of training samples in a leaf node. The conditional checks have also been updated to be more idiomatic (e.g., `if node.class_counts:`).

Improves the core algorithm's efficiency by refactoring the impurity
and pruning calculations to operate on Counter objects directly,
avoiding the creation of large temporary data lists.
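As a rough illustration of working on Counter objects directly, the sketch below computes entropy from label counts and evaluates a pruning decision by adding two Counters, without expanding them back into lists of rows. The function names and the weighted-gain formula are assumptions, not necessarily what the commit implements.

```python
import math
from collections import Counter


def entropy_from_counts(counts: Counter) -> float:
    """Shannon entropy, in bits, computed straight from class counts."""
    total = sum(counts.values())
    if total == 0:  # empty distribution: nothing to measure
        return 0.0
    return -sum(
        (n / total) * math.log2(n / total) for n in counts.values() if n > 0
    )


def merge_gain(true_counts: Counter, false_counts: Counter) -> float:
    """Entropy increase incurred by merging two sibling leaves into one.

    A pruning pass could merge the leaves when this value falls below a
    chosen threshold (the threshold itself is not shown here).
    """
    merged = true_counts + false_counts  # Counter addition, no temporary row lists
    total = sum(merged.values())
    if total == 0:
        return 0.0
    p = sum(true_counts.values()) / total
    return (
        entropy_from_counts(merged)
        - p * entropy_from_counts(true_counts)
        - (1 - p) * entropy_from_counts(false_counts)
    )
```

Merging two pure leaves of different classes yields a large gain, so they would be kept separate. Merging two leaves of the same class yields a gain of zero.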

- Adopts PEP 8 naming conventions and adds type hints.
- Fixes a potential ZeroDivisionError in missing data classification.
- Simplifies pre-pruning logic in the tree growth algorithm.
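The rename and the missing-data fix can be pictured with the sketch below. Apart from `DecisionNode` and `class_counts`, every name here is hypothetical. The blending rule for missing values (weighting each branch by its share of the samples) and the numeric `>=` test are assumptions about how the classifier behaves.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DecisionNode:
    """Internal nodes hold a test; leaves hold a non-empty class_counts."""
    col: int = -1
    value: Any = None
    true_branch: Optional["DecisionNode"] = None
    false_branch: Optional["DecisionNode"] = None
    class_counts: Counter = field(default_factory=Counter)


def classify(node: DecisionNode, observation: list) -> Counter:
    """Return class counts for an observation; None marks a missing value."""
    if node.class_counts:  # idiomatic leaf check
        return node.class_counts
    value = observation[node.col]
    if value is None:
        # Missing value: follow both branches and blend the results.
        true_counts = classify(node.true_branch, observation)
        false_counts = classify(node.false_branch, observation)
        t = sum(true_counts.values())
        f = sum(false_counts.values())
        total = t + f
        if total == 0:  # guard against ZeroDivisionError
            return Counter()
        blended = Counter()
        for label, n in true_counts.items():
            blended[label] += n * t / total
        for label, n in false_counts.items():
            blended[label] += n * f / total
        return blended
    if isinstance(value, (int, float)):
        go_true = value >= node.value  # numeric split
    else:
        go_true = value == node.value  # categorical split
    return classify(node.true_branch if go_true else node.false_branch, observation)


# Tiny hand-built tree: column 0 >= 5 leads to "yes", otherwise "no".
root = DecisionNode(
    col=0,
    value=5,
    true_branch=DecisionNode(class_counts=Counter({"yes": 3})),
    false_branch=DecisionNode(class_counts=Counter({"no": 1})),
)
```

Calling `classify(root, [None])` blends both leaves, so the result contains fractional counts rather than a single label.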