4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,14 +1,16 @@
## System and data files
*.pdf
*.csv
*.txt
*.DS_Store
lib/
build/
.vscode/
xcode/
*.json
.vs/
cpp_docs/doxyoutput/html
cpp_docs/doxyoutput/xml
cpp_docs/doxyoutput/latex

## R gitignore

2,865 changes: 2,865 additions & 0 deletions cpp_docs/Doxyfile

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions cpp_docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 changes: 21 additions & 0 deletions cpp_docs/README.md
@@ -0,0 +1,21 @@
# C++ API Documentation

## Building Documentation Locally

We are still working out the best way to deploy the C++ documentation online alongside the R and Python documentation.
In the meantime, to build the C++ documentation locally, first ensure that you have [doxygen](https://www.doxygen.nl/index.html) installed.
On macOS, this can be [done via homebrew](https://formulae.brew.sh/formula/doxygen) (i.e. `brew install doxygen`).
Next, you will need both the [Sphinx](https://www.sphinx-doc.org/en/master/) and [breathe](https://breathe.readthedocs.io/en/latest/dot_graphs.html) Python packages.

Now, navigate to the Python package's main directory (i.e. `cd [path/to/stochtree]`), build the C++ documentation via `doxygen`, and then run `sphinx-build` as below:

```
pip install --upgrade pip
pip install -r cpp_docs/requirements.txt
doxygen
sphinx-build -M html cpp_docs/ cpp_docs/build/
```

## Documentation Style

Module (class, function, etc.) documentation follows the format prescribed by [doxygen](https://www.doxygen.nl/manual/docblocks.html) for C++ code.
40 changes: 40 additions & 0 deletions cpp_docs/conf.py
@@ -0,0 +1,40 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# import os
# import sys
# sys.path.insert(0, os.path.abspath('../..'))
from pathlib import Path
CPP_DOC_PATH = Path(__file__).absolute().parent

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'stochtree'
copyright = '2024, Drew Herren'
author = 'Drew Herren'
release = '0.0.1'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'breathe'
]

templates_path = ['_templates']
exclude_patterns = []

# Breathe Configuration
breathe_projects = {"StochTree": str(CPP_DOC_PATH / "doxyoutput" / "xml")}
breathe_default_project = "StochTree"

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'furo'
html_static_path = ['_static']
42 changes: 42 additions & 0 deletions cpp_docs/dataset.rst
@@ -0,0 +1,42 @@
Dataset API
===========

Forest Dataset
--------------

The ``ForestDataset`` class is a wrapper around data needed to sample one or more tree ensembles.
Its core elements are

* **Covariates**: Features / variables used to partition the forests. Stored internally as a (column-major) ``Eigen::MatrixXd``.
* **Basis**: *[Optional]* basis vector used to define a "leaf regression" --- a partitioned linear model where covariates define the partitions and basis defines the regression variables.
  Also stored internally as a (column-major) ``Eigen::MatrixXd``.
* **Sample Weights**: *[Optional]* case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights.
  Stored internally as an ``Eigen::VectorXd``.
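
``Eigen::MatrixXd`` uses column-major storage by default, so all values of one covariate sit next to each other in memory. The sketch below illustrates that layout with a hypothetical ``ColumnMajorMatrix`` struct. It is a stand-in written for this example and is not the Eigen or ``stochtree`` type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major matrix (illustration only,
// not part of Eigen or stochtree).
struct ColumnMajorMatrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<double> data;  // length rows * cols, stored column by column

  // Entry (i, j) lives at offset j * rows + i: stepping down a column moves
  // by one element, stepping across a row moves by `rows` elements.
  double operator()(std::size_t i, std::size_t j) const {
    return data[j * rows + i];
  }
};
```

With this layout, scanning a single covariate across all observations reads a contiguous block of memory, which is the access pattern used when evaluating candidate splits on one feature.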

.. doxygenclass:: StochTree::ForestDataset
   :project: StochTree
   :members:

Random Effects Dataset
----------------------

The ``RandomEffectsDataset`` class is a wrapper around data needed to sample a group random effects model.
Its core elements are

* **Basis**: Vector of variables that have group-specific random coefficients. In the simplest additive group random effects model, this is a constant intercept of all ones.
  Stored internally as a (column-major) ``Eigen::MatrixXd``.
* **Group Indices**: Integer-valued indices of group membership. In a model with three groups, these indices would typically be 0, 1, and 2 (remapped from perhaps more descriptive labels in R or Python).
  Stored internally as an ``std::vector`` of integers.
* **Sample Weights**: *[Optional]* case weights for every observation in a training dataset. These may be heteroskedastic variance parameters or simply survey / case weights.
  Stored internally as an ``Eigen::VectorXd``.
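
The remapping of descriptive group labels to zero-based integer indices can be sketched as follows. ``RemapGroupLabels`` is a hypothetical helper written for this example, not a ``stochtree`` function, and it assumes indices are assigned in order of first appearance.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of stochtree): assign each distinct label a
// zero-based index in order of first appearance and return one index per row.
std::vector<int> RemapGroupLabels(const std::vector<std::string>& labels) {
  std::unordered_map<std::string, int> label_to_index;
  std::vector<int> indices;
  indices.reserve(labels.size());
  for (const std::string& label : labels) {
    auto it = label_to_index.find(label);
    if (it == label_to_index.end()) {
      // A new label gets the next unused index, which equals the map size.
      int next_index = static_cast<int>(label_to_index.size());
      it = label_to_index.emplace(label, next_index).first;
    }
    indices.push_back(it->second);
  }
  return indices;
}
```

The resulting vector has one entry per observation and could serve as the group index input described above.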

.. doxygenclass:: StochTree::RandomEffectsDataset
   :project: StochTree
   :members:

Other Classes and Types
-----------------------

.. doxygenenum:: StochTree::FeatureType
   :project: StochTree

10 changes: 10 additions & 0 deletions cpp_docs/index.rst
@@ -0,0 +1,10 @@
StochTree C++ API and Implementations
=====================================

This page documents the data structures and interfaces that constitute the ``stochtree`` C++ core.
It may be useful to researchers building novel tree algorithms or users seeking a deeper understanding of the algorithms implemented in ``stochtree``.

.. toctree::

   dataset
   tracking
   tree
35 changes: 35 additions & 0 deletions cpp_docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
39 changes: 39 additions & 0 deletions cpp_docs/requirements.txt
@@ -0,0 +1,39 @@
alabaster==0.7.13
Babel==2.15.0
beautifulsoup4==4.12.3
breathe==4.35.0
certifi==2024.2.2
charset-normalizer==3.3.2
docutils==0.20.1
furo==2024.5.6
idna==3.7
imagesize==1.4.1
importlib_metadata==7.1.0
Jinja2==3.1.4
joblib==1.4.2
MarkupSafe==2.1.5
numpy==1.24.4
packaging==24.0
pandas==2.0.3
pybind11==2.12.0
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
requests==2.32.2
scikit-learn==1.3.2
scipy==1.10.1
six==1.16.0
snowballstemmer==2.2.0
soupsieve==2.5
Sphinx==7.1.2
sphinx-basic-ng==1.0.0b2
sphinxcontrib-applehelp==1.0.4
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
threadpoolctl==3.5.0
tzdata==2024.1
urllib3==2.2.1
zipp==3.18.2
36 changes: 36 additions & 0 deletions cpp_docs/tracking.rst
@@ -0,0 +1,36 @@
Forest Sampling Tracker API
===========================

A truly minimalist tree ensemble library only needs

* A representation of a decision tree
* A container for grouping / storing ensembles of trees
* In-memory access to / representation of training data
* Routines / functions to construct the trees

Most algorithms for optimizing or sampling tree ensembles frequently perform the following operations

* Determine which leaf a training observation falls into for a decision tree (to compute its prediction and update the residual / outcome)
* Evaluate potential split candidates for a leaf of a decision tree

With only the "minimalist" tools above, these two tasks proceed largely as follows

* For every observation in the dataset, traverse the tree (runtime depends on the tree topology, but in a fully balanced tree with :math:`k` nodes, each traversal has time complexity :math:`O(\log (k))`).
* For every observation in the dataset, determine whether the observation falls into a given node and whether a proposed decision rule would be true.
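
The traversal in the first step can be sketched as below. ``SimpleTree`` and ``FindLeaf`` are hypothetical names used only for illustration. They assume a tree stored as parallel arrays with numeric "less than or equal" split rules, and they do not reflect the actual ``Tree`` class.

```cpp
#include <vector>

// Hypothetical array-based tree (illustration only). Node 0 is the root and
// a value of -1 in `left` marks a leaf.
struct SimpleTree {
  std::vector<int> left;           // id of left child, or -1 for a leaf
  std::vector<int> right;          // id of right child, or -1 for a leaf
  std::vector<int> split_feature;  // feature index tested at each internal node
  std::vector<double> threshold;   // go left when x[split_feature] <= threshold
};

// Walk from the root to a leaf for one observation `x`. The number of loop
// iterations equals the depth of the leaf that is reached.
int FindLeaf(const SimpleTree& tree, const std::vector<double>& x) {
  int node = 0;
  while (tree.left[node] != -1) {
    bool go_left = x[tree.split_feature[node]] <= tree.threshold[node];
    node = go_left ? tree.left[node] : tree.right[node];
  }
  return node;
}
```

Without any tracking, a function like this has to be called for every observation and every tree each time predictions or residuals are refreshed.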

These operations both perform unnecessary computation which can be avoided with some additional real-time tracking. Essentially, we want

1. A mapping from dataset row index to leaf node id for every tree in an ensemble (so that we can skip the tree traversal during prediction)
2. A mapping from leaf node id to dataset row indices for every tree in an ensemble (so that we can skip the full pass through the training data at split evaluation)
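
A minimal sketch of the two mappings for a single tree is shown below. ``SimpleLeafTracker`` and its methods are hypothetical and were written for this example. They are not the ``ForestTracker`` interface, which tracks every tree in an ensemble.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical single-tree tracker (illustration only).
class SimpleLeafTracker {
 public:
  // All observations start in the root node, which has id 0.
  explicit SimpleLeafTracker(int num_obs) : obs_to_leaf_(num_obs, 0) {
    std::vector<int>& root_rows = leaf_to_obs_[0];
    for (int i = 0; i < num_obs; ++i) root_rows.push_back(i);
  }

  // Mapping 1: dataset row index -> leaf node id (no tree traversal needed).
  int LeafOf(int row) const { return obs_to_leaf_[row]; }

  // Mapping 2: leaf node id -> dataset row indices (no full data pass needed).
  const std::vector<int>& RowsIn(int leaf) const { return leaf_to_obs_.at(leaf); }

  // Split `leaf` into two children. `goes_left` is indexed by dataset row.
  // Only the rows currently in `leaf` are visited.
  void Split(int leaf, int left_id, int right_id, const std::vector<bool>& goes_left) {
    std::vector<int> rows = std::move(leaf_to_obs_.at(leaf));
    leaf_to_obs_.erase(leaf);
    std::vector<int> left_rows, right_rows;
    for (int row : rows) {
      if (goes_left[row]) {
        left_rows.push_back(row);
        obs_to_leaf_[row] = left_id;
      } else {
        right_rows.push_back(row);
        obs_to_leaf_[row] = right_id;
      }
    }
    leaf_to_obs_[left_id] = std::move(left_rows);
    leaf_to_obs_[right_id] = std::move(right_rows);
  }

 private:
  std::vector<int> obs_to_leaf_;
  std::unordered_map<int, std::vector<int>> leaf_to_obs_;
};
```

Both mappings are updated together inside ``Split``, so the cost of keeping them current is proportional to the size of the leaf being split rather than the size of the dataset.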

.. 1. For every observation in a dataset, which leaf node of each tree does the sample fall into?
.. 2. For every leaf in a tree, which training set observations fall into that node?

Forest Tracker
--------------

The ``ForestTracker`` class is a wrapper around several implementations of the mappings discussed above.

.. doxygenclass:: StochTree::ForestTracker
   :project: StochTree
   :members:
20 changes: 20 additions & 0 deletions cpp_docs/tree.rst
@@ -0,0 +1,20 @@
Decision Tree API
=================

Tree
----

The fundamental building block of the C++ tree interface is the ``Tree`` class.

.. doxygenclass:: StochTree::Tree
   :project: StochTree
   :members:

Tree Split
----------

Numeric and categorical splits are represented by a ``TreeSplit`` class.

.. doxygenclass:: StochTree::TreeSplit
   :project: StochTree
   :members: