
Conversation

razdoburdin
Contributor

This PR introduces an optimization for CPU inference. For each tree, the top N levels are transformed into a compact array-based layout, which allows a branchless node-indexing rule: idx = 2 * idx + int(val < split_cond). To minimize memory overhead, the transformation from the standard tree structure to the array layout is performed on the fly for each block of data being processed. Even with this additional computation, the improved data locality of the cache-friendly array layout speeds up inference by up to ~2x (1.4x on average).
[image: benchmark results]
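As a minimal sketch of the branchless indexing rule described above (the function name, the 1-based layout, and the level-by-level storage are assumptions for illustration, not the PR's actual code):

```cpp
#include <array>
#include <cstddef>

// Hypothetical compact layout: split conditions of a complete binary tree,
// stored level by level with 1-based indexing (slot 0 is unused).
// The children of node idx live at 2*idx and 2*idx + 1, so the comparison
// result can be folded directly into the index without a branch.
template <std::size_t NumLevels>
std::size_t TraverseTopLevels(
    const std::array<float, (std::size_t{1} << NumLevels)>& split_cond,
    float val) {
  std::size_t idx = 1;  // root
  for (std::size_t level = 0; level + 1 < NumLevels; ++level) {
    // Branchless descent: idx = 2 * idx + int(val < split_cond).
    idx = 2 * idx + static_cast<std::size_t>(val < split_cond[idx]);
  }
  return idx;  // index within the compact array after the top levels
}
```

Because the loop body contains no data-dependent branch, the hot path avoids branch mispredictions, and the level-by-level array keeps the visited nodes close together in cache.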

@razdoburdin razdoburdin marked this pull request as draft June 20, 2025 13:50
@trivialfis
Member

Thank you for the optimization on the inference. Please unmark the "draft" status and ping me when the PR is ready for testing.


@Vika-F Vika-F left a comment


Cosmetic changes.

The next possible step would be to convert the trees into array-based representation only once, and not to do it for each block of data.

razdoburdin and others added 6 commits June 24, 2025 12:53
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
razdoburdin and others added 2 commits June 24, 2025 12:57
Co-authored-by: Victoriya Fedotova <viktoria.nn@gmail.com>
@razdoburdin
Contributor Author

The next possible step would be to convert the trees into array-based representation only once, and not to do it for each block of data.

That sounds reasonable and would further improve performance (at the cost of increased memory consumption).

@razdoburdin razdoburdin marked this pull request as ready for review June 24, 2025 12:24
@razdoburdin
Contributor Author

Thank you for the optimization on the inference. Please unmark the "draft" status and ping me when the PR is ready for testing.

Hi @trivialfis, the PR is ready for review.

@trivialfis
Member

cc @hcho3

Member

@trivialfis trivialfis left a comment


Still trying to understand the code, will give it a try later. In the meantime, could you please craft some specific unit tests for the new inference algorithm?

* We transform trees to the array layout for each block of data to avoid memory overhead.
* This makes the array layout inefficient for block_size == 1.
*/
const bool use_array_tree_layout = block_size > 1;
Member


What happens if this is a small online inference call? The input size could be a few samples per call.

Contributor Author


The default (original) implementation will be used.

for (std::size_t i = 0; i < block_size; ++i) {
bst_node_t nidx = 0;
if constexpr (use_array_tree_layout) {
nidx = p_nidx[i];
Member


unused?

Contributor Author


The optimized array-layout processing is effective only for nodes that are close to the root. For the other nodes, we still use the original method.

@razdoburdin
Contributor Author

Still trying to understand the code, will give it a try later. In the meantime, could you please craft some specific unit tests for the new inference algorithm?

I added some unit tests.

Member

@trivialfis trivialfis left a comment


I'm still trying to understand the code. In the meantime, let me do some refactoring this week and next to accommodate the new optimization. We need a better structure to handle all these:

  • Predict with scalar leaf.
  • Predict with vector leaf.
  • Array predict with scalar leaf.
  • Array predict with vector leaf.
  • Column split with scalar leaf.

I think I will split up the CPU predictor into multiple pieces.

* If the tree has additional levels, this array stores the node indices of the sub-trees at level kNumDeepLevels.
* This is necessary to continue processing nodes that are not eligible for array-based unrolling.
* The number of sub-trees packed into this array is equal to the number of nodes at tree level kNumDeepLevels,
* which is calculated as (1u << kNumDeepLevels) == kNodesCount + 1.
Member


What happens if the tree is not well balanced and is more like a linked list than a tree?

Contributor Author


In this case we add dummy nodes with NaN as the split condition. In these dummy nodes the decision is always "go right" (see also the comments at https://github.com/razdoburdin/xgboost/blob/b0eaa856e1246416f7f9538bcc004e7723d9b997/src/predictor/array_tree_layout.h#L154); the left children are not initialized.

So with the array layout we have to allocate all nodes, but some of them stay unpopulated unless the tree is perfectly balanced.

Initial tree:
[image: initial tree]

Tree with dummy (NaN-valued) nodes:
[image: tree with dummy nodes]

Member


Thank you for sharing, that makes sense.

*/
std::array<bst_node_t, kNodesCount + 1> nidx_in_tree_;

static bool IsLeaf(const RegTree& tree, bst_node_t nidx) {
Member


Is there a benefit of doing this C++ overloading rather than the simpler tree.IsLeaf? How much faster are we seeing?

Contributor Author


I did the overload to handle both RegTree and MultiTargetTree cases. Is there a better option?

Member


Use RegTree without extracting the Multi-target tree when populating the buffer, and delegate the dispatching to RegTree::LeftChild(bst_node_t nidx) instead of using the RegTree::Node::LeftChild. There's a check inside the RegTree::LeftChild:

  [[nodiscard]] bst_node_t LeftChild(bst_node_t nidx) const {
    if (IsMultiTarget()) {
      return this->p_mt_tree_->LeftChild(nidx);
    }
    return (*this)[nidx].LeftChild();
  }

@trivialfis
Member

I'm trying to clean up the CPU predictor. I will update this PR once it is finished.

@trivialfis
Member

I need to fix a perf regression caused by the new ordinal encoder.

@trivialfis
Member

I need to fix a perf regression caused by the new ordinal encoder.

This has been fixed. I will look deeper into this PR.

using DefaultLeftType =
    typename std::conditional_t<any_missing,
                                std::array<uint8_t, kNodesCount>,
                                struct Empty>;
Member


Suggested change:
-                                struct Empty>;
+                                Empty>;

@trivialfis
Member

trivialfis commented Aug 20, 2025

Thank you for expanding the tree layout. In the future (when you can prioritize it), do you think it's possible to create and store the layout inside the RegTree structure as an opt-in method call? My reasoning is as follows:

  • The existing RegTree and the multi-target tree already use a very similar layout, minus the dummy nodes. It might be easier/cleaner to do it there.
  • We can avoid complicating the predictor too much.
  • We can cache the result in the regtree structure to avoid repeated initialization.

You can define a std::unique_ptr<ArrayTree> inside the RegTree, set it to nullptr. Define a method to create the array tree when needed, and reset it back to nullptr if any non-const method is called.
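The lazy, cache-and-invalidate pattern suggested here could look roughly like the following (all type and method names are hypothetical stand-ins; RegTree's real interface differs):

```cpp
#include <memory>

// Hypothetical stand-in for the compact array-based layout.
struct ArrayTree {};

class Tree {
 public:
  // Build the array layout lazily on first use and cache it;
  // subsequent const callers reuse the cached copy.
  const ArrayTree* GetArrayLayout() const {
    if (!array_tree_) {
      array_tree_ = std::make_unique<ArrayTree>(/* built from this tree */);
    }
    return array_tree_.get();
  }

  // Any non-const mutation invalidates the cached layout.
  void ExpandNode(/* ... */) {
    array_tree_.reset();
    // ... actual node expansion ...
  }

 private:
  mutable std::unique_ptr<ArrayTree> array_tree_;  // nullptr until first use
};
```

A real implementation would also have to consider thread safety, since prediction may be called concurrently from multiple threads while the cache is being populated.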
