Some assorted feature requests #11686

@jamesram415

Description

Hi,

We have been using XGBoost for several weeks and have collated a set of feature requests that we think would be helpful to us.

We have an internal open source team who may be able to help out in either xgboost or dask if you think this could work.

Do these requests look feasible to you?

Thanks


1. Support immediate de-allocation of un-quantiled partitions

When building a dask-distributed QuantileDMatrix we see CPU memory usage of approx. 2x what we would expect from the raw dataset size.
This is an issue because CPU memory is currently the bottleneck for scaling distributed training to a larger number of rows.

It appears that this second copy is a result of not being able to de-allocate the dask partitions as soon as they are consumed by the DaskQuantileDMatrix's internal iterator (see linked issue).
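To illustrate, here is roughly the pattern we use, as a minimal sketch (the parquet path and `label` column are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client
import xgboost as xgb

client = Client()

# Placeholder dataset; any dask collection shows the same pattern.
df = dd.read_parquet("train/*.parquet")
X, y = df.drop(columns=["label"]), df["label"]

# While DaskQuantileDMatrix's internal iterator walks the partitions to
# build the quantized representation, the raw un-quantized partitions stay
# referenced by the dask graph, so both copies coexist in CPU memory
# (the ~2x peak described above).
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)

# The request: release each raw partition as soon as the iterator has
# consumed it, bringing peak CPU memory close to 1x the dataset size.
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "device": "cuda"},
    dtrain,
    num_boost_round=100,
)
```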

2. Reduce memory pressure when using FP16

Currently we pass FP32 raw data partitions to the DaskQuantileDMatrix constructor. The data is then binned, greatly reducing its footprint on device.
One might expect that passing in half-precision raw data partitions would greatly reduce CPU memory usage at the ETL stage (before hitting the GPU). However, this appears not to be the case: passing in FP16 partitions reduces peak CPU memory usage by only a small fraction, and it appears that an upcast to FP32 occurs prior to quantisation under the hood.

We'd like to supply raw FP16 or FP8 partitions directly to the constructor without this upcasting to reduce peak CPU memory by 2x or 4x.
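A sketch of the experiment, with synthetic dask arrays standing in for our real partitions:

```python
import dask.array as da
from dask.distributed import Client
import xgboost as xgb

client = Client()

# Synthetic stand-ins for our raw partitions, cast to half precision.
X = da.random.random((10_000_000, 50), chunks=(1_000_000, 50)).astype("float16")
y = da.random.random(10_000_000, chunks=1_000_000).astype("float32")

# Observed: peak CPU memory falls by only a small fraction relative to
# float32 input, consistent with an internal upcast to FP32 happening
# before quantisation.
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
```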

3. External Memory Mode extensions

In external memory mode, pre-quantiled cached pages are loaded into GPU memory on each boosting round. This method is not bottlenecked by CPU memory and hence achieves better GPU occupancy.
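For context, the pattern we use follows the data-iterator interface from the docs. This sketch uses synthetic numpy batches in place of our real loader; `ExtMemQuantileDMatrix` is the experimental external-memory class in recent XGBoost releases, and the iterator protocol may differ slightly across versions:

```python
import os
import numpy as np
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Feeds the dataset to XGBoost one batch at a time."""

    def __init__(self, n_batches: int) -> None:
        self._n = n_batches
        self._i = 0
        self._rng = np.random.default_rng(0)
        # cache_prefix tells XGBoost where to write the quantized pages.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data) -> bool:
        if self._i == self._n:
            return False  # signal the end of one pass over the data
        # Synthetic batch; real code would load a shard from disk here.
        X = self._rng.random((100_000, 50), dtype=np.float32)
        y = self._rng.random(100_000, dtype=np.float32)
        input_data(data=X, label=y)
        self._i += 1
        return True

    def reset(self) -> None:
        self._i = 0  # rewind for the next pass

it = BatchIter(n_batches=8)
# This constructor performs the lengthy pass over the iterator that builds
# the quantile pages; the request below is to persist and reuse that work.
Xy = xgb.ExtMemQuantileDMatrix(it, max_bin=256)
booster = xgb.train({"tree_method": "hist", "device": "cuda"}, Xy, num_boost_round=100)
```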

There are two drawbacks to this approach which would be very valuable to address:

Slow time to first boosting round: Prior to training there is a lengthy process of iterating over the dataset to build the quantile matrix pages. This took approximately 30 minutes on a GH200 for a large dataset. At present these cached pages cannot be saved to disk and reused between identical runs across machines to avoid recomputation.

Limited distribution support: This method does not currently support distributed training across multiple GPUs. There is an experimental section in the docs that avoids using dask in favor of an internal xgboost tracker, but so far I have not had luck running it.
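For reference, this is roughly the tracker-based setup the experimental docs describe, as far as I understand it. Argument and method names here follow recent releases and may differ across versions, and the worker side is only sketched in comments:

```python
from xgboost.tracker import RabitTracker
import xgboost as xgb

# One process starts the tracker that coordinates the workers.
tracker = RabitTracker(n_workers=2, host_ip="127.0.0.1")
tracker.start()
args = tracker.worker_args()

# Each worker process would then join the collective and build its own
# external-memory matrix before training, roughly:
# with xgb.collective.CommunicatorContext(**args):
#     Xy = xgb.ExtMemQuantileDMatrix(BatchIter(n_batches=8), max_bin=256)
#     xgb.train({"tree_method": "hist", "device": "cuda"}, Xy)
```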

4. Support pre-quantized int values in construction of QuantileDMatrix
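Purely as illustration, something like the following is what we have in mind. This is hypothetical: neither a `quantized` flag nor a `cuts` argument exists in XGBoost today.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Pre-quantized integer bin codes and the per-feature cut points that
# produced them, computed upstream of XGBoost.
bin_indices = rng.integers(0, 256, size=(1_000, 20), dtype=np.uint8)
cut_points = [np.sort(rng.random(255)) for _ in range(20)]
y = rng.random(1_000)

# Requested (hypothetical API): accept the integer codes plus their cut
# points directly, skipping the internal sketching/quantisation pass:
# Xy = xgb.QuantileDMatrix(bin_indices, label=y, quantized=True, cuts=cut_points)
```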
