Some assorted feature requests #11686

@jamesram415

Description

Hi,

We have been using XGBoost for several weeks and have collated a set of feature requests that we think would be helpful to us.

We have an internal open source team who may be able to help out in either xgboost or dask if you think this could work.

Do these requests look feasible to you?

Thanks


1. Support immediate de-allocation of un-quantiled partitions

When building a dask-distributed QuantileDMatrix we see CPU memory usage of approx. 2x what we would expect from the raw dataset size.
This is an issue because CPU memory is currently the bottleneck for scaling distributed training to a larger number of rows.

It appears that this second copy is a result of not being able to de-allocate the dask partitions as soon as they are consumed by the DaskQuantileDMatrix's internal iterator (see linked issue).
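To illustrate, here is roughly the pattern we use, as a minimal sketch (the parquet path and `label` column are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client
import xgboost as xgb

client = Client()

# Placeholder dataset; any dask collection shows the same pattern.
df = dd.read_parquet("train/*.parquet")
X, y = df.drop(columns=["label"]), df["label"]

# While DaskQuantileDMatrix's internal iterator walks the partitions to
# build the quantized representation, the raw un-quantized partitions stay
# referenced by the dask graph, so both copies coexist in CPU memory
# (the ~2x peak described above).
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)

# The request: release each raw partition as soon as the iterator has
# consumed it, bringing peak CPU memory close to 1x the dataset size.
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "device": "cuda"},
    dtrain,
    num_boost_round=100,
)
```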

2. Reduce memory pressure when using FP16

Currently we pass FP32 raw data partitions to the DaskQuantileDMatrix constructor. The data is then binned, greatly reducing its footprint on device.
One might expect that passing in half-precision raw data partitions would greatly reduce CPU memory usage at the ETL stage (before hitting the GPU). However, this appears not to be the case: passing in FP16 partitions reduces peak CPU memory usage by only a small fraction, and it appears that an upcast to FP32 occurs prior to quantisation under the hood.

We'd like to supply raw FP16 or FP8 partitions directly to the constructor without this upcasting to reduce peak CPU memory by 2x or 4x.
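A sketch of the experiment, with synthetic dask arrays standing in for our real partitions:

```python
import dask.array as da
from dask.distributed import Client
import xgboost as xgb

client = Client()

# Synthetic stand-ins for our raw partitions, cast to half precision.
X = da.random.random((10_000_000, 50), chunks=(1_000_000, 50)).astype("float16")
y = da.random.random(10_000_000, chunks=1_000_000).astype("float32")

# Observed: peak CPU memory falls by only a small fraction relative to
# float32 input, consistent with an internal upcast to FP32 happening
# before quantisation.
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)
```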

3. External Memory Mode extensions

In external memory mode, pre-quantiled cached pages are loaded into GPU memory on each boosting round. This method is not bottlenecked by CPU memory and hence achieves better GPU occupancy.
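For context, the pattern we use follows the data-iterator interface from the docs. This sketch uses synthetic numpy batches in place of our real loader; `ExtMemQuantileDMatrix` is the experimental external-memory class in recent XGBoost releases, and the iterator protocol may differ slightly across versions:

```python
import os
import numpy as np
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Feeds the dataset to XGBoost one batch at a time."""

    def __init__(self, n_batches: int) -> None:
        self._n = n_batches
        self._i = 0
        self._rng = np.random.default_rng(0)
        # cache_prefix tells XGBoost where to write the quantized pages.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data) -> bool:
        if self._i == self._n:
            return False  # signal the end of one pass over the data
        # Synthetic batch; real code would load a shard from disk here.
        X = self._rng.random((100_000, 50), dtype=np.float32)
        y = self._rng.random(100_000, dtype=np.float32)
        input_data(data=X, label=y)
        self._i += 1
        return True

    def reset(self) -> None:
        self._i = 0  # rewind for the next pass

it = BatchIter(n_batches=8)
# This constructor performs the lengthy pass over the iterator that builds
# the quantile pages; the request below is to persist and reuse that work.
Xy = xgb.ExtMemQuantileDMatrix(it, max_bin=256)
booster = xgb.train({"tree_method": "hist", "device": "cuda"}, Xy, num_boost_round=100)
```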

There are two drawbacks to this approach which would be very valuable to address:

Slow time to first boosting round: Prior to training there is a lengthy process of iterating over the dataset to build the quantile matrix pages. This took approximately 30 minutes on a GH200 for a large dataset. At present these cached pages cannot be saved to disk and reused between identical runs across machines to avoid recomputation.

Limited distribution support: This method does not currently support distributed training across multiple GPUs. There is an experimental section in the docs that avoids using dask in favor of an internal xgboost tracker, but so far I have not had luck running it.
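For reference, this is roughly the tracker-based setup the experimental docs describe, as far as I understand it. Argument and method names here follow recent releases and may differ across versions, and the worker side is only sketched in comments:

```python
from xgboost.tracker import RabitTracker
import xgboost as xgb

# One process starts the tracker that coordinates the workers.
tracker = RabitTracker(n_workers=2, host_ip="127.0.0.1")
tracker.start()
args = tracker.worker_args()

# Each worker process would then join the collective and build its own
# external-memory matrix before training, roughly:
# with xgb.collective.CommunicatorContext(**args):
#     Xy = xgb.ExtMemQuantileDMatrix(BatchIter(n_batches=8), max_bin=256)
#     xgb.train({"tree_method": "hist", "device": "cuda"}, Xy)
```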

4. Support pre-quantized int values in construction of QuantileDMatrix
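Purely as illustration, something like the following is what we have in mind. This is hypothetical: neither a `quantized` flag nor a `cuts` argument exists in XGBoost today.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Pre-quantized integer bin codes and the per-feature cut points that
# produced them, computed upstream of XGBoost.
bin_indices = rng.integers(0, 256, size=(1_000, 20), dtype=np.uint8)
cut_points = [np.sort(rng.random(255)) for _ in range(20)]
y = rng.random(1_000)

# Requested (hypothetical API): accept the integer codes plus their cut
# points directly, skipping the internal sketching/quantisation pass:
# Xy = xgb.QuantileDMatrix(bin_indices, label=y, quantized=True, cuts=cut_points)
```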
