
Enhancement: Secure Data Transmission for all_reduce in TDX-based Distributed ML Training #61


Description


Dear oneCCL Team,

We are reaching out to request an enhancement to Intel oneCCL targeting secure data transmission for distributed machine learning (ML) training workloads. Specifically, we are looking for built-in encryption within oneCCL's all_reduce operation, which is critical for secure gradient sharing across nodes equipped with Intel Trust Domain Extensions (TDX).

Use Case:
Our ML training workflows use PyTorch's Distributed Data Parallel (DDP) running on a cluster of TDX-enabled nodes. While TDX provides a robust isolated execution environment, it does not by itself protect data once it leaves the trust domain, so securing the all_reduce traffic between TDX machines is essential for maintaining the confidentiality of sensitive gradient information. A minimal sketch of our setup follows.
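
For reference, our workflow has roughly the following shape: a standard DDP script on the oneCCL backend (registered via the oneccl_bindings_for_pytorch package). The all_reduce calls we would like encrypted are the ones DDP issues on the gradients during backward():

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Rank, world size, and master address are injected by the launcher
    # (e.g. mpirun or torchrun), as in any standard DDP setup.
    dist.init_process_group(backend="ccl")

    model = DDP(nn.Linear(128, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(32, 128)
    loss = model(inputs).sum()
    loss.backward()  # DDP runs all_reduce on gradients here, over oneCCL
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```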

Requirement:
The feature should enable encryption (preferably conforming to standard protocols such as TLS) of the data payloads communicated between nodes during all_reduce. The goal is to protect in-flight data, complementing the in-use protection that TDX already provides; the sketch below illustrates the protocol level we mean.
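
To be explicit, the following is an illustration only, not oneCCL code: it shows the kind of standard transport-layer protection we are requesting, using Python's stock ssl module. The peer host name, port, and CA file are placeholders.

```python
import socket
import ssl

# Illustration only: node-to-node traffic wrapped in standard TLS.
# "peer-node.example", port 4433, and "cluster-ca.pem" are placeholders.
context = ssl.create_default_context(cafile="cluster-ca.pem")

with socket.create_connection(("peer-node.example", 4433)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="peer-node.example") as tls:
        tls.sendall(b"gradient shard bytes")  # encrypted on the wire
        reply = tls.recv(4096)                # decrypted by the TLS layer
```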

Justification:

- Guards against interception of sensitive gradient data during distributed training
- Transparently fortifies existing ML workflows without altering user code
- Helps maintain the security posture promised by TDX throughout the data lifecycle

We understand performance is critical, so we suggest exposing this as an optional toggle that enables secure transmission only when users need it; a hypothetical sketch of such a toggle follows.
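
As a loud caveat, the environment variables below do not exist in oneCCL today. They are a hypothetical sketch of the opt-in shape we have in mind, modeled on existing oneCCL knobs such as CCL_ATL_TRANSPORT, so that training scripts would run unchanged:

```python
import os

# HYPOTHETICAL: CCL_SECURE_TRANSPORT, CCL_TLS_CERT, and CCL_TLS_KEY are not
# existing oneCCL variables. They sketch how an opt-in encryption toggle
# could follow oneCCL's existing environment-variable configuration style.
os.environ.setdefault("CCL_SECURE_TRANSPORT", "tls")         # hypothetical
os.environ.setdefault("CCL_TLS_CERT", "/etc/ccl/node.pem")   # hypothetical
os.environ.setdefault("CCL_TLS_KEY", "/etc/ccl/node.key")    # hypothetical

# Existing user code would then run unchanged:
# dist.init_process_group(backend="ccl")
```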

Looking forward to your thoughts on this proposal. Thanks for your commitment to advancing collective communications.

Best regards
