
Conversation

LRY89757
Contributor

  1. Optimize the data path: from List -> CPU Tensor -> List -> rpc_param -> GPU Tensor down to List -> rpc_param -> GPU Tensor
  2. Wrap the async forward pass only once
  3. Only the rank-0 worker runs the sampler and returns its result
  4. Pass the rpc param to worker 0 only, instead of to all workers; worker 0 then broadcasts the param to the other workers via NCCL.
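The control flow described above can be sketched as a minimal pure-Python simulation. This is not the actual ColossalAI implementation: `Worker`, `rpc_param`, `broadcast`, and `run_request` are hypothetical stand-ins, there is no real RPC, NCCL, or GPU work, and the "sampler" is a greedy argmax placeholder. It only illustrates the shape of the design: the driver talks to worker 0 alone, worker 0 fans the param out, and only rank 0 returns a sampled result.

```python
# Hypothetical stand-ins for the real RPC/NCCL machinery; pure-Python simulation.
class Worker:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size
        self.param = None

    def forward(self, param):
        # Every rank runs the (tensor-parallel) forward pass on the same param.
        self.param = param
        logits = [x * 2 for x in param["input_ids"]]  # stand-in for model output
        # Step 3: only the rank-0 worker runs the sampler and returns a result.
        if self.rank == 0:
            return self.sample(logits)
        return None

    def sample(self, logits):
        # Greedy "sampler" stand-in: index of the largest logit.
        return max(range(len(logits)), key=logits.__getitem__)


def broadcast(src_worker, workers):
    # Step 4 stand-in: worker 0 broadcasts the rpc param to the other ranks.
    # The real code would do this with NCCL (e.g. torch.distributed.broadcast).
    for w in workers:
        if w.rank != src_worker.rank:
            w.param = src_worker.param


def run_request(workers, input_ids):
    # Step 1: build the rpc param directly from the Python list
    # (List -> rpc_param), skipping the intermediate CPU tensor round-trip.
    rpc_param = {"input_ids": input_ids}
    # Step 4: the driver passes the param to worker 0 only...
    workers[0].param = rpc_param
    # ...and worker 0 broadcasts it to everyone else.
    broadcast(workers[0], workers)
    results = [w.forward(w.param) for w in workers]
    # Step 3: only rank 0's sampler output is returned to the caller.
    return results[0]


workers = [Worker(rank, world_size=4) for rank in range(4)]
print(run_request(workers, [1, 5, 3]))  # -> 1 (index of the largest logit)
```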

Performance is still not good enough and needs further optimization.

@LRY89757 LRY89757 requested a review from a team as a code owner May 27, 2024 07:16
@LRY89757 LRY89757 changed the title [Infer] Inference Distributed RPC Framework Optimization [WIP][Infer] Inference Distributed RPC Framework Optimization May 27, 2024
@LRY89757 LRY89757 added the tensor-parallel related to the tensor-parallel feature label May 27, 2024
Labels: colossal-inference, tensor-parallel