Performance improvement opportunities for the future #614

@oldcpple

Description

Hi there, I've been following this work for a few months, and I find it a really amazing idea to run LLMs over the Internet. I'm also trying to improve Petals' inference performance in my local environment. My view is that simply wrapping the Transformers library for inference is somewhat inefficient, since many optimization mechanisms for LLM serving have appeared in recent papers and projects, for example Flash Attention, Paged Attention, and continuous batching. It would be even better if Petals could integrate one or a few of these optimizations, and I wonder if the authors have any future plans for this. I'm personally trying to integrate vLLM with Petals, or in other words, to enable vLLM to run across different nodes over the Internet.
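
For reference, here is a minimal sketch of one of the optimizations mentioned above (Flash Attention), using PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which can dispatch to a FlashAttention kernel on supported GPUs. This is only an illustration under my assumptions (PyTorch >= 2.0, fp16 tensors on CUDA), not Petals' actual attention code; the helper function below is hypothetical.

```python
# Sketch only: swap a hand-written attention implementation for PyTorch's
# fused scaled_dot_product_attention, which can use a FlashAttention backend.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    # PyTorch picks an efficient backend (FlashAttention / memory-efficient /
    # plain math) automatically based on dtype, device, and shapes.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

# Toy usage
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
out = attention(q, k, v)  # (1, 8, 128, 64)
```

Paged Attention and continuous batching would be deeper changes (they affect how the KV cache and request scheduling work on each server), which is why I'm experimenting with vLLM as the per-node backend rather than patching these in one by one.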
