Description
Hi there, I've been following this work for a few months and find it a really amazing idea to run LLMs over the Internet. I've also been trying to improve Petals' inference performance in my local environment. My view is that simply wrapping the Transformers library for inference is somewhat inefficient, since many optimization mechanisms for LLM serving have appeared in recent papers and projects, for example FlashAttention, PagedAttention, and continuous batching. It would be even more compelling if Petals could integrate one or a few of these optimizations. I wonder if the authors have any future plans on this. I'm personally trying to integrate vLLM with Petals, or in other words, to enable vLLM to run on different nodes over the Internet.
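
For concreteness, here is a minimal sketch (not Petals' actual code) of the kind of drop-in win FlashAttention offers: PyTorch 2.x's `torch.nn.functional.scaled_dot_product_attention` can dispatch to a fused FlashAttention kernel on supported GPUs, avoiding materializing the full attention score matrix that a naive implementation builds. The shapes and function names below are illustrative assumptions, not Petals internals:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (seq, seq) score matrix -- O(seq^2) memory.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # Fused kernel; on supported GPUs this dispatches to FlashAttention
    # and never materializes the score matrix.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq, head_dim)
assert torch.allclose(naive_attention(q, k, v), fused_attention(q, k, v), atol=1e-4)
```

By contrast, PagedAttention and continuous batching are scheduler- and KV-cache-manager-level changes, so (as far as I can tell) adopting them would touch Petals' server request loop more deeply than a kernel swap like the one above.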