Description
Our current alora-vllm implementation for the openai backend assumes the vllm server is running locally on the same machine. We should consider adding support for checking alora availability on a remote vllm server. The granite_common / rag-intrinsics folks have a script for downloading and loading aloras and loras during server instantiation: https://huggingface.co/ibm-granite/rag-intrinsics-lib/blob/main/run_vllm_alora.sh.
The main obstacle here is the lack of a naming convention for these aloras. We would need to synchronize naming with that script (or perhaps allow the alora_path variable to be used for this). Then, when "loading" an alora / lora, we can just check the vllm model list to see if it's already there, rather than forcing vllm to let users load/unload aloras at runtime. A sketch of what that check could look like follows below.
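
A minimal sketch of the availability check, not the current implementation: query the remote server's OpenAI-compatible /v1/models endpoint (which lists the base model and any adapters the server was started with) and verify the adapter name appears there. The base URL, adapter name, and helper function below are placeholders, assuming whatever naming convention we settle on.

```python
from openai import OpenAI


def alora_is_served(client: OpenAI, adapter_name: str) -> bool:
    """Return True if `adapter_name` appears in the server's model list."""
    served_ids = {model.id for model in client.models.list()}
    return adapter_name in served_ids


# Point the client at the remote vLLM server instead of localhost.
client = OpenAI(
    base_url="http://remote-vllm-host:8000/v1",  # placeholder host/port
    api_key="EMPTY",  # vLLM ignores the key unless --api-key was set
)

# Placeholder adapter name; the real name depends on the agreed convention.
if not alora_is_served(client, "granite-3.3-8b-alora-uncertainty"):
    raise RuntimeError(
        "aLoRA adapter not found on the remote vLLM server; it must be "
        "loaded at server startup (e.g. via run_vllm_alora.sh)."
    )
```

With a check like this, "loading" an alora on the openai backend reduces to confirming the adapter is already registered server-side, which avoids needing runtime load/unload support in vllm.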