
Commit 2aba2d3

update readme to include meta llama3
1 parent 2436204 commit 2aba2d3

File tree

1 file changed: +11 -23 lines changed


README.md

Lines changed: 11 additions & 23 deletions
@@ -5,16 +5,14 @@
 [![Discord](https://dcbadge.vercel.app/api/server/PKe5gvBZfn)](https://discord.gg/PKe5gvBZfn)


-> **Warning**<br />
-> ScaleLLM is currently in the active development stage and may not yet provide the optimal level of inference efficiency. We are fully dedicated to continuously enhancing its efficiency while also adding more features.
-
+ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. We appreciate your understanding and look forward to delivering an even better solution.

-In the coming weeks, we have exciting plans to focus on [**_speculative decoding_**](https://github.com/orgs/vectorch-ai/projects/1) and [**_stateful conversation_**](https://github.com/orgs/vectorch-ai/projects/2), alongside further kernel optimizations. We appreciate your understanding and look forward to delivering an even better solution.
+Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.


## Latest News:
-* [11/2023] - First [official release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular open-source models.
-
+* [03/2024] - We've implemented several [advanced feature enhancements](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7), including support for CUDA graph, dynamic prefix cache, dynamic chunked prefill and speculative decoding.
+* [11/2023] - We're excited to announce the first release with support for popular open-source models. Check it out [here](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1).

## Table of contents

@@ -49,16 +47,6 @@ ScaleLLM is a cutting-edge inference system engineered for large language models

## Supported Models

-Please note that in order to use Yi models, you need to add `--model_type=Yi` to the command line. For example:
-```bash
-docker pull docker.io/vectorchai/scalellm:latest
-docker run -it --gpus=all --net=host --shm-size=1g \
-  -v $HOME/.cache/huggingface/hub:/models \
-  -e HF_MODEL_ID=01-ai/Yi-34B-Chat-4bits \
-  -e DEVICE=auto \
-  docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi
-```
-
| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :--------: | :-------------: | :----------: | :------: | :---------------------------:|
| Aquila | Yes | Yes | Yes | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |
@@ -69,7 +57,7 @@ docker run -it --gpus=all --net=host --shm-size=1g \
| GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
| GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|
| InternLM | Yes | Yes | Yes | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
-| Llama2 | Yes | Yes | Yes | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
+| Llama3/2 | Yes | Yes | Yes | [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
| Mistral | Yes | Yes | Yes | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| MPT | Yes | Yes | Yes | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |
| Phi2 | Yes | Yes | No | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
@@ -101,7 +89,7 @@ Once you have Docker installed, you can run ScaleLLM Docker container with [late
docker pull docker.io/vectorchai/scalellm:latest
docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
-  -e HF_MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ \
+  -e HF_MODEL_ID=meta-llama/Meta-Llama-3-8B \
  -e DEVICE=cuda:0 \
  docker.io/vectorchai/scalellm:latest --logtostderr
```
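
Once the container above is running, a quick sanity check is to list the served models through the OpenAI-compatible endpoint used elsewhere in this README. A minimal sketch, assuming the default port 8080 shown in the later examples and the pre-1.0 `openai` Python package:

```python
# Minimal sketch: confirm the ScaleLLM container started above is serving.
# Assumes the OpenAI-compatible endpoint on localhost:8080 used later in
# this README and the pre-1.0 `openai` Python package.
import openai

openai.api_key = "none"  # placeholder; assumed to be ignored by the local server
openai.api_base = "http://localhost:8080/v1"

models = openai.Model.list()
print(models)  # should list meta-llama/Meta-Llama-3-8B once the model is loaded
```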
@@ -167,7 +155,7 @@ Using Docker Compose is the easiest way to run ScaleLLM with all the services to

```bash
curl https://raw.githubusercontent.com/vectorch-ai/ScaleLLM/main/scalellm.yml -sSf > scalellm_compose.yml
-HF_MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ DEVICE=cuda docker compose -f ./scalellm_compose.yml up
+HF_MODEL_ID=meta-llama/Meta-Llama-3-8B DEVICE=cuda docker compose -f ./scalellm_compose.yml up
```

you will get following running services:
@@ -185,7 +173,7 @@ You can get chat completions with the following example:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "TheBloke/Llama-2-7B-chat-AWQ",
+    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {
        "role": "system",
@@ -210,7 +198,7 @@ openai.api_base = "http://localhost:8080/v1"
print("==== Available models ====")
models = openai.Model.list()

-model = "TheBloke/Llama-2-7B-chat-AWQ"
+model = "meta-llama/Meta-Llama-3-8B"

completion = openai.ChatCompletion.create(
    model=model,
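
The hunk above only shows the lines around the changed model name. Pieced together, the README's pre-1.0 `openai` chat example presumably looks roughly like this sketch (the `api_key` value and the message list are assumptions, not copied from the file):

```python
# Sketch of the full chat-completion example this hunk is taken from.
# api_base, Model.list(), ChatCompletion.create, and the model name appear
# in the diff; api_key and the messages are assumed for illustration.
import openai

openai.api_key = "none"  # placeholder; assumed to be ignored by the local server
openai.api_base = "http://localhost:8080/v1"

print("==== Available models ====")
models = openai.Model.list()

model = "meta-llama/Meta-Llama-3-8B"

completion = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": "Hello!"}],  # placeholder prompt
)
print(completion.choices[0].message.content)
```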
@@ -237,7 +225,7 @@ For regular completions, you can use this example:
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "TheBloke/Llama-2-7B-chat-AWQ",
+    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
@@ -256,7 +244,7 @@ openai.api_base = "http://localhost:8080/v1"
print("==== Available models ====")
models = openai.Model.list()

-model = "TheBloke/Llama-2-7B-chat-AWQ"
+model = "meta-llama/Meta-Llama-3-8B"

completion = openai.Completion.create(
    model=model,
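
And the corresponding pre-1.0 `openai` client call, completed as a sketch (the prompt and sampling options mirror the curl fragment above; the `api_key` value is an assumption):

```python
# Sketch of the full completion example this hunk is taken from.
# The model name and client calls appear in the diff; prompt, max_tokens, and
# temperature mirror the curl example above; api_key is a placeholder assumption.
import openai

openai.api_key = "none"  # placeholder; assumed to be ignored by the local server
openai.api_base = "http://localhost:8080/v1"

print("==== Available models ====")
models = openai.Model.list()

model = "meta-llama/Meta-Llama-3-8B"

completion = openai.Completion.create(
    model=model,
    prompt="hello",
    max_tokens=32,
    temperature=0.7,
)
print(completion.choices[0].text)
```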
