# ScaleLLM: An efficient LLM Inference solution

[License](https://opensource.org/licenses/Apache-2.0) [Stars](https://github.com/vectorch-ai/ScaleLLM/stargazers) [Build](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml)

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It supports a wide range of popular open-source models, including [Llama3](https://github.com/meta-llama/llama3), [Gemma](https://github.com/google-deepmind/gemma), Bloom, GPT-NeoX, and more.

ScaleLLM is under active development. We are committed to continually improving its efficiency and adding new features. Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.

## News:

* [03/2024] - [Advanced feature](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7) support for CUDA graph, dynamic prefix cache (sketched below), dynamic chunked prefill, and speculative decoding.
* [11/2023] - [First release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular [open-source models](#supported-models).
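
The dynamic prefix cache noted above lets requests that share a prompt prefix (a common system prompt, for instance) reuse the KV-cache blocks already computed for that prefix, so only the new suffix needs prefill. Below is a minimal, illustrative Python sketch of the idea; the `PrefixCache` class and all of its names are hypothetical and for exposition only, not ScaleLLM's internals.

```python
# Illustrative sketch of dynamic prefix caching; not ScaleLLM's actual code.
from typing import Dict, List, Optional, Tuple


class PrefixCache:
    """Maps a tokenized prompt prefix to an opaque KV-cache handle."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int], kv_handle: str) -> None:
        """Remember the KV-cache handle computed for this exact prefix."""
        self._cache[tuple(tokens)] = kv_handle

    def longest_match(self, tokens: List[int]) -> Tuple[int, Optional[str]]:
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            handle = self._cache.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None


cache = PrefixCache()
cache.insert([1, 2, 3], "kv-for-prefix-123")  # e.g. a shared system prompt

# A new request starting with the same three tokens only needs prefill
# for the uncached suffix [4, 5].
matched, handle = cache.longest_match([1, 2, 3, 4, 5])
print(matched, handle)  # -> 3 kv-for-prefix-123
```
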
## Key Features

- **High Efficiency**: Excels in high-performance LLM inference, leveraging state-of-the-art techniques such as [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
- **Tensor Parallelism**: Utilizes tensor parallelism for efficient model execution.
- **OpenAI-compatible API**: An efficient [Golang](https://en.wikipedia.org/wiki/Go_(programming_language)) REST API server compatible with the OpenAI API (see the client sketch after this list).
- **Huggingface Models**: Seamless integration with the most popular [HF models](#supported-models), supporting safetensors.
- **Customizable**: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
- **Production Ready**: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
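
Since the REST server speaks the OpenAI API, existing OpenAI client code can simply point at a running ScaleLLM instance. Here is a hedged sketch using the official `openai` Python package (v1+); the base URL, port, and model name are assumptions, so adjust them to match your deployment.

```python
# Hypothetical client usage against a locally running ScaleLLM server.
# The endpoint (localhost:8080) and model name below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your server's OpenAI-compatible endpoint
    api_key="not-used",  # local servers typically ignore the key, but the client requires one
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```
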
## Table of contents

- [Supported Models](#supported-models)
- [Get Started](#get-started)
- [ScaleLLM server](#scalellm-server)
- [Acknowledgements](#acknowledgements)
- [License](#license)