Commit d176e87
[refactor] change tokenizer special tokens from token to token + id. (#135)
This diff also fixes:
* chat template for Llama3: strip whitespace from messages
* add special tokens for Yi
1 parent 257caa4 commit d176e87

16 files changed: +130 additions, −110 deletions
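For context before the per-file diffs: the refactor replaces bare token strings (plus a separate `special_start_id`) with explicit (token, id) pairs. The `SpecialToken` type itself is not shown on this page (it presumably lives in `tokenizer/tokenizer_args.h`, the header newly included by chatglm.h below), but a minimal sketch consistent with the brace-initializers, `emplace_back(token, id)` calls, and `[token, id]` structured bindings in the diffs would be:

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical reconstruction, not code from this commit.
struct SpecialToken {
  SpecialToken(std::string token, int32_t id)
      : token(std::move(token)), id(id) {}  // enables emplace_back(token, id)

  std::string token;  // literal text, e.g. "<|im_start|>"
  int32_t id;         // explicit vocabulary id, e.g. 151644
};
```

Binding the id to the token removes the assumption that special tokens occupy one contiguous range starting at `special_start_id`, which does not hold for Yi, whose special tokens sit at ids 0, 1, 2, 6, 7, and 8.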

README.md

Lines changed: 16 additions & 23 deletions

@@ -1,22 +1,29 @@
 # ScaleLLM: An efficient LLM Inference solution
-[![build and test](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml) [![GitHub Repo stars](https://img.shields.io/github/stars/vectorch-ai/ScaleLLM?style=social)](https://github.com/vectorch-ai/ScaleLLM/stargazers)
-[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![GitHub Repo stars](https://img.shields.io/github/stars/vectorch-ai/ScaleLLM?style=social)](https://github.com/vectorch-ai/ScaleLLM/stargazers) [![build and test](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml)
+
 
 [![Discord](https://dcbadge.vercel.app/api/server/PKe5gvBZfn)](https://discord.gg/PKe5gvBZfn)
 
+[ScaleLLM]() is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including [Llama3](https://github.com/meta-llama/llama3), [Gemma](https://github.com/google-deepmind/gemma), Bloom, GPT-NeoX, and more.
 
-ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. We appreciate your understanding and look forward to delivering an even better solution.
+ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.
 
-Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.
 
+## News:
+* [03/2024] - [Advanced feature](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7) support for CUDA graph, [dynamic prefix cache](), [dynamic chunked prefill]() and [speculative decoding]().
+* [11/2023] - [First release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular [open-source models](#supported-models).
 
-## Latest News:
-* [03/2024] - We've implemented several [advanced feature enhancements](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7), including support for CUDA graph, dynamic prefix cache, dynamic chunked prefill and speculative decoding.
-* [11/2023] - We're excited to announce the first release with support for popular open-source models. Check it out [here](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1).
+## Key Features
+
+- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
+- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.
+- [OpenAI-compatible API](): An efficient [golang](https://en.wikipedia.org/wiki/Go_(programming_language)) rest api server that compatible with OpenAI.
+- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.
+- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
+- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
 
 ## Table of contents
 
-- [Overview](#overview)
 - [Supported Models](#supported-models)
 - [Get Started](#get-started)
 - [ScaleLLM server](#scalellm-server)
@@ -30,21 +37,6 @@ Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM
 - [Acknowledgements](#acknowledgements)
 - [License](#license)
 
-
-## Overview
-
-ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.
-
-## Key Features
-
-- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
-- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.
-- [OpenAI-compatible API](): An efficient [golang](https://en.wikipedia.org/wiki/Go_(programming_language)) rest api server that compatible with OpenAI.
-- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.
-- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
-- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
-
-
 ## Supported Models
 
 | Models | Tensor Parallel | Quantization | Chat API | HF models examples |
@@ -53,6 +45,7 @@ ScaleLLM is a cutting-edge inference system engineered for large language models
 | Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
 | Baichuan | Yes | Yes | Yes | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) |
 | ChatGLM3 | Yes | Yes | Yes | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
+| Gemma | Yes | Yes | Yes | [google/gemma-2b](https://huggingface.co/google/gemma-2b) |
 | GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
 | GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
 | GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|

src/chat_template/common_chat_template.cpp

Lines changed: 15 additions & 8 deletions

@@ -1,6 +1,7 @@
 #include "common_chat_template.h"
 
-#include <cstdint>
+#include <absl/strings/ascii.h>
+
 #include <optional>
 #include <sstream>
 #include <string>
@@ -43,6 +44,8 @@ std::optional<std::string> Llama2ChatTemplate::get_prompt(
 }
 
 // generate prompt from ChatTemplate
+// ref to:
+// https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202
 std::optional<std::string> Llama3ChatTemplate::get_prompt(
     const std::string_view& system_message,
     const std::vector<std::string_view>& messages) const {
@@ -52,25 +55,29 @@ std::optional<std::string> Llama3ChatTemplate::get_prompt(
   }
 
   std::stringstream ss;
-  ss << "<|begin_of_text|>";
-  auto add_message = [&ss](const std::string_view& role,
-                           const std::string_view& message) {
+  auto add_header = [&ss](const std::string_view& role) {
     ss << "<|start_header_id|>" << role << "<|end_header_id|>\n\n";
-    ss << message << "<|eot_id|>";
+  };
+  auto add_message = [&ss](const std::string_view& message) {
+    // strip leading/trailing whitespaces
+    ss << absl::StripAsciiWhitespace(message) << "<|eot_id|>";
   };
 
+  ss << "<|begin_of_text|>";
   // start with system message
   if (!system_message.empty()) {
-    add_message("system", system_message);
+    add_header("system");
+    add_message(system_message);
   }
 
   // then user and assistant message pairs (u/a/u/a/u...)
  for (size_t i = 0; i < messages.size(); ++i) {
     const char* role = i % 2 == 0 ? "user" : "assistant";
-    add_message(role, messages[i]);
+    add_header(role);
+    add_message(messages[i]);
   }
   // end with assistant message
-  ss << "<|start_header_id|>assistant<|end_header_id|>\n\n";
+  add_header("assistant");
   return ss.str();
 }
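To make the template change concrete, here is an illustrative sketch (not part of this commit) that assembles the exact string the refactored `Llama3ChatTemplate::get_prompt` should now produce for a system message plus one user turn; note the user text is whitespace-stripped by the new `add_message` lambda:

```cpp
#include <iostream>
#include <string>

// Expected output of get_prompt("You are a helpful assistant.",
// {"  Hello!  "}) after this commit; assembled by hand for illustration.
int main() {
  std::string prompt = "<|begin_of_text|>";
  // system turn: header, then whitespace-stripped body, then <|eot_id|>
  prompt += "<|start_header_id|>system<|end_header_id|>\n\n";
  prompt += "You are a helpful assistant.<|eot_id|>";
  // user turn: "  Hello!  " is stripped to "Hello!" by StripAsciiWhitespace
  prompt += "<|start_header_id|>user<|end_header_id|>\n\n";
  prompt += "Hello!<|eot_id|>";
  // the template ends with an open assistant header to cue generation
  prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
  std::cout << prompt;
  return 0;
}
```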

src/model_loader/args_overrider.cpp

Lines changed: 2 additions & 4 deletions

@@ -63,9 +63,8 @@ DEFINE_string(tokenizer_type,
               "",
               "tokenizer type, e.g. sentencepiece, tiktoken");
 DEFINE_string(vocab_file, "", "vocab file name");
-DEFINE_string(special_tokens, "", "special tokens to add to the vocabulary");
+// DEFINE_string(special_tokens, "", "special tokens to add to the vocabulary");
 DEFINE_string(pattern, "", "regex pattern used by tiktok tokenizer");
-DEFINE_string(special_start_id, "", "start id for special tokens");
 DEFINE_string(prefix_tokens,
               "",
               "tokens to add to the beginning of the input sequence");
@@ -195,8 +194,7 @@ void override_args_from_gflag(ModelArgs& args,
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, tokenizer_type);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, vocab_file);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, pattern);
-  OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_start_id);
-  OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_tokens);
+  // OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_tokens);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, prefix_tokens);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, chat_template);
 }

src/model_loader/args_overrider.h

Lines changed: 1 addition & 2 deletions

@@ -43,9 +43,8 @@ DECLARE_string(true_sequential);
 // tokenizer flags
 DECLARE_string(tokenizer_type);
 DECLARE_string(vocab_file);
-DECLARE_string(special_tokens);
+// DECLARE_string(special_tokens);
 DECLARE_string(pattern);
-DECLARE_string(special_start_id);
 DECLARE_string(prefix_tokens);
 DECLARE_string(chat_template);
 

src/model_loader/model_loader.cpp

Lines changed: 4 additions & 8 deletions

@@ -170,19 +170,19 @@ std::unique_ptr<Tokenizer> HFModelLoader::tokenizer() const {
   // check if fast tokenizer exists
   const std::string tokenizer_path = model_weights_path_ + "/tokenizer.json";
   if (std::filesystem::exists(tokenizer_path)) {
+    LOG(INFO) << "Using fast tokenizer.";
     // load fast tokenizer
     return HFTokenizer::from_file(tokenizer_path);
   }
 
   // fallback to sentencepiece/tiktoken tokenizer if no fast tokenizer exists
-  LOG(WARNING)
-      << "Failed to locate tokenizer.json, falling back on slow tokenizers "
-         "instead.";
-
   if (tokenizer_args_.tokenizer_type() == "tiktoken") {
+    LOG(INFO) << "Using Tiktoken tokenizer.";
     return std::make_unique<TiktokenTokenizer>(model_weights_path_,
                                                tokenizer_args_);
   }
+
+  LOG(INFO) << "Using SentencePiece tokenizer.";
   return std::make_unique<SentencePieceTokenizer>(model_weights_path_,
                                                   tokenizer_args_);
 }
@@ -290,10 +290,6 @@ bool HFModelLoader::load_model_args(const std::string& model_weights_path) {
           << tokenizer_args_file_path;
       return false;
     }
-  } else {
-    // use default values if no tokenizer args loader exists
-    LOG(WARNING) << "Failed to find tokenizer args loader for model type "
-                 << args_.model_type();
   }
 
   // apply args override from gflag if exists
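Net effect on tokenizer selection: tokenizer.json wins, then the configured slow tokenizer. A simplified, self-contained sketch of the order (the enum and function here are illustrative stand-ins, not ScaleLLM APIs):

```cpp
#include <filesystem>
#include <string>

// Illustrative stand-ins for the real tokenizer classes/loader members.
enum class TokenizerKind { Fast, Tiktoken, SentencePiece };

TokenizerKind pick_tokenizer(const std::string& model_dir,
                             const std::string& tokenizer_type) {
  // 1. prefer the fast HF tokenizer whenever tokenizer.json exists
  if (std::filesystem::exists(model_dir + "/tokenizer.json")) {
    return TokenizerKind::Fast;
  }
  // 2. otherwise honor the registered slow tokenizer type
  if (tokenizer_type == "tiktoken") {
    return TokenizerKind::Tiktoken;
  }
  // 3. default to sentencepiece
  return TokenizerKind::SentencePiece;
}
```

The commit also swaps the WARNING about a missing tokenizer.json (and the one about a missing tokenizer-args loader) for per-branch INFO logs, since the slow-tokenizer path is a normal configuration for sentencepiece/tiktoken models rather than an error.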

src/models/huggingface/baichuan.h

Lines changed: 1 addition & 1 deletion

@@ -20,7 +20,7 @@
 
 namespace llm::hf {
 
-enum BaichuanType {
+enum class BaichuanType : uint8_t {
   Baichuan_7B,
   Baichuan2_7B,
   Baichuan_13B,

src/models/huggingface/chatglm.h

Lines changed: 12 additions & 8 deletions

@@ -14,6 +14,7 @@
 #include "models/model_args.h"
 #include "models/model_registry.h"
 #include "models/parameters.h"
+#include "tokenizer/tokenizer_args.h"
 
 // ChatGLM model compatible with huggingface weights
 
@@ -546,17 +547,20 @@ REGISTER_MODEL_ARGS(chatglm, [&] {
 // Register tokenizer args since chatglm is using sentencepiece tokenizer.
 REGISTER_TOKENIZER_ARGS(chatglm, [&] {
   SET_ARG(tokenizer_type, "sentencepiece");
-  // adapted from
-  // https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenization_chatglm.py
   SET_ARG(vocab_file, "tokenizer.model");
 
   // set special tokens
-  // clang-format off
-  const std::vector<std::string> special_tokens({
-    "[MASK]", "[gMASK]", "[sMASK]", "sop", "eop",
-    "<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"
-  });
-  // clang-format on
+  // ref to:
+  // https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenizer_config.json
+  const std::vector<SpecialToken> special_tokens({{"[MASK]", 64789},
+                                                  {"[gMASK]", 64790},
+                                                  {"[sMASK]", 64791},
+                                                  {"sop", 64792},
+                                                  {"eop", 64793},
+                                                  {"<|system|>", 64794},
+                                                  {"<|user|>", 64795},
+                                                  {"<|assistant|>", 64796},
+                                                  {"<|observation|>", 64797}});
   SET_ARG(special_tokens, special_tokens);
   SET_ARG(prefix_tokens, std::vector<std::string>({"[gMASK]", "sop"}));
 });

src/models/huggingface/llama.h

Lines changed: 17 additions & 0 deletions

@@ -441,4 +441,21 @@ REGISTER_MODEL_ARGS(llama, [&] {
   }
 });
 
+// Register tokenizer args since Yi is using sentencepiece tokenizer.
+REGISTER_TOKENIZER_ARGS(Yi, [&] {
+  SET_ARG(tokenizer_type, "sentencepiece");
+  SET_ARG(vocab_file, "tokenizer.model");
+
+  // set special tokens
+  // ref to:
+  // https://huggingface.co/01-ai/Yi-34B-Chat-4bits/blob/main/tokenizer_config.json
+  const std::vector<SpecialToken> special_tokens({{"<unk>", 0},
+                                                  {"<|startoftext|>", 1},
+                                                  {"<|endoftext|>", 2},
+                                                  {"<|im_start|>", 6},
+                                                  {"<|im_end|>", 7},
+                                                  {"<|im_sep|>", 8}});
+  SET_ARG(special_tokens, special_tokens);
+});
+
 } // namespace llm::hf

src/models/huggingface/qwen.h

Lines changed: 7 additions & 4 deletions

@@ -417,18 +417,21 @@ REGISTER_TOKENIZER_ARGS(qwen, [&] {
   SET_ARG(vocab_file, "qwen.tiktoken");
 
   // set special tokens
-  std::vector<std::string> special_tokens(
-      {"<|endoftext|>", "<|im_start|>", "<|im_end|>"});
+  std::vector<SpecialToken> special_tokens;
+  int32_t next_id = 151643;
+  special_tokens.emplace_back("<|endoftext|>", next_id++);
+  special_tokens.emplace_back("<|im_start|>", next_id++);
+  special_tokens.emplace_back("<|im_end|>", next_id++);
   for (int32_t i = 0; i < 205; ++i) {
-    special_tokens.push_back("<|extra_" + std::to_string(i) + "|>");
+    special_tokens.emplace_back("<|extra_" + std::to_string(i) + "|>",
+                                next_id++);
   }
   SET_ARG(special_tokens, special_tokens);
 
   // set regex pattern for tiktoken tokenizer.
   const std::string pattern =
       R"((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+[^\S]|\s+)";
   SET_ARG(pattern, pattern);
-  SET_ARG(special_start_id, 151643);
 });
 
 } // namespace llm::hf
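The Qwen registration reproduces the old `special_start_id` behavior explicitly: ids are still handed out contiguously starting at 151643. A quick sanity check of the resulting range (my arithmetic, not code from the commit):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int32_t first_id = 151643;  // <|endoftext|>
  const int32_t named = 3;          // <|endoftext|>, <|im_start|>, <|im_end|>
  const int32_t extras = 205;       // <|extra_0|> .. <|extra_204|>
  // last id handed out by next_id++ above
  const int32_t last_id = first_id + named + extras - 1;
  assert(last_id == 151850);        // id of <|extra_204|>
  return 0;
}
```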

src/tokenizer/sentencepiece_tokenizer.cpp

Lines changed: 11 additions & 12 deletions

@@ -37,8 +37,7 @@ SentencePieceTokenizer::SentencePieceTokenizer(const std::string_view& dir_path,
   // add special tokens and construct special token regex
   if (!args.special_tokens().empty()) {
     const auto vocab_size = sp_processor_.GetPieceSize();
-    const int32_t start_id = args.special_start_id().value_or(vocab_size);
-    load_special_tokens(args.special_tokens(), start_id);
+    load_special_tokens(args.special_tokens());
   }
 
   // construct prefix tokens
@@ -59,26 +58,26 @@ SentencePieceTokenizer::SentencePieceTokenizer(const std::string_view& dir_path,
 }
 
 void SentencePieceTokenizer::load_special_tokens(
-    const std::vector<std::string>& special_tokens,
-    int32_t start_id) {
-  int32_t next_id = start_id;
-  for (const auto& token : special_tokens) {
+    const std::vector<SpecialToken>& special_tokens) {
+  // for each special token, add to encoder and decoder
+  for (const auto& [token, id] : special_tokens) {
     if (token.empty()) {
       continue;
    }
-    if (!special_token_encoder_.try_emplace(token, next_id).second) {
-      LOG(WARNING) << "Duplicate special token: " << token;
+
+    if (!special_token_encoder_.try_emplace(token, id).second) {
+      LOG(WARNING) << "Duplicate special token: " << token << ", id: " << id;
     }
-    if (!special_token_decoder_.try_emplace(next_id, token).second) {
-      LOG(WARNING) << "Duplicate special token id: " << next_id;
+
+    if (!special_token_decoder_.try_emplace(id, token).second) {
+      LOG(WARNING) << "Duplicate special token: " << token << ", id: " << id;
    }
-    ++next_id;
   }
 
   // build special token regex
   std::vector<std::string> escaped_tokens;
   escaped_tokens.reserve(special_tokens.size());
-  for (const auto& token : special_tokens) {
+  for (const auto& [token, id] : special_tokens) {
     if (token.empty()) {
       continue;
     }