Commit d176e87
[refactor] change tokenizer special tokens from token to token + id. (#135)
This diff also fixes:
* chat template for Llama3: strip whitespace from messages
* add special tokens for Yi
1 parent 257caa4 commit d176e87

16 files changed: +130 additions, −110 deletions
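For context before the per-file diffs: the refactor replaces bare token strings (plus a separate `special_start_id`) with explicit (token, id) pairs. The `SpecialToken` type itself is not shown on this page (it presumably lives in `tokenizer/tokenizer_args.h`, the header newly included by chatglm.h below), but a minimal sketch consistent with the brace-initializers, `emplace_back(token, id)` calls, and `[token, id]` structured bindings in the diffs would be:

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical reconstruction, not code from this commit.
struct SpecialToken {
  SpecialToken(std::string token, int32_t id)
      : token(std::move(token)), id(id) {}  // enables emplace_back(token, id)

  std::string token;  // literal text, e.g. "<|im_start|>"
  int32_t id;         // explicit vocabulary id, e.g. 151644
};
```

Binding the id to the token removes the assumption that special tokens occupy one contiguous range starting at `special_start_id`, which does not hold for Yi, whose special tokens sit at ids 0, 1, 2, 6, 7, and 8.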

README.md

Lines changed: 16 additions & 23 deletions

@@ -1,22 +1,29 @@
 # ScaleLLM: An efficient LLM Inference solution
-[![build and test](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml) [![GitHub Repo stars](https://img.shields.io/github/stars/vectorch-ai/ScaleLLM?style=social)](https://github.com/vectorch-ai/ScaleLLM/stargazers)
-[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![GitHub Repo stars](https://img.shields.io/github/stars/vectorch-ai/ScaleLLM?style=social)](https://github.com/vectorch-ai/ScaleLLM/stargazers) [![build and test](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml)
+
 
 [![Discord](https://dcbadge.vercel.app/api/server/PKe5gvBZfn)](https://discord.gg/PKe5gvBZfn)
 
+[ScaleLLM]() is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including [Llama3](https://github.com/meta-llama/llama3), [Gemma](https://github.com/google-deepmind/gemma), Bloom, GPT-NeoX, and more.
 
-ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. We appreciate your understanding and look forward to delivering an even better solution.
+ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.
 
-Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.
 
+## News:
+* [03/2024] - [Advanced feature](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7) support for CUDA graph, [dynamic prefix cache](), [dynamic chunked prefill]() and [speculative decoding]().
+* [11/2023] - [First release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular [open-source models](#supported-models).
 
-## Latest News:
-* [03/2024] - We've implemented several [advanced feature enhancements](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7), including support for CUDA graph, dynamic prefix cache, dynamic chunked prefill and speculative decoding.
-* [11/2023] - We're excited to announce the first release with support for popular open-source models. Check it out [here](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1).
+## Key Features
+
+- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
+- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.
+- [OpenAI-compatible API](): An efficient [golang](https://en.wikipedia.org/wiki/Go_(programming_language)) rest api server that compatible with OpenAI.
+- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.
+- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
+- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
 
 ## Table of contents
 
-- [Overview](#overview)
 - [Supported Models](#supported-models)
 - [Get Started](#get-started)
 - [ScaleLLM server](#scalellm-server)
@@ -30,21 +37,6 @@ Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM
 - [Acknowledgements](#acknowledgements)
 - [License](#license)
 
-
-## Overview
-
-ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.
-
-## Key Features
-
-- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
-- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.
-- [OpenAI-compatible API](): An efficient [golang](https://en.wikipedia.org/wiki/Go_(programming_language)) rest api server that compatible with OpenAI.
-- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.
-- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
-- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
-
-
 ## Supported Models
 
 | Models | Tensor Parallel | Quantization | Chat API | HF models examples |
@@ -53,6 +45,7 @@ ScaleLLM is a cutting-edge inference system engineered for large language models
 | Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
 | Baichuan | Yes | Yes | Yes | [baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) |
 | ChatGLM3 | Yes | Yes | Yes | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
+| Gemma | Yes | Yes | Yes | [google/gemma-2b](https://huggingface.co/google/gemma-2b) |
 | GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
 | GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
 | GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|

src/chat_template/common_chat_template.cpp

Lines changed: 15 additions & 8 deletions

@@ -1,6 +1,7 @@
 #include "common_chat_template.h"
 
-#include <cstdint>
+#include <absl/strings/ascii.h>
+
 #include <optional>
 #include <sstream>
 #include <string>
@@ -43,6 +44,8 @@ std::optional<std::string> Llama2ChatTemplate::get_prompt(
 }
 
 // generate prompt from ChatTemplate
+// ref to:
+// https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202
 std::optional<std::string> Llama3ChatTemplate::get_prompt(
     const std::string_view& system_message,
     const std::vector<std::string_view>& messages) const {
@@ -52,25 +55,29 @@ std::optional<std::string> Llama3ChatTemplate::get_prompt(
   }
 
   std::stringstream ss;
-  ss << "<|begin_of_text|>";
-  auto add_message = [&ss](const std::string_view& role,
-                           const std::string_view& message) {
+  auto add_header = [&ss](const std::string_view& role) {
     ss << "<|start_header_id|>" << role << "<|end_header_id|>\n\n";
-    ss << message << "<|eot_id|>";
+  };
+  auto add_message = [&ss](const std::string_view& message) {
+    // strip leading/trailing whitespaces
+    ss << absl::StripAsciiWhitespace(message) << "<|eot_id|>";
   };
 
+  ss << "<|begin_of_text|>";
   // start with system message
   if (!system_message.empty()) {
-    add_message("system", system_message);
+    add_header("system");
+    add_message(system_message);
   }
 
   // then user and assistant message pairs (u/a/u/a/u...)
  for (size_t i = 0; i < messages.size(); ++i) {
     const char* role = i % 2 == 0 ? "user" : "assistant";
-    add_message(role, messages[i]);
+    add_header(role);
+    add_message(messages[i]);
   }
   // end with assistant message
-  ss << "<|start_header_id|>assistant<|end_header_id|>\n\n";
+  add_header("assistant");
   return ss.str();
 }
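To make the template change concrete, here is an illustrative sketch (not part of this commit) that assembles the exact string the refactored `Llama3ChatTemplate::get_prompt` should now produce for a system message plus one user turn; note the user text is whitespace-stripped by the new `add_message` lambda:

```cpp
#include <iostream>
#include <string>

// Expected output of get_prompt("You are a helpful assistant.",
// {"  Hello!  "}) after this commit; assembled by hand for illustration.
int main() {
  std::string prompt = "<|begin_of_text|>";
  // system turn: header, then whitespace-stripped body, then <|eot_id|>
  prompt += "<|start_header_id|>system<|end_header_id|>\n\n";
  prompt += "You are a helpful assistant.<|eot_id|>";
  // user turn: "  Hello!  " is stripped to "Hello!" by StripAsciiWhitespace
  prompt += "<|start_header_id|>user<|end_header_id|>\n\n";
  prompt += "Hello!<|eot_id|>";
  // the template ends with an open assistant header to cue generation
  prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
  std::cout << prompt;
  return 0;
}
```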

src/model_loader/args_overrider.cpp

Lines changed: 2 additions & 4 deletions

@@ -63,9 +63,8 @@ DEFINE_string(tokenizer_type,
               "",
               "tokenizer type, e.g. sentencepiece, tiktoken");
 DEFINE_string(vocab_file, "", "vocab file name");
-DEFINE_string(special_tokens, "", "special tokens to add to the vocabulary");
+// DEFINE_string(special_tokens, "", "special tokens to add to the vocabulary");
 DEFINE_string(pattern, "", "regex pattern used by tiktok tokenizer");
-DEFINE_string(special_start_id, "", "start id for special tokens");
 DEFINE_string(prefix_tokens,
               "",
               "tokens to add to the beginning of the input sequence");
@@ -195,8 +194,7 @@ void override_args_from_gflag(ModelArgs& args,
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, tokenizer_type);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, vocab_file);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, pattern);
-  OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_start_id);
-  OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_tokens);
+  // OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, special_tokens);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, prefix_tokens);
   OVERRIDE_ARG_FROM_GFLAG(tokenizer_args, chat_template);
 }

src/model_loader/args_overrider.h

Lines changed: 1 addition & 2 deletions

@@ -43,9 +43,8 @@ DECLARE_string(true_sequential);
 // tokenizer flags
 DECLARE_string(tokenizer_type);
 DECLARE_string(vocab_file);
-DECLARE_string(special_tokens);
+// DECLARE_string(special_tokens);
 DECLARE_string(pattern);
-DECLARE_string(special_start_id);
 DECLARE_string(prefix_tokens);
 DECLARE_string(chat_template);
 

src/model_loader/model_loader.cpp

Lines changed: 4 additions & 8 deletions

@@ -170,19 +170,19 @@ std::unique_ptr<Tokenizer> HFModelLoader::tokenizer() const {
   // check if fast tokenizer exists
   const std::string tokenizer_path = model_weights_path_ + "/tokenizer.json";
   if (std::filesystem::exists(tokenizer_path)) {
+    LOG(INFO) << "Using fast tokenizer.";
     // load fast tokenizer
     return HFTokenizer::from_file(tokenizer_path);
   }
 
   // fallback to sentencepiece/tiktoken tokenizer if no fast tokenizer exists
-  LOG(WARNING)
-      << "Failed to locate tokenizer.json, falling back on slow tokenizers "
-         "instead.";
-
   if (tokenizer_args_.tokenizer_type() == "tiktoken") {
+    LOG(INFO) << "Using Tiktoken tokenizer.";
     return std::make_unique<TiktokenTokenizer>(model_weights_path_,
                                                tokenizer_args_);
   }
+
+  LOG(INFO) << "Using SentencePiece tokenizer.";
   return std::make_unique<SentencePieceTokenizer>(model_weights_path_,
                                                   tokenizer_args_);
 }
@@ -290,10 +290,6 @@ bool HFModelLoader::load_model_args(const std::string& model_weights_path) {
           << tokenizer_args_file_path;
       return false;
     }
-  } else {
-    // use default values if no tokenizer args loader exists
-    LOG(WARNING) << "Failed to find tokenizer args loader for model type "
-                 << args_.model_type();
   }
 
   // apply args override from gflag if exists
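Net effect on tokenizer selection: tokenizer.json wins, then the configured slow tokenizer. A simplified, self-contained sketch of the order (the enum and function here are illustrative stand-ins, not ScaleLLM APIs):

```cpp
#include <filesystem>
#include <string>

// Illustrative stand-ins for the real tokenizer classes/loader members.
enum class TokenizerKind { Fast, Tiktoken, SentencePiece };

TokenizerKind pick_tokenizer(const std::string& model_dir,
                             const std::string& tokenizer_type) {
  // 1. prefer the fast HF tokenizer whenever tokenizer.json exists
  if (std::filesystem::exists(model_dir + "/tokenizer.json")) {
    return TokenizerKind::Fast;
  }
  // 2. otherwise honor the registered slow tokenizer type
  if (tokenizer_type == "tiktoken") {
    return TokenizerKind::Tiktoken;
  }
  // 3. default to sentencepiece
  return TokenizerKind::SentencePiece;
}
```

The commit also swaps the WARNING about a missing tokenizer.json (and the one about a missing tokenizer-args loader) for per-branch INFO logs, since the slow-tokenizer path is a normal configuration for sentencepiece/tiktoken models rather than an error.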

src/models/huggingface/baichuan.h

Lines changed: 1 addition & 1 deletion

@@ -20,7 +20,7 @@
 
 namespace llm::hf {
 
-enum BaichuanType {
+enum class BaichuanType : uint8_t {
   Baichuan_7B,
   Baichuan2_7B,
   Baichuan_13B,

src/models/huggingface/chatglm.h

Lines changed: 12 additions & 8 deletions

@@ -14,6 +14,7 @@
 #include "models/model_args.h"
 #include "models/model_registry.h"
 #include "models/parameters.h"
+#include "tokenizer/tokenizer_args.h"
 
 // ChatGLM model compatible with huggingface weights
 
@@ -546,17 +547,20 @@ REGISTER_MODEL_ARGS(chatglm, [&] {
 // Register tokenizer args since chatglm is using sentencepiece tokenizer.
 REGISTER_TOKENIZER_ARGS(chatglm, [&] {
   SET_ARG(tokenizer_type, "sentencepiece");
-  // adapted from
-  // https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenization_chatglm.py
   SET_ARG(vocab_file, "tokenizer.model");
 
   // set special tokens
-  // clang-format off
-  const std::vector<std::string> special_tokens({
-    "[MASK]", "[gMASK]", "[sMASK]", "sop", "eop",
-    "<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"
-  });
-  // clang-format on
+  // ref to:
+  // https://huggingface.co/THUDM/chatglm3-6b/blob/main/tokenizer_config.json
+  const std::vector<SpecialToken> special_tokens({{"[MASK]", 64789},
+                                                  {"[gMASK]", 64790},
+                                                  {"[sMASK]", 64791},
+                                                  {"sop", 64792},
+                                                  {"eop", 64793},
+                                                  {"<|system|>", 64794},
+                                                  {"<|user|>", 64795},
+                                                  {"<|assistant|>", 64796},
+                                                  {"<|observation|>", 64797}});
   SET_ARG(special_tokens, special_tokens);
   SET_ARG(prefix_tokens, std::vector<std::string>({"[gMASK]", "sop"}));
 });

src/models/huggingface/llama.h

Lines changed: 17 additions & 0 deletions

@@ -441,4 +441,21 @@ REGISTER_MODEL_ARGS(llama, [&] {
   }
 });
 
+// Register tokenizer args since Yi is using sentencepiece tokenizer.
+REGISTER_TOKENIZER_ARGS(Yi, [&] {
+  SET_ARG(tokenizer_type, "sentencepiece");
+  SET_ARG(vocab_file, "tokenizer.model");
+
+  // set special tokens
+  // ref to:
+  // https://huggingface.co/01-ai/Yi-34B-Chat-4bits/blob/main/tokenizer_config.json
+  const std::vector<SpecialToken> special_tokens({{"<unk>", 0},
+                                                  {"<|startoftext|>", 1},
+                                                  {"<|endoftext|>", 2},
+                                                  {"<|im_start|>", 6},
+                                                  {"<|im_end|>", 7},
+                                                  {"<|im_sep|>", 8}});
+  SET_ARG(special_tokens, special_tokens);
+});
+
 } // namespace llm::hf

src/models/huggingface/qwen.h

Lines changed: 7 additions & 4 deletions

@@ -417,18 +417,21 @@ REGISTER_TOKENIZER_ARGS(qwen, [&] {
   SET_ARG(vocab_file, "qwen.tiktoken");
 
   // set special tokens
-  std::vector<std::string> special_tokens(
-      {"<|endoftext|>", "<|im_start|>", "<|im_end|>"});
+  std::vector<SpecialToken> special_tokens;
+  int32_t next_id = 151643;
+  special_tokens.emplace_back("<|endoftext|>", next_id++);
+  special_tokens.emplace_back("<|im_start|>", next_id++);
+  special_tokens.emplace_back("<|im_end|>", next_id++);
   for (int32_t i = 0; i < 205; ++i) {
-    special_tokens.push_back("<|extra_" + std::to_string(i) + "|>");
+    special_tokens.emplace_back("<|extra_" + std::to_string(i) + "|>",
+                                next_id++);
   }
   SET_ARG(special_tokens, special_tokens);
 
   // set regex pattern for tiktoken tokenizer.
   const std::string pattern =
       R"((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+[^\S]|\s+)";
   SET_ARG(pattern, pattern);
-  SET_ARG(special_start_id, 151643);
 });
 
 } // namespace llm::hf
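The Qwen registration reproduces the old `special_start_id` behavior explicitly: ids are still handed out contiguously starting at 151643. A quick sanity check of the resulting range (my arithmetic, not code from the commit):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int32_t first_id = 151643;  // <|endoftext|>
  const int32_t named = 3;          // <|endoftext|>, <|im_start|>, <|im_end|>
  const int32_t extras = 205;       // <|extra_0|> .. <|extra_204|>
  // last id handed out by next_id++ above
  const int32_t last_id = first_id + named + extras - 1;
  assert(last_id == 151850);        // id of <|extra_204|>
  return 0;
}
```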

src/tokenizer/sentencepiece_tokenizer.cpp

Lines changed: 11 additions & 12 deletions

@@ -37,8 +37,7 @@ SentencePieceTokenizer::SentencePieceTokenizer(const std::string_view& dir_path,
   // add special tokens and construct special token regex
   if (!args.special_tokens().empty()) {
     const auto vocab_size = sp_processor_.GetPieceSize();
-    const int32_t start_id = args.special_start_id().value_or(vocab_size);
-    load_special_tokens(args.special_tokens(), start_id);
+    load_special_tokens(args.special_tokens());
   }
 
   // construct prefix tokens
@@ -59,26 +58,26 @@ SentencePieceTokenizer::SentencePieceTokenizer(const std::string_view& dir_path,
 }
 
 void SentencePieceTokenizer::load_special_tokens(
-    const std::vector<std::string>& special_tokens,
-    int32_t start_id) {
-  int32_t next_id = start_id;
-  for (const auto& token : special_tokens) {
+    const std::vector<SpecialToken>& special_tokens) {
+  // for each special token, add to encoder and decoder
+  for (const auto& [token, id] : special_tokens) {
     if (token.empty()) {
       continue;
    }
-    if (!special_token_encoder_.try_emplace(token, next_id).second) {
-      LOG(WARNING) << "Duplicate special token: " << token;
+
+    if (!special_token_encoder_.try_emplace(token, id).second) {
+      LOG(WARNING) << "Duplicate special token: " << token << ", id: " << id;
     }
-    if (!special_token_decoder_.try_emplace(next_id, token).second) {
-      LOG(WARNING) << "Duplicate special token id: " << next_id;
+
+    if (!special_token_decoder_.try_emplace(id, token).second) {
+      LOG(WARNING) << "Duplicate special token: " << token << ", id: " << id;
    }
-    ++next_id;
   }
 
   // build special token regex
   std::vector<std::string> escaped_tokens;
   escaped_tokens.reserve(special_tokens.size());
-  for (const auto& token : special_tokens) {
+  for (const auto& [token, id] : special_tokens) {
     if (token.empty()) {
       continue;
     }