On average it is about ~4x faster, since the short-cuts usually pay off.

## Benchmarks

We ran several benchmarks to compare performance between the different encoders and the tiktoken library:

- The first measures encoding runtime for our different encoders and the tiktoken Rust implementation.
  This shows a ~3.5x performance increase for our fastest correct encoder compared to the tiktoken library.
- The second measures incremental encoding runtime, where the text is built up byte-by-byte.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.
- The third measures interval counting runtime, where the token count for slices of an original text is determined.
  After the initial tokenization of the text, token counting for slices is typically constant time.
  This mode is not available in tiktoken, which only supports counting/encoding a complete text.

All benchmarks were run on a MacBook Pro M1.

### Encoding

Encoding is computing the tokens for a given text.
This benchmark uses several encoders:

- The backtracking encoder uses a backtracking algorithm based on a string matching automaton.
- The heap encoder uses a priority heap to implement the traditional BPE algorithm (a simplified version is sketched after this list).
- The table encoder uses a dynamic programming algorithm.
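
For intuition, here is a simplified, self-contained version of the traditional BPE algorithm that the heap encoder speeds up. This is an illustrative sketch, not the crate's implementation, and it returns merged byte sequences rather than token ids:

```rust
use std::collections::HashMap;

// Simplified traditional BPE: repeatedly merge the adjacent pair with the
// best (lowest) merge rank until no adjacent pair forms a known token.
fn encode_traditional(ranks: &HashMap<Vec<u8>, u32>, input: &[u8]) -> Vec<Vec<u8>> {
    // Start from single-byte tokens.
    let mut parts: Vec<Vec<u8>> = input.iter().map(|&b| vec![b]).collect();
    loop {
        // Scan all adjacent pairs for the lowest-rank merge. This linear
        // rescan is the step the heap encoder replaces with a priority heap.
        let best = (0..parts.len().saturating_sub(1))
            .filter_map(|i| {
                let merged = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
                ranks.get(&merged).map(|&rank| (rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                // Merge parts[i] and parts[i + 1] into a single token.
                let right = parts.remove(i + 1);
                parts[i].extend(right);
            }
            None => return parts,
        }
    }
}
```

Rescanning all pairs after every merge is what makes this naive version slow; keeping the candidate merges in a priority heap avoids the rescans.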

Two additional encoders are included that are faster but do not always give exact results:

- The greedy encoder uses a left-to-right greedy algorithm (see the sketch after this list).
- The minimal encoder computes an encoding with the minimal number of tokens.
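
A minimal, self-contained sketch of the greedy strategy follows. It is illustrative only: the crate's encoder matches tokens with a string matching automaton rather than a hash map, and the assumption that every single byte is itself a token is ours:

```rust
use std::collections::HashMap;

// Left-to-right greedy encoding: at every position, take the longest token
// that is a prefix of the remaining input.
fn encode_greedy(tokens: &HashMap<Vec<u8>, u32>, input: &[u8]) -> Vec<u32> {
    let max_len = tokens.keys().map(|t| t.len()).max().unwrap_or(0);
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        // Try the longest candidate first; we assume every single byte is a
        // token, so the loop always makes progress.
        let len = (1..=max_len.min(input.len() - pos))
            .rev()
            .find(|&l| tokens.contains_key(&input[pos..pos + l]))
            .expect("every byte is assumed to be a token");
        out.push(tokens[&input[pos..pos + len]]);
        pos += len;
    }
    out
}
```

Greedy matching never revisits a decision, which makes it fast but also explains why its output can differ from the exact BPE tokenization.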

The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)

The graph below shows encoding runtime vs. slice length.
All encoders show similar runtime increases with increasing slice length.
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
The fully dynamic programming solution and the heap implementation are still quite competitive with tiktoken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the greedy approach or the minimal encoding approach are the clear winners.
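
For illustration, the idea behind the minimal encoder can be written as a short dynamic program over prefix lengths. This is a simplified, self-contained sketch, not the crate's table encoder:

```rust
use std::collections::HashMap;

// Minimal-token encoding as dynamic programming: best[i] is the fewest
// tokens needed to cover input[..i]; back[i] records the length of the
// token ending at position i in an optimal solution.
fn encode_minimal(tokens: &HashMap<Vec<u8>, u32>, input: &[u8]) -> Option<Vec<u32>> {
    let n = input.len();
    let max_len = tokens.keys().map(|t| t.len()).max().unwrap_or(0);
    let mut best = vec![usize::MAX; n + 1];
    let mut back = vec![0usize; n + 1];
    best[0] = 0;
    for i in 1..=n {
        for l in 1..=max_len.min(i) {
            if best[i - l] != usize::MAX
                && tokens.contains_key(&input[i - l..i])
                && best[i - l] + 1 < best[i]
            {
                best[i] = best[i - l] + 1;
                back[i] = l;
            }
        }
    }
    if best[n] == usize::MAX {
        return None; // the input cannot be covered by the token set
    }
    // Walk the back-pointers to reconstruct the token sequence.
    let mut out = Vec::new();
    let mut i = n;
    while i > 0 {
        let l = back[i];
        out.push(tokens[&input[i - l..i]]);
        i -= l;
    }
    out.reverse();
    Some(out)
}
```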

### Incremental encoding

Incremental encoding tokenizes a text to which bytes are appended.
This benchmark uses two encoders:

- The backtracking encoder, which retokenizes the text from scratch every time it changes.
- The appending encoder, which supports incremental encoding when bytes are added.

The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
The backtracking encoder encoded the final text in one go.
The appending encoder received the text bytes one by one.

The graph below shows encoding runtime vs. slice length.
Runtime of both encoders grows similarly with slice length.
The incremental encoder shows a constant factor overhead.
Note that this is still a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.
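
To see where the quadratic slowdown comes from, consider retokenizing from scratch after every appended byte, reusing the illustrative `encode_greedy` sketch from the encoding section as a stand-in encoder:

```rust
use std::collections::HashMap;

// Re-encoding the whole prefix does work proportional to i at step i, i.e.
// quadratic work overall, while an appending encoder only pays a constant
// factor per appended byte.
fn token_counts_per_append(tokens: &HashMap<Vec<u8>, u32>, text: &[u8]) -> Vec<usize> {
    (1..=text.len())
        .map(|i| encode_greedy(tokens, &text[..i]).len())
        .collect()
}
```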