crates/bpe/README.md
## Novel Algorithm
At first glance, it seems impossible to achieve `O(n)` complexity while preserving the encoding output of the original BPE algorithm, since the original BPE algorithm needs to first scan the full input before it can make any encoding decision.
For instance, the sequence `abac` would be encoded as `ab ac` when the dictionary contains the tokens `a b c ab cb ac` ordered by frequency. But appending a single character `abacb` would result in a pretty different tokenization: `ab a cb`. So without looking ahead it seems impossible to properly tokenize the text.
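This behavior can be reproduced with a toy sketch of the original greedy BPE algorithm (an illustrative implementation, not this crate's actual code; the position of a token in `ranked` stands in for its merge priority):

```rust
// Toy greedy BPE: repeatedly merge the adjacent pair whose concatenation
// is the best-ranked (most frequent) token in the dictionary.
fn bpe_encode(text: &str, ranked: &[&str]) -> Vec<String> {
    let mut toks: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair whose concatenation has the best (lowest) rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..toks.len().saturating_sub(1) {
            let pair = format!("{}{}", toks[i], toks[i + 1]);
            if let Some(rank) = ranked.iter().position(|t| *t == pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        let Some((_, i)) = best else { break };
        let merged = toks.remove(i + 1);
        toks[i].push_str(&merged);
    }
    toks
}

fn main() {
    let ranked = ["a", "b", "c", "ab", "cb", "ac"];
    assert_eq!(bpe_encode("abac", &ranked), ["ab", "ac"]);
    // Appending one character changes the tokens chosen earlier in the text:
    assert_eq!(bpe_encode("abacb", &ranked), ["ab", "a", "cb"]);
}
```

Note how the final `b` causes the earlier `ac` merge to be replaced by `cb`, which is exactly the lookahead problem described above.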
The solution is to track the encodings of ALL text prefixes. For our example `abacb` we would get:
`a` ------> `a`
`ab` -----> `ab`
`aba` ----> `ab a`
`abac` ---> `ab ac`
`abacb` --> `ab a cb`
This can be done much more efficiently thanks to Corollary IIa, since now only the last token of every prefix has to be remembered:
`a` ------> `a`
`ab` -----> `ab`
`aba` ----> `a`
`abac` ---> `ac`
`abacb` --> `cb`
In order to reconstruct the full encoding for a specific prefix, one simply starts with the last token of that prefix, removes that token from the end of the prefix, looks up the last token of the shortened prefix, and repeats until the beginning of the text is reached.
For our example prefix `abacb`, this procedure executes the following steps and determines the correct encoding in reverse order:
`abacb` -> `cb`
`aba` ---> `a`
`ab` ----> `ab`
`<empty>`
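Assuming the per-prefix last tokens are kept in a plain array (an illustrative layout, not necessarily the crate's data structure), this backward walk can be sketched as:

```rust
// `last[i - 1]` holds the last token of the encoding of the prefix of length `i`.
// Walk backwards from the end, peeling off one token per step.
fn reconstruct<'a>(mut end: usize, last: &[&'a str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    while end > 0 {
        let tok = last[end - 1]; // last token of the prefix of length `end`
        out.push(tok);
        end -= tok.len(); // shorten the prefix by the extracted token
    }
    out.reverse(); // tokens were collected back-to-front
    out
}

fn main() {
    // Last tokens of the prefixes of `abacb`, as listed above.
    let last = ["a", "ab", "a", "ac", "cb"];
    assert_eq!(reconstruct(5, &last), ["ab", "a", "cb"]);
}
```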
The actual challenge is to determine this last token efficiently for every prefix.
The prefix `abac` could for instance end with either the token `c` or `ac`, but only `ac` leads to a valid encoding sequence.
But Corollary IIa tells us that **one and only one** last token can be the correct one, and Corollary IIIa shows us how to find it:
We only have to check whether a possible next token is "compatible" with its previous token, i.e. whether the two tokens form a valid encoding sequence.
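One way to sketch this (as a criterion, not the crate's constant-time implementation) is to call a pair of tokens compatible when retokenizing their concatenation reproduces exactly that pair, and then accept, for each prefix, the candidate last token that is compatible with the last token of the prefix it leaves behind. The toy greedy encoder below exists only to make the sketch self-contained:

```rust
// Toy greedy BPE encoder; `ranked` orders tokens from most to least frequent.
fn bpe_encode(text: &str, ranked: &[&str]) -> Vec<String> {
    let mut toks: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..toks.len().saturating_sub(1) {
            let pair = format!("{}{}", toks[i], toks[i + 1]);
            if let Some(rank) = ranked.iter().position(|t| *t == pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        let Some((_, i)) = best else { break };
        let merged = toks.remove(i + 1);
        toks[i].push_str(&merged);
    }
    toks
}

// Two tokens form a valid encoding sequence iff retokenizing their
// concatenation yields exactly that pair. (Illustrative O(n) check; the
// crate replaces this with constant-time machinery.)
fn is_compatible(prev: &str, next: &str, ranked: &[&str]) -> bool {
    bpe_encode(&format!("{prev}{next}"), ranked) == [prev, next]
}

// Compute the last token of every prefix in one left-to-right pass.
fn last_tokens<'a>(text: &str, ranked: &[&'a str]) -> Vec<&'a str> {
    let mut last: Vec<&'a str> = Vec::with_capacity(text.len());
    for i in 1..=text.len() {
        let prefix = &text[..i];
        let mut found: Option<&'a str> = None;
        for &tok in ranked {
            if !prefix.ends_with(tok) {
                continue;
            }
            let start = i - tok.len();
            // Accept `tok` iff it is compatible with the last token of the
            // prefix it leaves behind (or it reaches the text's beginning).
            if start == 0 || is_compatible(last[start - 1], tok, ranked) {
                found = Some(tok);
                break;
            }
        }
        last.push(found.expect("every valid prefix has a last token"));
    }
    last
}

fn main() {
    let ranked = ["a", "b", "c", "ab", "cb", "ac"];
    // For the prefix `abac`: a final `c` would follow the `a` of `ab a`,
    // but `a c` re-merges to `ac`, so `c` is rejected; `ac` follows `ab`.
    assert!(!is_compatible("a", "c", &ranked));
    assert!(is_compatible("ab", "ac", &ranked));
    // The pass reproduces the last-token table from the example above.
    assert_eq!(last_tokens("abacb", &ranked), ["a", "ab", "a", "ac", "cb"]);
}
```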