Commit 7a42a23

whalebot-helmsman authored and kmike committed
add description for punctuation removing (#42)
* add description for punctuation removing
* pep8 comments
* pepier comments
1 parent c016135 commit 7a42a23

1 file changed: +8 -0 lines changed

webstruct/text_tokenizers.py

Lines changed: 8 additions & 0 deletions
@@ -110,6 +110,14 @@ class DefaultTokenizer(WordTokenizer):
     def tokenize(self, text):
         tokens = super(DefaultTokenizer, self).tokenize(text)
         # remove standalone commas and semicolons
+        # as they broke tag sets, e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
+
+        # but it has negative consequences, e.g.
+        # etalon:    [PER-B, PER-I, FUNC-B]
+        # predicted: [PER-B, PER-I, PER-I ]
+        # because we removed punctuation
+
+        # FIXME: remove as token, but save as feature left/right_punct:","
         return [t for t in tokens if t not in {',', ';'}]
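The FIXME above suggests dropping standalone commas and semicolons as tokens while keeping them as features of the neighbouring tokens. A minimal sketch of that idea follows. It is not part of this commit or of webstruct's API: `split_tokens` is a naive stand-in for the real tokenizer, and `tokens_with_punct_features` and the `left_punct`/`right_punct` keys are hypothetical names borrowed from the wording of the FIXME.

```python
import re

# Punctuation that is dropped as a token but remembered as a feature.
PUNCT = {',', ';'}


def split_tokens(text):
    # Naive stand-in tokenizer (an assumption, not webstruct's WordTokenizer):
    # runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def tokens_with_punct_features(tokens):
    """Drop standalone ',' and ';' but record them on adjacent tokens.

    Returns (kept_tokens, features), where features[i] is a dict with the
    hypothetical keys 'left_punct' and 'right_punct' for kept_tokens[i].
    """
    kept = []
    features = []
    pending_left = ''  # punctuation seen since the last kept token
    for tok in tokens:
        if tok in PUNCT:
            if features:
                # Attach to the right side of the previous kept token.
                features[-1]['right_punct'] += tok
            pending_left += tok
            continue
        kept.append(tok)
        features.append({'left_punct': pending_left, 'right_punct': ''})
        pending_left = ''
    return kept, features


kept, feats = tokens_with_punct_features(split_tokens("John Smith, director"))
```

With features like these, a sequence model could still see that a comma separated "Smith" from "director", which may help with the PER-I versus FUNC-B confusion described in the comment, without the comma itself becoming a token that breaks tag sets.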