Comparing our models to NLTK's, it seems like it would be better to pickle sentence tokenizer models as `nltk.tokenize.punkt.PunktSentenceTokenizer` objects, as opposed to their current type of `nltk.tokenize.punkt.PunktTrainer` objects. Cf. the language-specific files here: https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers
I've added an example of such a file here: https://github.com/cltk/latin_models_cltk/blob/master/tokenizers/sentence/latin_punkt.pickle
I think the 'trainer'-style pickle files should be deprecated and phased out; new code can refer to the 'tokenizer'-style pickle files in the short term and be refactored once the 'trainer'-style files are officially removed.
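For illustration, here is a rough sketch of how an existing 'trainer'-style pickle might be converted to a 'tokenizer'-style one. It assumes the old file unpickles to a `PunktTrainer` and relies on NLTK's `PunktSentenceTokenizer` accepting the trainer's learned parameters in place of training text. The helper names and file paths are hypothetical, not anything in the CLTK codebase.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def trainer_to_tokenizer(trainer):
    """Wrap the parameters learned by a PunktTrainer in a PunktSentenceTokenizer."""
    # get_params() finalizes training if needed and returns PunktParameters,
    # which the tokenizer constructor accepts instead of raw training text.
    return PunktSentenceTokenizer(trainer.get_params())


def convert_pickle(trainer_path, tokenizer_path):
    """Read a 'trainer'-style pickle and write a 'tokenizer'-style pickle.

    Both paths are hypothetical; substitute the real model locations.
    """
    with open(trainer_path, "rb") as f:
        trainer = pickle.load(f)
    tokenizer = trainer_to_tokenizer(trainer)
    with open(tokenizer_path, "wb") as f:
        pickle.dump(tokenizer, f)
    return tokenizer
```

With files converted this way, calling code would only need to unpickle the object and call `.tokenize(text)` on it, with no intermediate step to build a tokenizer from a trainer.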
Thoughts?