Format for sentence tokenizers? #6

@diyclassics

Description

Comparing our models to NLTK's, it seems it would be better to pickle sentence-tokenizer models as nltk.tokenize.punkt.PunktSentenceTokenizer objects rather than their current type, nltk.tokenize.punkt.PunktTrainer. Cf. the language-specific files here: https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers

I've added an example of such a file here: https://github.com/cltk/latin_models_cltk/blob/master/tokenizers/sentence/latin_punkt.pickle

I think the 'trainer'-style pickle files should be deprecated and phased out; in the short term, new code can refer to the 'tokenizer'-style pickle files, and existing code can be refactored once the trainer-style files are officially removed.
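For anyone migrating, a minimal sketch of the conversion: a PunktSentenceTokenizer accepts a PunktParameters object (from PunktTrainer.get_params()) in place of training text, so an existing trainer can be turned into a tokenizer-style pickle without retraining. The toy Latin training text and the output filename below are illustrative stand-ins, not the actual CLTK model data.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Stand-in for unpickling an existing 'trainer'-style model:
# here we just train a toy model on a few sample sentences.
trainer = PunktTrainer()
trainer.train(
    "Arma virumque cano. Troiae qui primus ab oris. "
    "Italiam fato profugus Laviniaque venit litora.",
    finalize=True,
)

# Passing the learned PunktParameters (instead of raw training
# text) to the constructor reuses those parameters directly.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Pickling this object yields a 'tokenizer'-style file, matching
# the format of the language-specific models in nltk_data.
with open("latin_punkt.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# Consumers can then unpickle and call tokenize() directly:
with open("latin_punkt.pickle", "rb") as f:
    restored = pickle.load(f)
sentences = restored.tokenize("Prima sententia hic est. Secunda sequitur.")
```

The upside of this format is that downstream code only needs pickle.load() plus a single tokenize() call, with no trainer-to-tokenizer plumbing at load time.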

Thoughts?
