Format for sentence tokenizers? #6

@diyclassics

Description

Comparing our models to NLTK's, it seems it would be better to pickle sentence-tokenizer models as nltk.tokenize.punkt.PunktSentenceTokenizer objects rather than their current type, nltk.tokenize.punkt.PunktTrainer. Cf. the language-specific files here: https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers

I've added an example of such a file here: https://github.com/cltk/latin_models_cltk/blob/master/tokenizers/sentence/latin_punkt.pickle

I think the 'trainer'-style pickle files should be deprecated and phased out; in the short term, new code can refer to the 'tokenizer'-style pickle files, and existing code can be refactored once the trainer-style files are officially removed.
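For anyone migrating, a minimal sketch of the conversion: a PunktSentenceTokenizer accepts a PunktParameters object (from PunktTrainer.get_params()) in place of training text, so an existing trainer can be turned into a tokenizer-style pickle without retraining. The toy Latin training text and the output filename below are illustrative stand-ins, not the actual CLTK model data.

```python
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Stand-in for unpickling an existing 'trainer'-style model:
# here we just train a toy model on a few sample sentences.
trainer = PunktTrainer()
trainer.train(
    "Arma virumque cano. Troiae qui primus ab oris. "
    "Italiam fato profugus Laviniaque venit litora.",
    finalize=True,
)

# Passing the learned PunktParameters (instead of raw training
# text) to the constructor reuses those parameters directly.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Pickling this object yields a 'tokenizer'-style file, matching
# the format of the language-specific models in nltk_data.
with open("latin_punkt.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# Consumers can then unpickle and call tokenize() directly:
with open("latin_punkt.pickle", "rb") as f:
    restored = pickle.load(f)
sentences = restored.tokenize("Prima sententia hic est. Secunda sequitur.")
```

The upside of this format is that downstream code only needs pickle.load() plus a single tokenize() call, with no trainer-to-tokenizer plumbing at load time.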

Thoughts?
