Description
Hi,
I am using sudachi-rs as part of a Neovim plugin to help with learning Japanese.
I am trying to find all the possible values of the part-of-speech component so that I can declare my own enums for them.
I've searched for an enum listing the lexicon types (for instance `名詞`, `助詞`, `補助記号`; I want to know what the other possible values are), but I couldn't find one here or in the Sudachi dictionary. I eventually reached https://github.com/WorksApplications/Sudachi, but there seems to be no enum there either. Is `名詞` just part of the dictionary? It seems the part of speech is just a list of freeform strings. There must be some convention, though; where can I find such a list?
My goal is to reproduce in Neovim the output of https://www3.nhk.or.jp/news/easy/ne2025073011585/ne2025073011585.html, i.e. where place names or people's names are highlighted differently.
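To make the goal concrete, here is a rough sketch of the kind of mapping I would like to write on the plugin side. It assumes the part of speech arrives as an ordered list of strings, and that proper nouns are tagged with values such as `固有名詞`, `人名` and `地名`. Those values are my guess from tokenizer output, not an authoritative list, and the `Highlight` enum and `classify` function are names I made up for illustration.

```rust
/// Coarse highlight groups for the editor (hypothetical names for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Highlight {
    PersonName,
    PlaceName,
    OtherProperNoun,
    Noun,
    Particle,
    Punctuation,
    Other,
}

/// Classify a token from its part-of-speech fields.
///
/// Assumption: `pos` holds the fields in order, most general first,
/// e.g. ["名詞", "固有名詞", "地名", "一般", "*", "*"].
/// The tag strings matched here are guesses, which is exactly why an
/// official list would help.
pub fn classify(pos: &[&str]) -> Highlight {
    match pos {
        ["名詞", "固有名詞", "人名", ..] => Highlight::PersonName,
        ["名詞", "固有名詞", "地名", ..] => Highlight::PlaceName,
        ["名詞", "固有名詞", ..] => Highlight::OtherProperNoun,
        ["名詞", ..] => Highlight::Noun,
        ["助詞", ..] => Highlight::Particle,
        ["補助記号", ..] => Highlight::Punctuation,
        // Anything not recognised falls through, so unknown tags never panic.
        _ => Highlight::Other,
    }
}
```

Without a documented list of values, everything I have not seen yet ends up in `Other`, which is why I am asking where the convention is defined.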
I wonder if the tokenizer could output JSON in addition to the current format (e.g., with `--output=json`)? It might not be good for performance, but JSON would self-document the various part-of-speech fields.
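As an illustration of what I have in mind, here is a small std-only sketch that turns one token into a JSON object. Everything in it is hypothetical: the `TokenRecord` struct, the `surface` and `pos` key names, and the helper functions are invented for this example and are not part of sudachi-rs.

```rust
/// Hypothetical token record; field names are invented for this sketch.
pub struct TokenRecord<'a> {
    pub surface: &'a str,
    pub pos: Vec<&'a str>,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one token as a single-line JSON object.
pub fn to_json(t: &TokenRecord) -> String {
    let pos: Vec<String> = t
        .pos
        .iter()
        .map(|p| format!("\"{}\"", escape(p)))
        .collect();
    format!(
        "{{\"surface\":\"{}\",\"pos\":[{}]}}",
        escape(t.surface),
        pos.join(",")
    )
}
```

One object per line would keep the output streamable; named keys for each part-of-speech slot would be even more self-documenting, but I do not know what the slots are officially called.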