Understand the part of speech structure #304

@teto

Hi,

I am using sudachi-rs as part of a Neovim plugin to help with learning Japanese.
I am trying to find all possible values of the part-of-speech component so that I can declare my own enums and so on.

I searched for an enum listing the lexical categories (for instance 名詞, 助詞, 補助記号; I want to know what the other possible values are), but I couldn't find one in this repository or in the Sudachi dictionary. Eventually I reached https://github.com/WorksApplications/Sudachi, but there seems to be no enum there either. Is 名詞 just part of the dictionary data? It looks like the part of speech is just a list of free-form strings. There must be some convention, though; where can I find such a list?
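To make the question concrete, here is a minimal sketch of the kind of enum I have in mind. It assumes that the part of speech arrives as a list of strings and that the first entry is the top-level category. The type `TopLevelPos` and the function `from_fields` are hypothetical names, not part of sudachi-rs. Only the three values quoted above are listed, and everything else falls through to `Other`, so an incomplete list never loses information.

```rust
/// Top-level part-of-speech categories, limited to the values quoted above.
/// `Other` keeps any string that is not listed, so unknown tags are preserved.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TopLevelPos {
    Noun,                // 名詞
    Particle,            // 助詞
    SupplementarySymbol, // 補助記号
    Other(String),
}

impl TopLevelPos {
    /// Classify one token from its part-of-speech fields.
    /// Assumption: the first field holds the top-level category.
    /// Returns `None` when the field list is empty.
    pub fn from_fields(fields: &[&str]) -> Option<Self> {
        let first = fields.first()?;
        Some(match *first {
            "名詞" => Self::Noun,
            "助詞" => Self::Particle,
            "補助記号" => Self::SupplementarySymbol,
            other => Self::Other(other.to_string()),
        })
    }
}
```

With an authoritative list of values, the `Other` arm could shrink to a true error case.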

My goal is to reproduce in Neovim the output of https://www3.nhk.or.jp/news/easy/ne2025073011585/ne2025073011585.html, i.e. text where place names and people's names are highlighted differently.
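A rough sketch of the classification step, under an assumption that still needs confirming: proper nouns are tagged 名詞 then 固有名詞, with a third field of 人名 for people and 地名 for places. The enum `Highlight` and the function `highlight_for` are hypothetical names used only for illustration.

```rust
/// Highlight classes a plugin might distinguish.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Highlight {
    PersonName,
    PlaceName,
    Plain,
}

/// Pick a highlight class from the part-of-speech fields of one token.
/// Assumption: the first three fields follow the 名詞 / 固有名詞 / 人名|地名
/// pattern for proper nouns; any other tag is left unhighlighted.
pub fn highlight_for(pos: &[&str]) -> Highlight {
    match pos {
        ["名詞", "固有名詞", "人名", ..] => Highlight::PersonName,
        ["名詞", "固有名詞", "地名", ..] => Highlight::PlaceName,
        _ => Highlight::Plain,
    }
}
```

Matching on a slice prefix keeps the function independent of how many trailing fields the dictionary provides.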

I wonder if the tokenizer could output JSON in addition to the current format (e.g., with --output=json)? It might not be good for performance, but JSON would self-document the various part-of-speech fields.
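As an illustration of what such output could look like, here is a sketch that converts one line of text output into a JSON object. It assumes each token line has three tab-separated columns (surface, comma-separated part of speech, normalized form). The JSON key names and the function `line_to_json` are hypothetical, not an existing sudachi-rs feature. Lines with fewer than three columns, such as an end-of-sentence marker, yield `None`.

```rust
/// Escape the characters that JSON requires escaping inside a string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Convert one assumed `surface<TAB>pos,pos,...<TAB>normalized` line to JSON.
/// Key names are hypothetical; returns `None` if a column is missing.
pub fn line_to_json(line: &str) -> Option<String> {
    let mut cols = line.split('\t');
    let surface = cols.next()?;
    let pos = cols.next()?;
    let normalized = cols.next()?;
    // Each part-of-speech field becomes one quoted JSON string.
    let pos_json: Vec<String> = pos
        .split(',')
        .map(|p| format!("\"{}\"", json_escape(p)))
        .collect();
    Some(format!(
        "{{\"surface\":\"{}\",\"part_of_speech\":[{}],\"normalized_form\":\"{}\"}}",
        json_escape(surface),
        pos_json.join(","),
        json_escape(normalized)
    ))
}
```

A built-in JSON mode could go further and give each part-of-speech field its own key, which is the self-documenting part this request is really about.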
