Add support for more Latin UD datasets #3662

Open
wants to merge 6 commits into
base: master
stefan-it (Member) commented May 5, 2025

Hi,

this PR adds support for five more Latin UD datasets: UD_LATIN_CIRCSE, UD_LATIN_ITTB, UD_LATIN_UDANTE, UD_LATIN_PERSEUS and UD_LATIN_PROIEL.

The feature request came up in #3391.

Example

The datasets can be loaded, e.g. with:

from flair.datasets import UD_LATIN_CIRCSE, UD_LATIN_ITTB, UD_LATIN_UDANTE, UD_LATIN_PERSEUS, UD_LATIN_PROIEL

# UD_LATIN_CIRCSE
corpus_circse = UD_LATIN_CIRCSE()
str(corpus_circse)
# Outputs: Corpus: 0 train + 0 dev + 1263 test sentences

# UD_LATIN_ITTB
corpus_ittb = UD_LATIN_ITTB()
str(corpus_ittb)
# Outputs: Corpus: 22775 train + 2101 dev + 2101 test sentences

# UD_LATIN_UDANTE
corpus_udante = UD_LATIN_UDANTE()
str(corpus_udante)
# Outputs: Corpus: 926 train + 376 dev + 421 test sentences

# UD_LATIN_PERSEUS
corpus_perseus = UD_LATIN_PERSEUS()
str(corpus_perseus)
# Outputs: Corpus: 1334 train + 0 dev + 939 test sentences

# UD_LATIN_PROIEL
corpus_proiel = UD_LATIN_PROIEL()
str(corpus_proiel)
# Outputs: Corpus: 16196 train + 1233 dev + 1260 test sentences

Unit tests

Unit tests were written for all newly added datasets and for the existing UD_LATIN dataset. They check that the number of sentences in each split matches the number reported in the UD stats.
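The kind of check described above could look roughly like the following. This is only a sketch, not the test code from this PR: the helper names `split_sizes` and `check_split_sizes` are hypothetical, the expected counts are copied from the stats in this description, and a `SimpleNamespace` stands in for a real Flair corpus so that the example runs without downloading any data. It assumes that a split with no sentences may be `None`.

```python
from types import SimpleNamespace

# Expected (train, dev, test) sentence counts, copied from this PR description.
EXPECTED_SPLITS = {
    "UD_LATIN": (7289, 850, 884),
    "UD_LATIN_CIRCSE": (0, 0, 1263),
    "UD_LATIN_ITTB": (22775, 2101, 2101),
    "UD_LATIN_UDANTE": (926, 376, 421),
    "UD_LATIN_PERSEUS": (1334, 0, 939),
    "UD_LATIN_PROIEL": (16196, 1233, 1260),
}


def split_sizes(corpus):
    """Return (train, dev, test) sizes, counting a missing split as 0."""
    return tuple(
        len(split) if split is not None else 0
        for split in (corpus.train, corpus.dev, corpus.test)
    )


def check_split_sizes(name, corpus):
    """Hypothetical helper: fail if the corpus sizes differ from the expected ones."""
    expected = EXPECTED_SPLITS[name]
    actual = split_sizes(corpus)
    assert actual == expected, f"{name}: expected {expected}, got {actual}"


# Stand-in for a loaded corpus (assumption: only a test split is present).
fake_circse = SimpleNamespace(train=None, dev=None, test=range(1263))
check_split_sizes("UD_LATIN_CIRCSE", fake_circse)
```

With Flair installed, the same helper should be usable as `check_split_sizes("UD_LATIN_CIRCSE", UD_LATIN_CIRCSE())`, provided the corpus exposes `train`, `dev` and `test` splits that support `len()`.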

Overall Stats

Here are some overall stats of all supported UD datasets for Latin in Flair so far:

| Dataset | # Train sentences | # Dev sentences | # Test sentences | # Total sentences |
| --- | --- | --- | --- | --- |
| UD_LATIN | 7,289 | 850 | 884 | 9,023 |
| UD_LATIN_CIRCSE | 0 | 0 | 1,263 | 1,263 |
| UD_LATIN_ITTB | 22,775 | 2,101 | 2,101 | 26,977 |
| UD_LATIN_UDANTE | 926 | 376 | 421 | 1,723 |
| UD_LATIN_PERSEUS | 1,334 | 0 | 939 | 2,273 |
| UD_LATIN_PROIEL | 16,196 | 1,233 | 1,260 | 18,689 |
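
As a quick sanity check on the stats above, the per-dataset totals can be recomputed from the split sizes. The snippet below is a small illustrative sketch; the names `ROWS`, `inconsistent_rows` and `column_sums` are made up for this example, and the numbers are copied from the table.

```python
# (train, dev, test, total) per dataset, copied from the table above.
ROWS = {
    "UD_LATIN": (7289, 850, 884, 9023),
    "UD_LATIN_CIRCSE": (0, 0, 1263, 1263),
    "UD_LATIN_ITTB": (22775, 2101, 2101, 26977),
    "UD_LATIN_UDANTE": (926, 376, 421, 1723),
    "UD_LATIN_PERSEUS": (1334, 0, 939, 2273),
    "UD_LATIN_PROIEL": (16196, 1233, 1260, 18689),
}


def inconsistent_rows(rows):
    """Return the datasets whose train + dev + test does not equal the total."""
    return [
        name
        for name, (train, dev, test, total) in rows.items()
        if train + dev + test != total
    ]


def column_sums(rows):
    """Sum each column (train, dev, test, total) over all datasets."""
    return tuple(sum(column) for column in zip(*rows.values()))


print("Inconsistent rows:", inconsistent_rows(ROWS))
print("Column sums (train, dev, test, total):", column_sums(ROWS))
```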