@yanirmr commented Sep 19, 2025

Issue #, if available:

Description of changes:
This PR adds several new Hebrew language dataset entries to the Registry of Open Data on AWS, provided by the ivrit.ai project. These datasets are made available for the purpose of AI research and model training, with licensing tailored to permit such uses.

Datasets included:

  • ivrit-ai Crowd-Transcribe v5: Large crowd-sourced Hebrew speech dataset for ASR and language technology development.

  • ivrit-ai Hebrew Audio v2: Curated Hebrew audio corpus designed to support open-source ASR research.

  • ivrit-ai Knesset Plenums: Aligned audio and transcriptions from Israeli Knesset (parliament) plenary proceedings, supporting research on parliamentary speech and automatic speech recognition (ASR).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Added ivrit-ai Hebrew Audio v2 dataset with details about its composition, licensing, and resources.
The dataset includes aligned Hebrew speech and transcriptions from Israeli Knesset sessions, supporting research in political discourse and ASR.
Updated tags to include more specific terms related to natural language processing and speech recognition.