Add ivrit.ai open Hebrew speech and language datasets (crowd-transcribe-v5, audio-v2, knesset-plenums) #2868

yanirmr · 2025-09-19T11:51:18Z

Issue #, if available:

Description of changes:
This PR adds three Hebrew speech dataset entries, provided by the ivrit.ai project, to the Registry of Open Data on AWS. The datasets are released for AI research and model training under licensing that permits those uses.

Datasets included:

ivrit-ai Crowd-Transcribe v5: Large crowd-sourced Hebrew speech dataset for ASR and language technology development.
ivrit-ai Hebrew Audio v2: Curated Hebrew audio corpus designed to support open-source ASR research.
ivrit-ai Knesset Plenums: Aligned audio and transcriptions from Israeli Knesset (parliament) plenary proceedings, supporting parliamentary speech research and automatic speech recognition.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


yanirmr added 7 commits September 19, 2025 14:44

Create ivrit-ai-crowdtranscribe.yaml

da85243

Add ivrit-ai Hebrew Audio v2 dataset

7937203

Added ivrit-ai Hebrew Audio v2 dataset with details about its composition, licensing, and resources.

Add ivrit-ai Knesset Plenums dataset

a213dc9

The dataset includes aligned Hebrew speech and transcriptions from Israeli Knesset sessions, supporting research in political discourse and ASR.

Merge branch 'awslabs:main' into main

5971b73

Update ivrit-ai-crowdtranscribe.yaml

f8f6570

Refine dataset tags for clarity and specificity

9ebb1fa

Updated tags to include more specific terms related to natural language processing and speech recognition.

Add ivrit-ai-audio-v2.yaml dataset file

365f5b6
