This repository provides diarization labels for the AMI and AliMeeting datasets, obtained via forced alignment, which were used in our ASRU 2025 paper "Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?". The labels were created with the Montreal Forced Aligner (MFA) using pretrained models.
Note that the provided labels are not perfect due to alignment errors, transcription ambiguity, out-of-vocabulary words, etc. Nevertheless, labels obtained via forced alignment are commonly regarded as ground truth in VAD research (e.g., [Kraljevski+, Interspeech 2025] and [Tan+, Computer Speech & Language 2020]).
## AMI

- The original dataset can be downloaded from the official website.
- The train/val/eval split follows the full-corpus ASR partition.
- We used the `only_words` setup, which keeps only the intervals annotated as words and discards vocal sounds. Please refer to BUTSpeechFIT/AMI-diarization-setup for more details.
- For forced alignment, we first used English (US) MFA G2P model v3.0.0 to generate pronunciations for OOV words, and then used English MFA acoustic model v3.1.0 and English (US) MFA dictionary v3.1.0 for alignment.
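As a rough illustration, the two-step procedure above (G2P for OOV words, then alignment with a pretrained acoustic model and dictionary) can be sketched with the MFA command-line interface. This is a hypothetical sketch, not the exact commands used for this repository: the corpus and output paths are placeholders, and the same pattern applies to AliMeeting with the corresponding Mandarin models.

```shell
# Download the pretrained models named in this README
# (pin the versions listed above when reproducing).
mfa model download g2p english_us_mfa
mfa model download acoustic english_mfa
mfa model download dictionary english_us_mfa

# Step 1: generate pronunciations for OOV words with the G2P model.
# "corpus/ami" and "oov_pronunciations.txt" are placeholder paths.
mfa g2p corpus/ami english_us_mfa oov_pronunciations.txt

# Step 2: align the corpus using the dictionary (extended with the
# G2P output) and the pretrained acoustic model.
mfa align corpus/ami english_us_mfa english_mfa alignments/ami
```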
## AliMeeting

- The original dataset can be downloaded from OpenSLR.
- For forced alignment, we first used Mandarin (China) MFA G2P model v3.0.0 to generate pronunciations for OOV words, and then used Mandarin MFA acoustic model v3.0.0 and Mandarin (China) MFA dictionary v3.0.0 for alignment.
## Citation

If you use our annotations, please cite our paper below.
```bibtex
@inproceedings{horiguchi_asru2025,
  author    = {Horiguchi, Shota and Tawara, Naohiro and Ashihara, Takanori and Ando, Atsushi and Delcroix, Marc},
  title     = {Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year      = {2025},
  month     = {Dec},
}
```