
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Zifeng Zhao1, Rongzhi Gu1, Dongchao Yang1, Jinchuan Tian1, Yuexian Zou1, 2
1 Peking University
2 Peng Cheng Laboratory

Introduction

This is a demo for our paper Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction. In the following, we compare the performance of supervised training with that of the proposed weakly supervised training (SAMoM for short).

Block Diagram of the Proposed SAMoM Training

Demo 1: Performance on Libri2Mix [2]

Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training

Demo 2: Cross-domain Evaluation [3]

Mixture
Baseline: w/o Domain Adaptation
Ours: w/ Domain Adaptation
Mixture
Baseline: w/o Domain Adaptation
Ours: w/ Domain Adaptation
Mixture
Baseline: w/o Domain Adaptation
Ours: w/ Domain Adaptation
Mixture
Baseline: w/o Domain Adaptation
Ours: w/ Domain Adaptation

Demo 3: Noisy Scenario [2][4]

Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training
Mixture
Baseline: Supervised Training
Ours: Weakly Supervised Training

[Paper] [Bibtex] [Demo GitHub]

News

References

[1] M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, “Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 691–695.
[2] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
[3] H. Bu, J. Du, X. Na, B. Wu, and H. Zhang, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5.
[4] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending speech separation to noisy environments,” in Interspeech, 2019, pp. 1368–1372.