no code implementations • 1 Apr 2024 • Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons.
no code implementations • 24 Feb 2024 • Duo Ma, Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
In this paper, we investigate a new way to pre-train such a joint speech-text model to learn enhanced speech representations and benefit various speech-related downstream tasks.
no code implementations • 18 Nov 2022 • Xiaoxue Gao, Xianghu Yue, Haizhou Li
The current lyrics transcription approaches heavily rely on supervised learning with labeled data, but such data are scarce and manual labeling of singing is expensive.
no code implementations • 30 Oct 2022 • Xianghu Yue, Junyi Ao, Xiaoxue Gao, Haizhou Li
Because speech is continuous while text is discrete, we first discretize speech into a sequence of discrete speech tokens to resolve the modality mismatch.
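As a rough illustration of what discretizing speech into tokens can mean, the sketch below maps each continuous frame vector to the index of its nearest entry in a fixed codebook. The excerpt does not say how the tokens are obtained, so the nearest-centroid lookup, the function name `quantize`, and the toy codebook are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous frame to the index of its nearest codebook vector.

    Hypothetical helper: the codebook is assumed to be given (for example,
    learned beforehand by clustering); the source does not specify this.
    """
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook vector.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code))
                 for code in codebook]
        # The token is the position of the closest codebook vector.
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy 2-dimensional "speech features" and a 3-entry codebook (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.05, 0.9]]
tokens = quantize(frames, codebook)
print(tokens)
```

The result is a sequence of integers, which has the same discrete form as text tokens and can therefore be fed to a model alongside them.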
no code implementations • 15 Jul 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics intelligibility.
no code implementations • 7 Apr 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
Lyrics transcription of polyphonic music is challenging not only because the singing vocals are corrupted by the background music, but also because the background music and the singing style vary across music genres, such as pop, metal, and hip hop, which affects lyrics intelligibility of the song in different ways.
1 code implementation • 7 Apr 2022 • Xiaoxue Gao, Chitralekha Gupta, Haizhou Li
To improve the robustness of lyrics transcription to the background music, we propose a strategy of combining the features that emphasize the singing vocals, i.e., music-removed features that represent the extracted singing vocals, and the features that capture the singing vocals as well as the background music, i.e., music-present features.
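One simple way to picture combining the two feature streams is frame-wise concatenation, sketched below. The excerpt does not state how the features are combined, so concatenation, the function name `combine_features`, and the toy values are assumptions for illustration only.

```python
from typing import List, Sequence


def combine_features(music_removed: Sequence[Sequence[float]],
                     music_present: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate two feature streams frame by frame.

    Hypothetical helper: assumes both streams are time-aligned and have
    the same number of frames.
    """
    if len(music_removed) != len(music_present):
        raise ValueError("feature streams must have the same number of frames")
    # Each output frame holds the music-removed features followed by
    # the music-present features for the same time step.
    return [list(a) + list(b) for a, b in zip(music_removed, music_present)]


# Two frames per stream, with made-up 2-dimensional and 3-dimensional features.
vocal_feats = [[0.1, 0.2], [0.3, 0.4]]
mix_feats = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
combined = combine_features(vocal_feats, mix_feats)
print(combined)
```

Each combined frame carries both views of the signal, so a downstream acoustic model can draw on whichever is more informative for a given frame.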