A Benchmark for Voice-Face Cross-Modal Matching and Retrieval

1 Jan 2021 · Chuyuan Xiong, Deyuan Zhang, Tao Liu, Xiaoyong Du, Jiankun Tian, Songyan Xue

Cross-modal associations between a person's voice and face can be learned algorithmically, a capability that is useful in many audio and visual applications. The problem comprises two tasks: voice-face matching and voice-face retrieval. The topic has recently attracted considerable research attention, but it remains at an early stage of development: evaluation protocols and test schemes are not yet standardized, performance metrics for the individual subtasks are scarce, and no benchmark has been established. This paper proposes a baseline evaluation framework for voice-face matching and retrieval. Test confidence is analyzed, and a confidence interval for the estimated accuracy is proposed. The baseline method included in the framework, called TriNet, achieves state-of-the-art performance with high test confidence on a series of subtasks. The source code will be published along with the paper. The results of this study can provide a basis for future research on voice-face cross-modal learning.
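
To illustrate what a confidence interval for an estimated accuracy can look like, the following minimal sketch computes a Wilson score interval for a matching test with a given number of correct trials. This is a standard binomial interval chosen for illustration; the abstract does not specify the interval the paper proposes, so it should not be read as the authors' method. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (here, test accuracy).

    Assumption: each test trial is an independent success/failure outcome.
    z = 1.96 corresponds to roughly 95% two-sided confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")

    p_hat = correct / total  # point estimate of accuracy
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts: 840 correct out of 1000 matching trials.
    low, high = wilson_interval(840, 1000)
    print(f"accuracy = {840 / 1000:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of test trials grows, which is one way to see why the size of the test set matters when comparing reported accuracies.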
