no code implementations • 21 Mar 2024 • Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command.
Automatic Speech Recognition (ASR) +4
no code implementations • 6 Dec 2023 • Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs), while using only a fraction of the training data.
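The equal-error-rate mentioned above is the operating point at which the false-accept and false-reject rates coincide. A minimal sketch of how an EER can be computed from detection scores (the scores and labels below are made-up toy data, not from the paper):

```python
# Hedged sketch: computing an equal-error-rate (EER) for a binary detector
# by sweeping thresholds until the false-accept rate (FAR) meets the
# false-reject rate (FRR).

def eer(scores, labels):
    """Return the EER for binary detection scores.

    scores: higher = more likely positive; labels: 1 = positive class.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best = (1.0, 1.0)  # (|FAR - FRR|, EER estimate)
    for t in sorted(scores):
        far = sum(s >= t for s in neg) / len(neg)   # false accepts
        frr = sum(s < t for s in pos) / len(pos)    # false rejects
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   1,   0,   0,   0]
print(eer(scores, labels))  # one positive below / one negative above 0.6 -> 0.25
```

A lower EER means the detector separates the two classes better at its balanced operating point, which is why it is a natural single-number comparison between unimodal and multimodal systems.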
no code implementations • 21 Oct 2022 • Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik
Accurate prediction of the user intent to interact with a voice assistant (VA) on a device (e.g., on the phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach to predict the user's intent (the user speaking to the device or not) directly from acoustic and textual information encoded at subword tokens which are obtained via an end-to-end ASR model.
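One way to read the description above is as a late fusion of per-token acoustic and textual embeddings into a single device-directedness score. The sketch below is an illustrative assumption (dimensions, mean pooling, and the linear scorer are all placeholders, not the paper's architecture):

```python
import numpy as np

# Hedged sketch: per-subword-token acoustic and textual embeddings (as would
# come from an end-to-end ASR model) are concatenated per token, mean-pooled
# over the utterance, and scored by a linear classifier.

def intent_score(acoustic_tokens, textual_tokens, w):
    """Return P(user is addressing the device) from (T, Da) and (T, Dt) embeddings."""
    fused = np.concatenate([acoustic_tokens, textual_tokens], axis=1)  # (T, Da+Dt)
    pooled = fused.mean(axis=0)                                        # utterance vector
    logit = w @ pooled
    return 1.0 / (1.0 + np.exp(-logit))                                # sigmoid output
```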
no code implementations • 5 Apr 2022 • Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik
A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task.
no code implementations • 30 Mar 2022 • Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik
We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings a further accuracy improvement of 20%.
no code implementations • 8 Feb 2022 • Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou
By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embeddings that are competitive with existing audio representation models in only a few hours of training time.
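The alignment described above can be sketched as an InfoNCE-style contrastive objective: each audio embedding is pulled toward the pretrained text embedding of the same utterance and pushed away from the other utterances in the batch. Shapes, the temperature value, and the symmetric formulation are illustrative assumptions, not CALM's actual implementation:

```python
import numpy as np

# Hedged sketch of contrastive audio-text alignment: matching (audio, text)
# pairs sit on the diagonal of a batch similarity matrix and are treated as
# the targets of a softmax over the batch, in both directions.

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired (audio, text) embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (batch, batch) cosine similarities
    log_sm_rows = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(0, keepdims=True))
    n = len(logits)
    # Diagonal entries are the matching pairs; average both directions.
    return -(np.trace(log_sm_rows) + np.trace(log_sm_cols)) / (2 * n)
```

The loss is minimized when each audio embedding is most similar to its own utterance's text embedding, which is the sense in which the audio space is "aligned" to the pretrained language space.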
no code implementations • 13 Jan 2021 • Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan
2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard or whispered speech is used to pre-train this system, the SV model can be used as a style encoder for generating different style embeddings as input for the Tacotron system.
no code implementations • 29 Oct 2020 • Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, Vineet Garg
We first demonstrate that by including more audio context after a detected trigger phrase, we can indeed get a more accurate decision.
no code implementations • 20 Oct 2020 • Pranay Dighe, Erik Marchi, Srikanth Vishnubhotla, Sachin Kajarekar, Devang Naik
But in the case of a false trigger, transcribing the audio with ASR is itself strongly undesirable.
Automatic Speech Recognition (ASR) +2
no code implementations • 25 Apr 2020 • Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz
One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications.
no code implementations • 10 Apr 2020 • Soumi Maiti, Erik Marchi, Alistair Conkie
We demonstrate that a bilingual speaker embedding space contains a separate distribution for each language, and that a simple transform in the speaker embedding space can be used to control the degree of accent of a synthetic voice in a language.
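A minimal sketch of the "simple transform" idea: shift a speaker's embedding along the vector between the two per-language embedding means, with a scalar controlling how far to move. The embeddings below are synthetic toy vectors; the real system learns them from bilingual speech:

```python
import numpy as np

# Hedged sketch: accent control as interpolation along the direction between
# language-specific cluster means in speaker-embedding space.

def accent_transform(speaker_emb, mean_lang_a, mean_lang_b, alpha):
    """Move an embedding a fraction `alpha` along the A->B language direction.

    alpha = 0.0 keeps the native accent; alpha = 1.0 applies the full shift.
    """
    direction = mean_lang_b - mean_lang_a
    return speaker_emb + alpha * direction
```

Intermediate values of `alpha` would correspond to intermediate degrees of accent, which is the "control" the abstract refers to.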
no code implementations • 31 Jan 2020 • Vasudha Kowtha, Vikramjit Mitra, Chris Bartels, Erik Marchi, Sue Booker, William Caruso, Sachin Kajarekar, Devang Naik
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
no code implementations • 26 Jan 2020 • Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle
We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch of the network is trained to label the input sequence with the correct label for the speaker.
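The multi-task setup above amounts to one shared encoder feeding two branches, a phonetic branch trained with a CTC loss and a speaker branch trained with a classification loss, combined as a weighted sum. In the sketch below both branch losses are simplified stand-ins (plain cross-entropy in place of a real CTC implementation), and the weighting is an illustrative assumption:

```python
import numpy as np

# Hedged sketch of a two-branch multi-task loss. Cross-entropy stands in for
# the CTC loss of the transcription branch; the speaker branch is an ordinary
# classification loss over speaker labels.

def cross_entropy(logits, target):
    """Cross-entropy of a single softmax distribution against a target index."""
    z = logits - logits.max()                 # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def multitask_loss(phone_logits, phone_target, spk_logits, spk_target, w=0.5):
    """Weighted sum of the two branch losses (CTC replaced by a stand-in)."""
    phonetic = cross_entropy(phone_logits, phone_target)   # stand-in for CTC
    speaker = cross_entropy(spk_logits, spk_target)
    return w * phonetic + (1 - w) * speaker
```

Training both branches against a shared encoder is what lets one set of features serve both the "what was said" and "who said it" objectives.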
no code implementations • 28 Jun 2019 • Vikramjit Mitra, Sue Booker, Erik Marchi, David Scott Farrar, Ute Dorothea Peitz, Bridget Cheng, Ermine Teves, Anuj Mehta, Devang Naik
The expectation is that such assistants should understand the intent of the user's query.
no code implementations • LREC 2016 • Simone Hantke, Erik Marchi, Björn Schuller
Crowdsourcing is an emerging collaborative approach applicable, among many other areas, to language and speech processing.
no code implementations • 22 Nov 2015 • Irman Abdić, Lex Fridman, Erik Marchi, Daniel E. Brown, William Angell, Bryan Reimer, Björn Schuller
We introduce a recurrent neural network architecture for automated road surface wetness detection from audio of tire-surface interaction.
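The core mechanism described above, a recurrent network consuming a sequence of audio feature frames and emitting a wet/dry decision, can be sketched as follows. The weights are random placeholders and the plain tanh recurrence is an illustrative simplification (the paper's architecture is a trained recurrent network over real tire-surface recordings):

```python
import numpy as np

# Hedged sketch: a simple tanh RNN rolls over (T, D) audio feature frames
# (e.g. spectral features of tire-road noise); a sigmoid over the final
# hidden state gives P(wet road).

def rnn_classify(frames, W_in, W_rec, w_out):
    """Run a tanh RNN over (T, D) feature frames; return P(wet)."""
    h = np.zeros(W_rec.shape[0])
    for x in frames:
        h = np.tanh(W_in @ x + W_rec @ h)   # recurrent state update
    logit = w_out @ h
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid output
```

The recurrence is what lets the classifier accumulate evidence across the whole audio clip rather than deciding from a single frame.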
no code implementations • 24 Mar 2014 • Björn Schuller, Erik Marchi, Simon Baron-Cohen, Helen O'Reilly, Delia Pigat, Peter Robinson, Ian Daves
Individuals with Autism Spectrum Conditions (ASC) have marked difficulties using verbal and non-verbal communication for social interaction.