1 code implementation • 20 Dec 2023 • Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha
Specifically, we first perform vanilla continued pre-training of an initial SSL pre-trained model on the target-domain ASR dataset and call the resulting model the teacher.
Automatic Speech Recognition (ASR)
1 code implementation • 20 Dec 2023 • Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha
Continued pre-training (CP) offers multiple advantages, like target domain adaptation and the potential to exploit the continuous stream of unlabeled data available online.
no code implementations • 12 Oct 2023 • Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs.
no code implementations • CVPR 2023 • Ashish Seth, Mayur Hemani, Chirag Agarwal
These biases manifest as the skewed similarity between the representations for specific text concepts and images of people of different identity groups and, therefore, limit the usefulness of such models in real-world high-stakes applications.
1 code implementation • 10 Mar 2023 • Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha
Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step.
1 code implementation • 2 Nov 2022 • Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha
We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification.
1 code implementation • 2 Nov 2022 • Sreyan Ghosh, Ashish Seth, S. Umesh, Dinesh Manocha
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
no code implementations • 1 Nov 2022 • Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M, Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti Sharma, Hema Murthy, Pushpak Bhattacharya, S Umesh, Rajeev Sangal
Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video.
no code implementations • 31 Mar 2022 • Ashish Seth, Lodagala V S V Durga Prasad, Sreyan Ghosh, S. Umesh
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings.
Automatic Speech Recognition (ASR)
1 code implementation • 25 Mar 2022 • Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh
Inspired by the recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach.
1 code implementation • 17 Oct 2021 • Sreyan Ghosh, Sandesh V Katta, Ashish Seth, S. Umesh
We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations.
no code implementations • 2 Jun 2021 • Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema Murthy
In the first system, the E2E model is trained on the CLS representation, and we use a novel data-driven back-end to recover the native language script.
Automatic Speech Recognition (ASR)