no code implementations • ECCV 2020 • Sangmin Lee, Jung Uk Kim, Hak Gu Kim, Seongyeop Kim, Yong Man Ro
In this paper, we propose a novel symptom-aware cybersickness assessment network (SACA Net) that quantifies physical symptom levels for assessing cybersickness of individual viewers.
no code implementations • 30 Apr 2024 • Sungjune Park, Hyunjun Kim, Yong Man Ro
Therefore, in this paper, we propose a novel approach to construct a versatile pedestrian knowledge bank containing representative pedestrian knowledge that can be applied to various detection frameworks and adopted in diverse scenes.
no code implementations • 22 Mar 2024 • Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro
Specifically, we generate text descriptions of the pedestrian in each of the RGB and thermal modalities and design Multispectral Chain-of-Thought (MSCoT) prompting, which models a step-by-step process to facilitate cross-modal reasoning at the semantic level and perform accurate detection.
1 code implementation • 20 Mar 2024 • Junho Kim, Yeon Ju Kim, Yong Man Ro
This paper presents a way of enhancing the reliability of Large Multimodal Models (LMMs) in addressing hallucination effects, where models generate incorrect or unrelated responses.
1 code implementation • 12 Mar 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models.
Ranked #27 on Visual Question Answering on MM-Vet
no code implementations • 7 Mar 2024 • Seunghee Han, Se Jin Park, Chae Won Kim, Yong Man Ro
We devise completeness loss and consistency loss based on semantic similarity scores.
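The paper does not spell out the loss formulas in this excerpt, but losses built on semantic similarity scores are commonly implemented with cosine similarity. A minimal illustrative sketch under that assumption (function names are hypothetical, not from the paper):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def completeness_loss(summary_emb, source_embs):
    # Encourage the summary embedding to cover every source segment:
    # penalize the least-similar source embedding.
    sims = [cosine_sim(summary_emb, s) for s in source_embs]
    return 1.0 - min(sims)

def consistency_loss(emb_a, emb_b):
    # Encourage two related embeddings to agree semantically.
    return 1.0 - cosine_sim(emb_a, emb_b)
```

Both losses are zero when the embeddings are perfectly aligned and grow toward 2 as they become anti-correlated.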
1 code implementation • 2 Mar 2024 • Taeheon Kim, Sebin Shin, Youngjoon Yu, Hak Gu Kim, Yong Man Ro
As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data.
1 code implementation • 25 Feb 2024 • Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro
We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text.
1 code implementation • 23 Feb 2024 • Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.
Ranked #4 on Lipreading on LRS3-TED (using extra training data)
1 code implementation • 17 Feb 2024 • Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.
Ranked #35 on Visual Question Answering on MM-Vet
no code implementations • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro
By using the visual speech units as the inputs of our system, we pre-train the model to predict corresponding text outputs on massive multilingual data constructed by merging several VSR databases.
1 code implementation • 5 Dec 2023 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A.
1 code implementation • 2 Nov 2023 • Sungjune Park, Hyunjun Kim, Yong Man Ro
The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector.
1 code implementation • 11 Oct 2023 • Junho Kim, Byung-Kwan Lee, Yong Man Ro
Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations.
Ranked #1 on Unsupervised Semantic Segmentation on COCO-Stuff-81
no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
To this end, we begin by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.
no code implementations • 15 Sep 2023 • Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro
Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention.
no code implementations • 23 Aug 2023 • Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro
We contribute a new large-scale 3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in identity, pose, and facial motion of 3D face meshes.
no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
To mitigate this challenge, we learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units.
no code implementations • 15 Aug 2023 • Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements.
1 code implementation • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro
In doing so, rich speaker embedding information can be produced solely from the input visual information, and no extra audio information is necessary at inference time.
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
1 code implementation • ICCV 2023 • Byung-Kwan Lee, Junho Kim, Yong Man Ro
Adversarial examples, derived from deliberately crafted perturbations of visual inputs, can easily harm the decision process of deep neural networks.
no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.
no code implementations • 27 Jun 2023 • Hong Joo Lee, Yong Man Ro
With the class-wise robust features, the model explicitly learns adversarially robust features through the proposed robust proxy learning framework.
no code implementations • 27 Jun 2023 • Hong Joo Lee, Youngjoon Yu, Yong Man Ro
Different from the previous approaches, in this paper, we propose a new approach to improve the adversarial robustness by using an external signal rather than model parameters.
1 code implementation • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro
Therefore, the proposed L2S model is trained to generate multiple targets: mel-spectrogram and speech units.
no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.
no code implementations • 8 May 2023 • Jeong Hun Yeo, Minsu Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements.
Automatic Speech Recognition (ASR)
1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, the audio and the visual inputs, compared to uni-modal models.
1 code implementation • CVPR 2023 • Junho Kim, Byung-Kwan Lee, Yong Man Ro
The origin of adversarial examples is still unexplained, and it has prompted arguments from various viewpoints despite comprehensive investigations.
no code implementations • 27 Feb 2023 • Minsu Kim, Chae Won Kim, Yong Man Ro
The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio.
Automatic Speech Recognition (ASR)
3 code implementations • 17 Feb 2023 • Minsu Kim, Joanna Hong, Yong Man Ro
To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss.
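Multi-task objectives like this are typically a weighted sum of per-task losses. A minimal sketch, assuming L2 reconstruction terms for the acoustic and audio targets and cross-entropy for the text target (weights and names are hypothetical, not taken from the paper):

```python
import numpy as np

def multitask_loss(recon_pred, recon_target, text_logits, text_label,
                   audio_pred, audio_target, w_text=1.0, w_audio=1.0):
    # Acoustic-feature reconstruction loss (L2).
    l_recon = float(np.mean((recon_pred - recon_target) ** 2))
    # Text supervision: cross-entropy over a softmax of the logits.
    probs = np.exp(text_logits - text_logits.max())
    probs /= probs.sum()
    l_text = float(-np.log(probs[text_label]))
    # Audio supervision: L2 on an auxiliary audio prediction.
    l_audio = float(np.mean((audio_pred - audio_target) ** 2))
    # Weighted sum of the three task losses.
    return l_recon + w_text * l_text + w_audio * l_audio
```

The weights `w_text` and `w_audio` trade off how strongly the auxiliary text and audio supervision shapes the shared representation.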
no code implementations • 16 Feb 2023 • Minsu Kim, Hyung-Il Kim, Yong Man Ro
Since it focuses on visual information to model speech, its performance is inherently sensitive to personal lip appearance and movements, which makes VSR models show degraded performance when applied to unseen speakers.
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
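Audio-addressed retrieval of stored visual features is usually implemented as soft attention over a key-value memory. A minimal sketch of that retrieval step, assuming dot-product addressing (the function name and temperature parameter are hypothetical):

```python
import numpy as np

def retrieve_from_memory(audio_query, key_memory, value_memory, temperature=1.0):
    # Address the memory with the audio query: softmax over key
    # similarities, then return the weighted sum of stored values.
    scores = key_memory @ audio_query / temperature
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ value_memory
```

At inference time, only the audio query is needed; the lip-motion values written during training come back as the attention-weighted readout.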
no code implementations • 21 Oct 2022 • Minsu Kim, Youngjoon Yu, Sungjune Park, Yong Man Ro
The proposed meta input can be optimized with a small number of testing data only by considering the relation between testing input data and its output prediction.
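One way to read this is as learning a small additive transformation of the input while the model itself stays frozen. A toy sketch under that assumption, using finite-difference gradients so it runs on any black-box model (all names are hypothetical, not the paper's implementation):

```python
import numpy as np

def optimize_meta_input(model, xs, ys, steps=200, lr=0.1, eps=1e-4):
    # Learn a single additive "meta input" delta that lowers the
    # model's squared error on a handful of test samples; the model's
    # own parameters are never touched.
    delta = np.zeros_like(xs[0])

    def loss(d):
        return float(np.mean([(model(x + d) - y) ** 2 for x, y in zip(xs, ys)]))

    for _ in range(steps):
        grad = np.zeros_like(delta)
        for i in range(delta.size):
            e = np.zeros_like(delta)
            e[i] = eps
            # Central finite-difference estimate of the gradient.
            grad[i] = (loss(delta + e) - loss(delta - e)) / (2 * eps)
        delta -= lr * grad
    return delta
```

Because only the input offset is optimized, a few testing samples suffice and there is no risk of overwriting the pre-trained weights.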
no code implementations • 15 Sep 2022 • Hyung-Il Kim, Kimin Yun, Yong Man Ro
This is mainly attributed to the mismatch between training and testing sets.
1 code implementation • 9 Aug 2022 • Minsu Kim, Hyunjun Kim, Yong Man Ro
In this paper, to remedy the performance degradation of lip reading model on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding.
1 code implementation • 13 Jul 2022 • Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
The enhanced audio features are fused with the visual features and taken to an encoder-decoder model composed of Conformer and Transformer for speech recognition.
no code implementations • 15 Jun 2022 • Joanna Hong, Minsu Kim, Yong Man Ro
Thus, the proposed framework brings the advantage of synthesizing the speech containing the right content even with the silent talking face video of an unseen subject.
no code implementations • 27 Apr 2022 • Youngjoon Yu, Hong Joo Lee, Hakmin Lee, Yong Man Ro
Person detection has attracted great attention in the computer vision area and is an imperative element in human-centric computer vision.
1 code implementation • NeurIPS 2021 • Junho Kim, Byung-Kwan Lee, Yong Man Ro
Adversarial examples, generated by carefully crafted perturbation, have attracted considerable attention in research fields.
1 code implementation • CVPR 2022 • Byung-Kwan Lee, Junho Kim, Yong Man Ro
Adversarial examples provoke weak reliability and potential security issues in deep neural networks.
1 code implementation • ICCV 2021 • Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro
By learning the interrelationship through the associative bridge, the proposed bridging framework is able to obtain the target modal representations inside the memory network, even with the source modal input only, and it provides rich information for its downstream tasks.
Ranked #3 on Lipreading on CAS-VSR-W1k (LRW-1000)
1 code implementation • The AAAI Conference on Artificial Intelligence (AAAI) 2022 • Minsu Kim, Jeong Hun Yeo, Yong Man Ro
With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement can represent.
Ranked #2 on Lipreading on CAS-VSR-W1k (LRW-1000)
1 code implementation • NeurIPS 2021 • Minsu Kim, Joanna Hong, Yong Man Ro
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis.
no code implementations • CVPR 2022 • Sangmin Lee, Hyung-Il Kim, Yong Man Ro
Existing sound and image representation learning methods necessarily require a large number of corresponding sound-image pairs.
1 code implementation • IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021 • Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro
Our key contributions are: (1) proposing the Visual Voice memory that brings rich information of audio that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and unseen speaker training by memorizing auditory features and the corresponding visual features.
no code implementations • 14 Apr 2021 • Hak Gu Kim, Minho Park, Sangmin Lee, Seongyeop Kim, Yong Man Ro
For a human expert, the depth adjustment procedure is a sequence of iterative decision making.
no code implementations • 14 Apr 2021 • Hak Gu Kim, Sangmin Lee, Seongyeop Kim, Heoun-taek Lim, Yong Man Ro
To better understand VR sickness, it is necessary to predict and provide the levels of its major symptoms rather than an overall degree of VR sickness.
1 code implementation • CVPR 2021 • Sangmin Lee, Hak Gu Kim, Dae Hwi Choi, Hyung-Il Kim, Yong Man Ro
Our work addresses long-term motion context issues for predicting future frames.
Ranked #1 on Video Prediction on KTH (Cond metric)
no code implementations • ICCV 2021 • Jung Uk Kim, Sungjune Park, Yong Man Ro
The purpose of the proposed large-scale embedding learning is to memorize and recall the large-scale pedestrian appearance via the LPR Memory.
1 code implementation • 1 Jan 2021 • Byung-Kwan Lee, Youngjoon Yu, Yong Man Ro
Recent works have applied Bayesian Neural Networks (BNNs) to adversarial training and shown improved adversarial robustness via the BNN's stochastic gradient defense.
no code implementations • 16 Jul 2020 • Joanna Hong, Jung Uk Kim, Sangmin Lee, Yong Man Ro
Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units.
no code implementations • 22 May 2020 • Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim, Jung Uk Kim, Yong Man Ro
The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary information between multiple input data.
no code implementations • 21 May 2020 • Hakmin Lee, Hong Joo Lee, Seong Tae Kim, Yong Man Ro
After the ensemble models are trained, the random layer sampling method can efficiently hide the gradient and avoid gradient-based attacks.
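The gradient-hiding idea is that each forward pass traverses a randomly sampled subset of layers, so an attacker never sees a fixed computation graph. A toy sketch of that sampling step for a residual stack (names and the residual form are assumptions for illustration):

```python
import numpy as np

def forward_with_random_layer_sampling(x, layers, keep_prob=0.5, rng=None):
    # At each forward pass, randomly skip residual layers, so the
    # effective sub-network (and hence its gradient) changes per call.
    rng = np.random.default_rng() if rng is None else rng
    for layer in layers:
        if rng.random() < keep_prob:
            x = x + layer(x)  # residual layer kept this pass
    return x
```

With `keep_prob=1.0` the full network runs; lower values yield a different random sub-network on every call, which is what frustrates gradient-based attacks.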
no code implementations • 21 May 2020 • Byeong Cheon Kim, Jung Uk Kim, Hakmin Lee, Yong Man Ro
Through the comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in the autoencoders.
no code implementations • 21 May 2020 • Hong Joo Lee, Seong Tae Kim, Hakmin Lee, Nassir Navab, Yong Man Ro
Experimental results show that the proposed method can provide useful uncertainty information through Bayesian approximation with efficient ensemble model generation and improve predictive performance.
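Ensemble-based Bayesian approximation typically reports the ensemble mean as the prediction and the across-member variance as an uncertainty proxy. A minimal sketch of that readout (the helper name is hypothetical):

```python
import numpy as np

def ensemble_predict(models, x):
    # Approximate the predictive distribution with the ensemble:
    # mean prediction plus per-output variance as uncertainty.
    preds = np.array([m(x) for m in models])
    return preds.mean(axis=0), preds.var(axis=0)
```

When the ensemble members agree, the variance is near zero; disagreement signals inputs on which the prediction should be trusted less.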
no code implementations • 2 Jul 2019 • Minho Park, Hak Gu Kim, Yong Man Ro
Generating realistic-looking images with large variations (e.g., large spatial deformations and large pose changes), however, is very challenging.
no code implementations • 10 Jun 2019 • Hyebin Lee, Seong Tae Kim, Yong Man Ro
The ambiguity of the decision-making process has been pointed out as the main obstacle to applying deep learning-based methods in practice, in spite of their outstanding performance.
no code implementations • 16 Nov 2018 • Wissam J. Baddar, Yong Man Ro
The effectiveness of the proposed mode variational LSTM is verified using the facial expression recognition task.
Facial Expression Recognition (FER)
no code implementations • 17 Sep 2018 • Jae-Hyeok Lee, Seong Tae Kim, Hakmin Lee, Yong Man Ro
To train a deep network model that behaves well in bio-image computing fields, a large amount of labeled data is required.
no code implementations • 23 May 2018 • Seong Tae Kim, Hakmin Lee, Hak Gu Kim, Yong Man Ro
In this paper, we investigate interpretability in CADx with the proposed interpretable CADx (ICADx) framework.
no code implementations • 23 Apr 2018 • Sangmin Lee, Hak Gu Kim, Yong Man Ro
In this paper, we propose a novel abnormal event detection method with spatio-temporal adversarial networks (STAN).
Ranked #17 on Anomaly Detection on ShanghaiTech
no code implementations • 11 Apr 2018 • Heoun-taek Lim, Hak Gu Kim, Yong Man Ro
The proposed human perception guider criticizes the predicted quality score of the predictor with the human perceptual score using adversarial learning.
no code implementations • 11 Apr 2018 • Hak Gu Kim, Wissam J. Baddar, Heoun-taek Lim, Hyunwook Jeong, Yong Man Ro
This paper proposes a new objective metric of exceptional motion in VR video contents for VR sickness assessment.
no code implementations • 10 Dec 2017 • Wissam J. Baddar, Geonmo Gu, Sangmin Lee, Yong Man Ro
The spatial constructs of a generated video sequence are acquired from the target image.
no code implementations • ECCV 2018 • Seong Tae Kim, Yong Man Ro
In this paper, a novel deep learning approach, named facial dynamics interpreter network, has been proposed to interpret the important relations between local dynamics for estimating facial traits from expression sequence.
no code implementations • 29 Nov 2017 • Wissam J. Baddar, Yong Man Ro
At test time, most spatio-temporal encoding methods assume that a temporally segmented sequence is fed to a learned model, which could require the prediction to wait until the full sequence is available to an auxiliary task that performs the temporal segmentation.
no code implementations • 28 Nov 2017 • Geonmo Gu, Seong Tae Kim, Kihyun Kim, Wissam J. Baddar, Yong Man Ro
Generating additional training samples through a generative model is helpful in addressing the lack of training data.
no code implementations • 11 Aug 2017 • Jung Uk Kim, Hak Gu Kim, Yong Man Ro
In this paper, we propose a novel medical image segmentation method using an iterative deep learning framework.
no code implementations • 10 Aug 2017 • Hak Gu Kim, Yeoreum Choi, Yong Man Ro
This paper presents a new transfer learning-based approach to medical image classification that mitigates the insufficient labeled data problem in the medical domain.
no code implementations • 31 Jul 2017 • Tae Kwan Lee, Wissam J. Baddar, Seong Tae Kim, Yong Man Ro
Our classification results on Multi-PIE dataset for facial expression recognition and CIFAR-10 dataset for object classification reveal that the compact CNN with the proposed logarithmic filter grouping scheme outperforms the same network with the uniform filter grouping in terms of accuracy and parameter efficiency.
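A logarithmic filter grouping can be sketched as group sizes that shrink by powers of two rather than being uniform; the exact schedule in the paper may differ, so this is only an illustrative sketch (names hypothetical):

```python
import numpy as np

def logarithmic_group_sizes(total_filters, num_groups):
    # Split filters into groups whose sizes shrink roughly by powers
    # of two (a logarithmic schedule), largest group first.
    weights = np.array([2.0 ** -g for g in range(num_groups)])
    sizes = np.floor(total_filters * weights / weights.sum()).astype(int)
    sizes[0] += total_filters - sizes.sum()  # absorb rounding remainder
    return sizes.tolist()
```

For example, 64 filters in 3 groups split as roughly 4:2:1, unlike a uniform grouping's equal thirds, while still using every filter.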
no code implementations • 31 May 2017 • Seong Tae Kim, Yong Man Ro
To improve the effectiveness of learning with instructional videos, observation and evaluation of the activity are required.