Refining Multimodal Representations using a modality-centric self-supervised module
Tasks that rely on multimodal information typically include a fusion module that combines information from different modalities. In this work, we develop a self-supervised module, called REFINER, that refines multimodal representations using a decoding/defusing module applied downstream of the fused embedding. REFINER imposes a modality-centric responsibility condition, ensuring that both the unimodal and the fused representations are strongly encoded in the latent fusion space. Our approach yields stronger generalization and reduced overfitting. REFINER is applied only at training time, leaving inference cost unchanged. Its modular nature allows it to be combined easily with different fusion architectures. We demonstrate the power of REFINER on three datasets over strong baseline fusion modules, and further show that it gives a significant performance boost in few-shot learning tasks.
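The training-time objective described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`fuse`, `defuse`, `refiner_loss`), the concatenation fusion, the splitting decoder, and the loss weight are all assumptions chosen to show the structure — a task loss augmented with a reconstruction penalty that keeps each modality recoverable from the fused embedding.

```python
# Hedged sketch of a REFINER-style auxiliary objective (illustrative only).
# Embeddings are plain lists of floats; real models would use tensors and
# learned fusion/defusion networks.

def fuse(z_a, z_b):
    """Toy fusion module: concatenate the two unimodal embeddings."""
    return z_a + z_b  # list concatenation

def defuse(z_fused, dim_a):
    """Toy decoding/defusing module: split the fused embedding back
    into its modality-specific parts."""
    return z_fused[:dim_a], z_fused[dim_a:]

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)

def refiner_loss(z_a, z_b, task_loss, weight=0.1):
    """Training-time objective: the downstream task loss plus a
    reconstruction penalty enforcing that each unimodal embedding
    remains recoverable from the fused representation."""
    z_fused = fuse(z_a, z_b)
    r_a, r_b = defuse(z_fused, len(z_a))
    recon = mse(r_a, z_a) + mse(r_b, z_b)
    return task_loss + weight * recon
```

Because the defusing module and reconstruction term are used only to shape the fused representation during training, inference would call `fuse` alone, so the deployed model's latency is unaffected.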