Vision and Language Pre-Trained Models

In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via a contrastive loss (formulated as a normalized softmax) that pulls the embeddings of matched image-text pairs together and pushes those of non-matched pairs apart, so the model learns to align the visual and language representations of each image-text pair. The resulting representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search, including image-to-text search, text-to-image search, and even search with joint image+text queries.
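The contrastive objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the fixed `temperature` value, and the choice to sum the image-to-text and text-to-image terms are assumptions made for the example, and the embeddings are assumed to come from some pair of image and text encoders not shown here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.05):
    """Normalized-softmax contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other row in the batch acts as a negative.
    The temperature is a fixed, hypothetical value chosen for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])            # matched pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return loss_i2t + loss_t2i


def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

The same similarity matrix serves both training and inference: during training the diagonal entries are pushed up relative to the rest of their row and column, and at inference `zero_shot_classify` simply picks the highest-scoring text embedding for each image, which is how zero-shot classification and text-based retrieval can work without fine-tuning.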

Source: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Tasks


Task                      Papers   Share
Language Modelling        47       6.28%
Large Language Model      24       3.21%
Retrieval                 20       2.67%
Image Generation          19       2.54%
Semantic Segmentation     17       2.27%
Domain Adaptation         15       2.01%
Question Answering        14       1.87%
Recommendation Systems    13       1.74%
Decision Making           13       1.74%
