Zero-shot Scene Classification (unified classes)
2 papers with code • 1 benchmark • 1 dataset
Most implemented papers
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose VIDAL-10M, comprising Video, Infrared, Depth, Audio, and their corresponding Language.
ImageBind: One Embedding Space To Bind Them All
We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.