FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
We propose FocusCLIP, which integrates subject-level guidance, a specialized mechanism for target-specific supervision, into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps that emulate human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions that provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpasses CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks: FocusCLIP achieves an average accuracy of 33.65%, compared to 25.04% for CLIP. We observe a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Furthermore, we demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks: FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification on the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
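The top-1 and top-3 accuracies reported above follow the standard CLIP-style zero-shot protocol: each class name is embedded as text, each image is embedded by the vision encoder, and the image is assigned to the class with the highest cosine similarity. The sketch below illustrates this evaluation loop; it is not the paper's implementation, and the random vectors merely stand in for real encoder outputs.

```python
import numpy as np

def topk_accuracy(image_emb, text_emb, labels, k=1):
    """Zero-shot classification: assign each image to the class whose
    text embedding has the highest cosine similarity, then score top-k."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T               # (num_images, num_classes)
    topk = np.argsort(-sims, axis=1)[:, :k]     # top-k class indices per image
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return hits.mean()

# Illustrative stand-ins: random 512-d vectors in place of real
# CLIP/FocusCLIP encoder outputs (10 images, 4 candidate classes).
rng = np.random.default_rng(0)
images = rng.normal(size=(10, 512))
classes = rng.normal(size=(4, 512))
labels = rng.integers(0, 4, size=10)
print(topk_accuracy(images, classes, labels, k=3))
```

In practice the text embeddings come from prompts such as "a photo of a person {activity}", and FocusCLIP's subject-level guidance changes how the encoders are trained, not this evaluation procedure.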
Datasets

Introduced in the Paper: MPII Human Pose Descriptions

Used in the Paper:
CUB-200-2011, MPII, UTKFace, MPII Human Pose, FER+, EMOTIC, LAGENDA

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Emotion Recognition | EMOTIC | FocusCLIP | Top-3 Accuracy (%) | 13.73 | # 1
Age Classification | EMOTIC | CLIP | Top-1 Accuracy (%) | 37.56 | # 2
Age Classification | EMOTIC | FocusCLIP | Top-1 Accuracy (%) | 41.80 | # 1
Activity Recognition | Stanford40 | CLIP | Top-3 Accuracy (%) | 6.49 | # 2
Activity Recognition | Stanford40 | FocusCLIP | Top-3 Accuracy (%) | 10.47 | # 1