Who's Waldo is a dataset of 270K image–caption pairs depicting interactions between people, automatically mined from Wikimedia Commons. It serves as a benchmark for person-centric visual grounding: the problem of linking the people named in a caption to the people pictured in the corresponding image.