no code implementations • 16 Nov 2020 • Aman Chadha, Gurneet Arora, Navpreet Kaloty
Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which, in some cases, fails to describe the correct contextual relationships between events or leads to incorrect underlying visual attention.
Ranked #4 on Video Question Answering on TVQA