no code implementations • 1 Apr 2024 • Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig
Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question.
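The "Traverse"/"Locate"/"Evaluate" loop described above can be sketched in a few lines. This is a hypothetical illustration of the control flow only, not the authors' implementation; all function names and the frame-selection heuristic are assumptions.

```python
# Hypothetical sketch of a Traverse/Locate/Evaluate loop in the spirit of
# TraveLER. The planner, per-frame questioner, and stopping rule are stubs.

def make_plan(question, num_frames):
    # Traverse: choose an initial set of frame indices to inspect.
    step = max(1, num_frames // 4)
    return list(range(0, num_frames, step))

def locate(frame, question, memory):
    # Locate: ask a question about one frame and store key information.
    memory.append(f"frame {frame}: evidence for '{question}'")

def evaluate(memory, needed=3):
    # Evaluate: decide whether enough information has been gathered.
    return len(memory) >= needed

def answer_video_question(question, num_frames=16):
    memory = []
    for frame in make_plan(question, num_frames):
        locate(frame, question, memory)
        if evaluate(memory):
            break  # stop traversing once the evidence suffices
    return memory

notes = answer_video_question("who opens the door?")
```

In a real system, `locate` would query a vision-language model on the selected frame and `evaluate` would ask an LLM whether the stored notes answer the question.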
1 code implementation • 28 Dec 2023 • Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell
Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation).
Ranked #1 on Unsupervised Panoptic Segmentation on COCO val2017
no code implementations • 4 Dec 2023 • Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA).
no code implementations • 29 Nov 2023 • Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach
Existing video-based action recognition systems typically require dense annotation and struggle in environments where there is significant distribution shift relative to the training data.
1 code implementation • 27 Nov 2023 • Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig
The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks.
Ranked #30 on Visual Reasoning on Winoground
no code implementations • 10 May 2023 • Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene.
Ranked #24 on Visual Reasoning on Winoground
1 code implementation • CVPR 2023 • Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, Leonid Karlinsky
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks.
no code implementations • 8 Dec 2022 • Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson
In this work, we propose an approach to leverage synthetic scene data for improving video understanding.
1 code implementation • 21 Nov 2022 • Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks.
1 code implementation • 8 Sep 2022 • Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky
However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g., retrieving technical illustrations from car manuals using language queries), for which the data is either unseen or belongs to a long-tail part of the data distribution of the huge datasets used for FM pre-training.
Ranked #1 on Image-to-Text Retrieval on FETA Car-Manuals
no code implementations • 15 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
First, as both images and videos contain structured information, we enrich a transformer model with a set of "object tokens" that can be used across images and videos.
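The "object tokens" idea can be illustrated with a minimal sketch: a fixed set of learnable tokens is concatenated to the patch tokens of each image or frame, so the same tokens can carry object-level structure across both modalities. The shapes and names below are assumptions for illustration, not the paper's code.

```python
import numpy as np

# Illustrative sketch: append learnable "object tokens" to the patch
# tokens produced by a vision backbone, before the transformer layers.
rng = np.random.default_rng(0)
num_patches, num_objects, dim = 196, 8, 64

patch_tokens = rng.normal(size=(num_patches, dim))   # from the backbone
object_tokens = rng.normal(size=(num_objects, dim))  # learned parameters

# The transformer then attends over the combined sequence.
tokens = np.concatenate([patch_tokens, object_tokens], axis=0)
```

For video, the same `object_tokens` would be shared across frames, which is what lets structure learned from images transfer to video inputs.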
Point-of-no-return (PNR) Temporal Localization • Temporal Localization
no code implementations • 13 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
We explore a particular instantiation of scene structure, namely a "Hand-Object Graph", consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.
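A Hand-Object Graph as described above can be written down directly: nodes are hands and objects with their locations, and edges carry the contact relation. The dictionary/tuple encoding here is purely illustrative, not the authors' data format.

```python
# Minimal sketch of a Hand-Object Graph: nodes hold entity type and a
# normalized bounding box; edges hold a contact/no-contact relation.
nodes = [
    {"id": 0, "type": "hand",   "box": (0.10, 0.20, 0.30, 0.40)},
    {"id": 1, "type": "object", "box": (0.25, 0.30, 0.45, 0.55)},
    {"id": 2, "type": "object", "box": (0.70, 0.10, 0.90, 0.30)},
]
edges = [
    (0, 1, "contact"),     # the hand touches the first object
    (0, 2, "no-contact"),  # the second object is not being touched
]

# Querying the graph, e.g. for all hand-object contacts:
contacts = [(a, b) for a, b, rel in edges if rel == "contact"]
```

In the paper's setting such graphs are predicted per frame and used as structured supervision or input for the video model.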
1 code implementation • CVPR 2022 • Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky
The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system.
1 code implementation • CVPR 2022 • Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
In this work, we present Object-Region Video Transformers (ORViT), an "object-centric" approach that extends video transformer layers with a block that directly incorporates object representations.
Ranked #6 on Action Recognition on Diving-48
1 code implementation • CVPR 2022 • Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of the detection architecture.
Ranked #1 on Few-Shot Object Detection on COCO 2017
no code implementations • 30 Sep 2020 • Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, Amir Globerson
In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations.
1 code implementation • 27 Jun 2020 • Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson
Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.
1 code implementation • CVPR 2020 • Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, Trevor Darrell
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations.
2 code implementations • ECCV 2020 • Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images.
Ranked #3 on Layout-to-Image Generation on Visual Genome 256x256
1 code implementation • 1 May 2019 • Eli Brosh, Matan Friedmann, Ilan Kadar, Lev Yitzhak Lavy, Elad Levi, Shmuel Rippa, Yair Lempert, Bruno Fernandez-Ruiz, Roei Herzig, Trevor Darrell
We propose a hybrid coarse-to-fine approach that leverages visual and GPS location cues.
5 code implementations • CVPR 2019 • Eran Goldman, Roei Herzig, Aviv Eisenschtat, Oria Ratzon, Itsik Levi, Jacob Goldberger, Tal Hassner
We propose a novel, deep-learning based method for precise object detection, designed for such challenging settings.
Ranked #4 on Dense Object Detection on SKU-110K
1 code implementation • 26 Feb 2019 • Moshiko Raboh, Roei Herzig, Gal Chechik, Jonathan Berant, Amir Globerson
In many domains, it is preferable to train systems jointly in an end-to-end manner, but SGs are not commonly used as intermediate components in visual reasoning systems because, being discrete and sparse, scene-graph representations are non-differentiable and difficult to optimize.
1 code implementation • 4 Dec 2018 • Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor Darrell
Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance.
1 code implementation • NeurIPS 2018 • Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, Amir Globerson
Machine understanding of complex images is a key goal of artificial intelligence.