no code implementations • 24 Apr 2024 • Tushar Nagarajan, Lorenzo Torresani
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress.
no code implementations • 20 Feb 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
1 code implementation • 30 Nov 2023 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
no code implementations • 24 Jul 2023 • Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales.
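As a rough illustration of the multiscale idea, the sketch below builds prediction targets by mean-pooling the features of future clips over several horizons; the function name, the pooling choice, and the horizon values are assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def multiscale_targets(clip_feats, start, horizons=(2, 4, 8)):
    """For each timescale, aggregate (mean-pool) the features of the
    next `h` clips after `start` into one prediction target.
    clip_feats: (T, D) array of per-clip features."""
    targets = []
    for h in horizons:
        window = clip_feats[start + 1 : start + 1 + h]
        targets.append(window.mean(axis=0))
    return np.stack(targets)  # (len(horizons), D)

# toy example: 16 clips with 8-dim features
feats = np.random.randn(16, 8)
tgts = multiscale_targets(feats, start=0)
```

A forecasting model would then be trained to regress these multi-horizon targets from the clips seen so far.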
no code implementations • ICCV 2023 • Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani
To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks.
no code implementations • 9 Mar 2023 • Tarun Kalluri, Weiyao Wang, Heng Wang, Manmohan Chandraker, Lorenzo Torresani, Du Tran
Many top-down architectures for instance segmentation achieve significant success when trained and tested on a pre-defined closed-world taxonomy.
no code implementations • 16 Feb 2023 • Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.
no code implementations • 3 Feb 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
With no modification to the baseline architectures, our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR keyframe localization challenge.
1 code implementation • CVPR 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text.
Ranked #3 on Action Recognition on Charades-Ego
no code implementations • 5 Jan 2023 • Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies.
no code implementations • ICCV 2023 • Huiyu Wang, Mitesh Kumar Singh, Lorenzo Torresani
We find that this renders exocentric transferring unnecessary by showing remarkably strong results achieved by this simple Ego-Only approach on three established egocentric video datasets: Ego4D, EPIC-Kitchens-100, and Charades-Ego.
no code implementations • CVPR 2023 • Xitong Yang, Fu-Jen Chu, Matt Feiszli, Raghav Goyal, Lorenzo Torresani, Du Tran
In this paper, we propose to study these problems in a joint framework for long video understanding.
no code implementations • CVPR 2023 • Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another).
no code implementations • 13 Sep 2022 • Joseph DiPalma, Lorenzo Torresani, Saeed Hassanpour
These findings suggest that HistoPerm can be a valuable tool for improving representation learning of histopathology features when access to labeled data is limited and can lead to whole-slide classification results that are comparable to or superior to fully-supervised methods.
no code implementations • CVPR 2022 • Jue Wang, Lorenzo Torresani
Video transformers have recently emerged as an effective alternative to convolutional networks for action classification.
no code implementations • 28 Jan 2022 • Jerry Wei, Lorenzo Torresani, Jason Wei, Saeed Hassanpour
Moreover, we find that using model confidence as a proxy for annotator agreement also improves calibration and accuracy, suggesting that datasets without multiple annotators can still benefit from our proposed label smoothing methods via our proposed confidence-aware label smoothing methods.
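A minimal sketch of agreement-based label smoothing, assuming the simplest scheme: the soft label's peak equals the fraction of annotators who chose the ground-truth class, with the remaining mass spread uniformly. The function name and exact allocation are illustrative, not the paper's formulation.

```python
import numpy as np

def agreement_smoothed_label(num_classes, true_class, agreement):
    """Soft label whose peak scales with annotator agreement.
    agreement: fraction of annotators choosing `true_class` (in (0, 1])."""
    eps = 1.0 - agreement                          # smoothing mass
    label = np.full(num_classes, eps / (num_classes - 1))
    label[true_class] = agreement
    return label

# 90% of annotators agreed on class 2 out of 4 classes
lab = agreement_smoothed_label(4, 2, agreement=0.9)
```

When no annotator counts exist, the same recipe can plug in model confidence for `agreement`, as the confidence-aware variant suggests.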
1 code implementation • CVPR 2022 • Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.
Ranked #3 on Video Classification on Breakfast
1 code implementation • 6 Dec 2021 • Yiren Jian, Lorenzo Torresani
At the same time, training a simple linear classifier on top of "frozen" features learned from the large labeled dataset fails to adapt the model to the properties of the novel classes, effectively inducing underfitting.
7 code implementations • CVPR 2022 • Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.
no code implementations • CVPR 2022 • Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
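The contrastive objective can be sketched as an InfoNCE loss pairing each short-clip embedding with the long-clip embedding from the same video; the temperature value and normalization details here are assumptions, not LSTCL's exact recipe.

```python
import numpy as np

def info_nce(short_emb, long_emb, temperature=0.1):
    """Each short-clip embedding (row i) should match the long-clip
    embedding of the same video and repel long clips of other videos."""
    s = short_emb / np.linalg.norm(short_emb, axis=1, keepdims=True)
    l = long_emb / np.linalg.norm(long_emb, axis=1, keepdims=True)
    logits = s @ l.T / temperature                    # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives on diagonal

rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

Minimizing this pushes the short view to predict the context captured by the longer temporal extent.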
no code implementations • CVPR 2021 • Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang
The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label.
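The standard scheme described above amounts to drawing one random contiguous clip per video per iteration; a minimal sketch (function name and clip length are illustrative):

```python
import random

def sample_clip(num_frames, clip_len):
    """Draw one random contiguous clip from a video; the clip is then
    supervised with the video-level label."""
    start = random.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))

random.seed(0)
clip = sample_clip(num_frames=300, clip_len=16)
```

The paper's point of departure is that this per-clip supervision ignores video-level structure across clips.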
no code implementations • 11 Feb 2021 • Leda Sari, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input.
13 code implementations • 9 Feb 2021 • Gedas Bertasius, Heng Wang, Lorenzo Torresani
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ranked #1 on Video Question Answering on Howto100M-QA
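The core "divided" space-time attention idea can be sketched as attending along the time axis first, then the space axis, for a grid of frame-patch tokens. This toy version omits projections, heads, and residuals, so it is a shape-level illustration rather than the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x):
    """x: (T, S, D) tokens for T frames x S patches. Each token attends
    along time (same patch, all frames), then along space (same frame)."""
    xt = x.transpose(1, 0, 2)                 # (S, T, D): batch over patches
    t = attend(xt, xt, xt).transpose(1, 0, 2) # temporal attention, (T, S, D)
    return attend(t, t, t)                    # spatial attention per frame

out = divided_space_time_attention(np.random.randn(4, 9, 8))
```

Factorizing attention this way keeps cost linear in T + S per token rather than T * S.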
no code implementations • 29 Jan 2021 • Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour
With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens traditionally manually examined under a microscope by pathologists.
no code implementations • CVPR 2021 • Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
no code implementations • 16 Jan 2021 • Maxwell Mbabilla Aladago, Lorenzo Torresani
By selecting a weight among a fixed set of random values for each individual connection, our method uncovers combinations of random weights that match the performance of traditionally-trained networks of the same capacity.
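The selection mechanism can be sketched as follows: keep K fixed random candidate values per connection, learn a score per candidate, and use the highest-scoring candidate in the forward pass. Shapes and the name `select_weights` are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_weights(options, scores):
    """Pick, for every connection, the candidate value with the highest
    learned score; only the scores are trained, never the values."""
    idx = scores.argmax(axis=-1)                               # (out, in)
    return np.take_along_axis(options, idx[..., None], axis=-1)[..., 0]

K = 8
options = rng.normal(size=(4, 3, K))  # fixed random candidates per connection
scores = rng.normal(size=(4, 3, K))   # learned selection scores
W = select_weights(options, scores)   # effective (4, 3) weight matrix
```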
no code implementations • 11 Jan 2021 • Joseph DiPalma, Arief A. Suriawinata, Laura J. Tafe, Lorenzo Torresani, Saeed Hassanpour
Our results show that a combination of KD and self-supervision allows the student model to approach, and in some cases, surpass the classification accuracy of the teacher, while being much more efficient.
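The distillation half of this recipe is the standard temperature-softened KL objective; a minimal sketch (the temperature value is illustrative, and the self-supervised term is omitted):

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions; the student learns to mimic the teacher's soft outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T)

t = np.array([[2.0, 1.0, 0.1]])
loss_match = distillation_loss(t, t)                      # identical logits
loss_off = distillation_loss(np.array([[0.0, 0.0, 3.0]]), t)
```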
no code implementations • 29 Sep 2020 • Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Mustafa Nasir-Moin, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour
Based on the nature of histopathology images, a range of difficulty inherently exists among examples, and, since medical datasets are often labeled by multiple annotators, annotator agreement can be used as a natural proxy for the difficulty of a given example.
no code implementations • NeurIPS 2020 • Gedas Bertasius, Lorenzo Torresani
A fully-supervised approach to recognizing object states and their contexts in the real-world is unfortunately marred by the long-tailed, open-ended distribution of the data, which would effectively require massive amounts of annotations to capture the appearance of objects in all their different forms.
1 code implementation • 9 Jul 2020 • Yongqin Xian, Bruno Korbar, Matthijs Douze, Lorenzo Torresani, Bernt Schiele, Zeynep Akata
Few-shot learning aims to recognize novel classes from a few examples.
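A standard few-shot baseline in this setting is the nearest class-mean (prototype) classifier, sketched below on toy 2-D embeddings; this is a generic baseline, not the paper's proposed method.

```python
import numpy as np

def prototype_classify(support, support_labels, query):
    """Average the few support embeddings of each class into a prototype
    and assign each query to the closest one."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in classes])
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

support = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])
pred = prototype_classify(support, labels,
                          np.array([[0.1, 0.1], [4.9, 5.1]]))
```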
no code implementations • 12 Jun 2020 • Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations.
no code implementations • 1 Mar 2020 • Jun Han, Fan Ding, Xianglong Liu, Lorenzo Torresani, Jian Peng, Qiang Liu
In addition, such transform can be straightforwardly employed in gradient-free kernelized Stein discrepancy to perform goodness-of-fit (GOF) test on discrete distributions.
1 code implementation • CVPR 2020 • Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical.
Ranked #8 on Action Recognition on ActivityNet
no code implementations • CVPR 2020 • Gedas Bertasius, Lorenzo Torresani
We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence.
no code implementations • NeurIPS 2019 • Karim Ahmed, Lorenzo Torresani
Capsule networks have been shown to be powerful models for image classification, thanks to their ability to represent and capture viewpoint variations of an object.
1 code implementation • NeurIPS 2020 • Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran
To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
no code implementations • 19 Jul 2019 • Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, Lorenzo Torresani
However, in current video datasets it has been observed that action classes can often be recognized without any temporal information from a single frame of video.
no code implementations • 10 Jun 2019 • Yufei Wang, Du Tran, Lorenzo Torresani
It consists of a shared 2D spatial convolution followed by two parallel point-wise convolutional layers, one devoted to images and the other to videos.
no code implementations • CVPR 2020 • Heng Wang, Du Tran, Lorenzo Torresani, Matt Feiszli
Motion is a salient cue to recognize actions in video.
Ranked #108 on Action Classification on Kinetics-400
3 code implementations • NeurIPS 2019 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.
Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)
no code implementations • 10 Apr 2019 • Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai
This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements.
no code implementations • ICCV 2019 • Bruno Korbar, Du Tran, Lorenzo Torresani
We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips.
Ranked #1 on Action Recognition on miniSports
7 code implementations • ICCV 2019 • Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli
It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.
Ranked #1 on Action Recognition on Sports-1M
no code implementations • ICCV 2019 • Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan
In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets.
Ranked #72 on Action Recognition on HMDB-51 (using extra training data)
no code implementations • 11 Dec 2018 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A.
no code implementations • ECCV 2018 • Jamie Ray, Heng Wang, Du Tran, YuFei Wang, Matt Feiszli, Lorenzo Torresani, Manohar Paluri
The videos retrieved by the search engines are then verified for correctness by human annotators.
no code implementations • ECCV 2018 • Karim Ahmed, Lorenzo Torresani
Such simple connectivity rules are unlikely to yield the optimal architecture for the given problem.
no code implementations • NeurIPS 2018 • Bruno Korbar, Du Tran, Lorenzo Torresani
There is a natural correlation between the visual and auditory elements of a video.
Ranked #7 on Self-Supervised Audio Classification on ESC-50
no code implementations • CVPR 2018 • De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, Juan Carlos Niebles
The ability to capture temporal information has been critical to the development of video understanding models.
no code implementations • ECCV 2018 • Gedas Bertasius, Lorenzo Torresani, Jianbo Shi
We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos.
1 code implementation • CVPR 2018 • Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
Ranked #8 on Pose Tracking on PoseTrack2017 (using extra training data)
2 code implementations • ICCV 2019 • Hang Zhao, Antonio Torralba, Lorenzo Torresani, Zhicheng Yan
This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos.
Ranked #10 on Temporal Action Localization on HACS
20 code implementations • CVPR 2018 • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann Lecun, Manohar Paluri
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
Ranked #3 on Action Recognition on Sports-1M
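The (2+1)D factorization studied in this paper replaces a full t x k x k 3D convolution with a 1 x k x k spatial convolution into `mid` channels followed by a t x 1 x 1 temporal convolution; choosing `mid` to match the parameter budget, as below, follows the paper's construction (bias terms omitted for simplicity).

```python
def conv3d_params(cin, cout, t, k):
    """Parameters of a full t x k x k 3D convolution."""
    return cin * cout * t * k * k

def r2plus1d_params(cin, cout, t, k, mid):
    """(2+1)D factorization: spatial 1 x k x k conv to `mid` channels,
    then temporal t x 1 x 1 conv to `cout` channels."""
    return cin * mid * k * k + mid * cout * t

# pick `mid` so the factorized block matches the 3D conv's budget
cin, cout, t, k = 64, 64, 3, 3
mid = (t * k * k * cin * cout) // (k * k * cin + t * cout)
full = conv3d_params(cin, cout, t, k)
fact = r2plus1d_params(cin, cout, t, k, mid)
```

At equal parameter count, the factorized block adds an extra nonlinearity between the spatial and temporal steps.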
no code implementations • ICLR 2018 • Karim Ahmed, Lorenzo Torresani
While much of the work in the design of convolutional networks over the last five years has revolved around the empirical investigation of the importance of depth, filter sizes, and number of feature channels, recent studies have shown that branching, i.e., splitting the computation along parallel but distinct threads and then aggregating their outputs, represents a promising new dimension for significant improvements in performance.
no code implementations • NeurIPS 2017 • Mohammad Haris Baig, Vladlen Koltun, Lorenzo Torresani
We study the design of deep architectures for lossy image compression.
no code implementations • 20 Apr 2017 • Karim Ahmed, Lorenzo Torresani
We introduce an architecture for large-scale image categorization that enables the end-to-end learning of separate visual features for the different classes to distinguish.
no code implementations • 5 Mar 2017 • Bruno Korbar, Andrea M. Olofson, Allen P. Miraflor, Katherine M. Nicka, Matthew A. Suriawinata, Lorenzo Torresani, Arief A. Suriawinata, Saeed Hassanpour
In this work, we built an automatic image-understanding method that can accurately classify different types of colorectal polyps in whole-slide histology images to help pathologists with histopathological characterization and diagnosis of colorectal polyps.
no code implementations • 23 Jun 2016 • Du Tran, Maksim Bolonkin, Manohar Paluri, Lorenzo Torresani
Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description.
no code implementations • 20 Jun 2016 • Mohammad Haris Baig, Lorenzo Torresani
In the experiments we show that our proposed method outperforms traditional JPEG color coding by a large margin, producing colors that are nearly indistinguishable from the ground truth at the storage cost of just a few hundred bytes for high-resolution pictures!
no code implementations • 24 May 2016 • Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi
In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model.
no code implementations • CVPR 2017 • Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi
It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network.
no code implementations • 20 Apr 2016 • Karim Ahmed, Mohammad Haris Baig, Lorenzo Torresani
The training of our "network of experts" is completely end-to-end: the partition of categories into disjoint subsets is learned simultaneously with the parameters of the network trunk and the experts are trained jointly by minimizing a single learning objective over all classes.
no code implementations • 27 Mar 2016 • Loris Bazzani, Hugo Larochelle, Lorenzo Torresani
In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data.
no code implementations • 20 Nov 2015 • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis.
no code implementations • CVPR 2016 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues.
no code implementations • ICCV 2015 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
We can view this process as a "Low-for-High" scheme, where low-level boundaries aid high-level vision tasks.
no code implementations • 19 Jan 2015 • Mohammad Haris Baig, Lorenzo Torresani
Crucially, the depth basis and the regression function are coupled and jointly optimized by our learning scheme.
no code implementations • CVPR 2015 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
This section of the network is applied to four different scales of the image input.
28 code implementations • ICCV 2015 • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset.
Ranked #8 on Action Recognition on Sports-1M
no code implementations • 13 Sep 2014 • Loris Bazzani, Alessandro Bergamo, Dragomir Anguelov, Lorenzo Torresani
This paper introduces self-taught object localization, a novel approach that leverages deep convolutional networks trained for whole-image recognition to localize objects in images without additional human supervision, i.e., without using any ground-truth bounding boxes for training.
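The masking intuition behind such classifier-driven localization can be sketched as scoring each region by how much the whole-image class score drops when it is grayed out; the grid size, function names, and the toy classifier below are all assumptions for this sketch.

```python
import numpy as np

def mask_out_drop(image, classifier, grid=4):
    """Score each grid cell by the drop in the whole-image class score
    when that cell is zeroed out; large drops indicate the object.
    `classifier` maps an image to a scalar score (assumed interface)."""
    H, W = image.shape[:2]
    base = classifier(image)
    heat = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            masked = image.copy()
            masked[i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid] = 0
            heat[i, j] = base - classifier(masked)
    return heat

# toy classifier: score = mean brightness of the top-left quadrant
clf = lambda im: im[:8, :8].mean()
heat = mask_out_drop(np.ones((16, 16)), clf)
```

Cells overlapping the "object" (here the top-left quadrant) receive positive drops; the rest score zero.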
no code implementations • 20 Dec 2013 • Du Tran, Lorenzo Torresani
We show the generality of our approach by building our mid-level descriptors from two different low-level feature representations.
no code implementations • CVPR 2013 • Alessandro Bergamo, Sudipta N. Sinha, Lorenzo Torresani
In this paper we propose a new technique for learning a discriminative codebook for local feature descriptors, specifically designed for scalable landmark classification.
no code implementations • NeurIPS 2011 • Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance.
no code implementations • NeurIPS 2010 • Alessandro Bergamo, Lorenzo Torresani
In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search.