1 code implementation • 16 Apr 2024 • Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang
We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization.
Ranked #34 on Visual Question Answering on MM-Vet
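For reference, the preference-alignment step mentioned in the entry above relies on direct preference optimization (DPO). The snippet below is a minimal sketch of the standard DPO objective for a single chosen/rejected pair, not the authors' code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments are sequence log-probabilities (hypothetical inputs);
    beta scales the implicit reward and its default is an assumption.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both responses the loss equals log 2, and it decreases as the policy shifts probability toward the chosen response relative to the reference.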
1 code implementation • 15 Apr 2024 • Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang
To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information.
3 code implementations • 27 Feb 2024 • Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, He Wang, Li Yi, Kaisheng Ma
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages.
Ranked #1 on 3D Point Cloud Linear Classification on ModelNet40
3D Object Captioning • 3D Point Cloud Linear Classification • +10
no code implementations • 23 Jan 2024 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
In Vary-toy, we introduce an improved vision vocabulary, allowing the model to not only possess all features of Vary but also gather more generality.
Ranked #81 on Visual Question Answering on MM-Vet
1 code implementation • 11 Dec 2023 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs.
Ranked #56 on Visual Question Answering on MM-Vet
no code implementations • 30 Nov 2023 • En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them.
Ranked #66 on Visual Question Answering on MM-Vet
1 code implementation • 20 Sep 2023 • Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, HongYu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi
This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation.
Ranked #2 on Visual Question Answering on MMBench (GPT-3.5 score metric)
no code implementations • 18 Jul 2023 • Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
Besides, GroupLane with ResNet18 still surpasses PersFormer by 4.9% F1 score, while its inference speed is nearly 7x faster and its FLOPs are only 13.3% of PersFormer's.
no code implementations • 18 Jul 2023 • Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, HongYu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience.
no code implementations • 30 Jun 2023 • Weixin Mao, Jinrong Yang, Zheng Ge, Lin Song, HongYu Zhou, Tiezheng Mao, Zeming Li, Osamu Yoshie
In light of the success of sample mining techniques in 2D object detection, we propose a simple yet effective mining strategy for improving depth perception in 3D object detection.
1 code implementation • 16 Jun 2023 • Dongming Wu, Fan Jia, Jiahao Chang, Zhuoling Li, Jianjian Sun, Chunrui Han, Shuailin Li, Yingfei Liu, Zheng Ge, Tiancai Wang
We present the 1st-place solution for OpenLane Topology in the Autonomous Driving Challenge.
1 code implementation • 15 Apr 2023 • Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang
We propose a metric, recall of best-regressed samples, to quantitatively evaluate the misalignment problem.
no code implementations • 9 Apr 2023 • Yinhao Li, Jinrong Yang, Jianjian Sun, Han Bao, Zheng Ge, Li Xiao
Bounded by the inherent ambiguity of depth perception, contemporary multi-view 3D object detection methods hit a performance bottleneck.
no code implementations • 10 Mar 2023 • Chunrui Han, Jinrong Yang, Jianjian Sun, Zheng Ge, Runpei Dong, HongYu Zhou, Weixin Mao, Yuang Peng, Xiangyu Zhang
In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and efficient fusion pipeline.
3 code implementations • 5 Feb 2023 • Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, Li Yi
This motivates us to learn 3D representations by sharing the merits of both paradigms, which is non-trivial due to the pattern difference between the two paradigms.
Ranked #1 on Zero-Shot Transfer 3D Point Cloud Classification on ModelNet10 (using extra training data)
3 code implementations • 16 Dec 2022 • Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, Kaisheng Ma
The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages.
Ranked #5 on Few-Shot 3D Point Cloud Classification on ModelNet40 10-way (10-shot) (using extra training data)
Few-Shot 3D Point Cloud Classification • Knowledge Distillation • +1
2 code implementations • ICCV 2023 • HongYu Zhou, Zheng Ge, Zeming Li, Xiangyu Zhang
This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view transformation method for 3D perception, dubbed MatrixVT.
Ranked #2 on Bird's-Eye View Semantic Segmentation on nuScenes (IoU lane - 224x480 - 100x100 at 0.5 metric)
no code implementations • 15 Nov 2022 • Jinrong Yang, Tiancai Wang, Zheng Ge, Weixin Mao, Xiaoping Li, Xiangyu Zhang
We propose a temporal 2D transformation to bridge the 3D predictions with temporal 2D labels.
1 code implementation • CVPR 2023 • Shichao Dong, Jin Wang, Renhe Ji, Jiajun Liang, Haoqiang Fan, Zheng Ge
In this paper, we analyse the generalization ability of binary classifiers for the task of deepfake detection.
3 code implementations • 21 Sep 2022 • Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, Zeming Li
To this end, we introduce an effective temporal stereo method that dynamically selects the scale of matching candidates, significantly reducing computation overhead.
Ranked #11 on 3D Object Detection on nuScenes Camera Only
no code implementations • 22 Aug 2022 • Zengran Wang, Chen Min, Zheng Ge, Yinhao Li, Zeming Li, Hongyu Yang, Di Huang
Instead of using a sole monocular depth method, in this work, we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometry correspondence between frames across time to facilitate accurate depth learning.
no code implementations • 19 Aug 2022 • HongYu Zhou, Zheng Ge, Weixin Mao, Zeming Li
To address this problem, we revisit the generation of BEV representation and propose detecting objects in perspective BEV -- a new BEV representation that does not require feature sampling.
2 code implementations • 6 Jul 2022 • HongYu Zhou, Zheng Ge, Songtao Liu, Weixin Mao, Zeming Li, Haiyan Yu, Jian Sun
To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which need a sequence of post-processing with fine-tuned hyper-parameters.
2 code implementations • 21 Jun 2022 • Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, Zeming Li
In this research, we propose a new 3D object detector with a trustworthy depth estimation, dubbed BEVDepth, for camera-based Bird's-Eye-View (BEV) 3D object detection.
Ranked #4 on 3D Object Detection on Rope3D
1 code implementation • 27 Jul 2021 • Songyang Zhang, Lin Song, Songtao Liu, Zheng Ge, Zeming Li, Xuming He, Jian Sun
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
41 code implementations • 18 Jul 2021 • Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun
In this report, we present some experienced improvements to the YOLO series, forming a new high-performance detector -- YOLOX.
Ranked #1 on Real-Time Object Detection on Argoverse-HD (Detection-Only, Val) (using extra training data)
2 code implementations • CVPR 2021 • Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, Jian Sun
Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object.
Ranked #73 on Object Detection on COCO test-dev
1 code implementation • 12 Jan 2021 • Zheng Ge, JianFeng Wang, Xin Huang, Songtao Liu, Osamu Yoshie
A joint loss is then defined as the weighted summation of cls and reg losses as the assigning indicator.
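The joint loss described in the entry above can be sketched as follows. This is only an illustration under stated assumptions: the excerpt gives the weighted sum of classification and regression losses as the assigning indicator, while the weight `reg_weight`, the rule of picking the `top_k` lowest-loss anchors as positives, and both function names are hypothetical choices made here for the example.

```python
def joint_loss(cls_loss, reg_loss, reg_weight=1.0):
    """Weighted sum of classification and regression losses for one anchor."""
    return cls_loss + reg_weight * reg_loss


def assign_positives(cls_losses, reg_losses, reg_weight=1.0, top_k=1):
    """Return indices of the anchors with the lowest joint loss.

    Selecting the top_k lowest-loss anchors is an assumed selection rule,
    not one stated in the excerpt.
    """
    scores = [joint_loss(c, r, reg_weight)
              for c, r in zip(cls_losses, reg_losses)]
    # Lower joint loss means the anchor is a better candidate.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:top_k]


# Per-anchor losses against one ground-truth object (made-up numbers).
cls_losses = [0.9, 0.2, 0.5]
reg_losses = [0.1, 0.3, 0.1]
positives = assign_positives(cls_losses, reg_losses, reg_weight=2.0, top_k=2)
```

Using a single combined score means an anchor is favored only when it is good at both classification and localization for that object.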
no code implementations • 23 May 2020 • Zheng Ge, Zequn Jie, Xin Huang, Chengzheng Li, Osamu Yoshie
The first imbalance lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) become highly biased towards the negative proposals in the early training stage.
no code implementations • CVPR 2020 • Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie
To acquire the visible parts, a novel Paired-Box Model (PBM) is proposed to simultaneously predict the full and visible boxes of a pedestrian.
no code implementations • 16 Mar 2020 • Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, Osamu Yoshie
PS-RCNN first detects slightly occluded or non-occluded objects with an R-CNN module (referred to as P-RCNN), and then suppresses the detected instances with human-shaped masks so that the features of heavily occluded instances can stand out.
Ranked #2 on Object Detection on WiderPerson