1 code implementation • 18 Jan 2024 • Yang Zhan, Zhitong Xiong, Yuan Yuan
Specifically, RS visual features are projected into the language domain via an alignment layer and then fed, together with task-specific instructions, into an LLM-based RS decoder to predict answers for open-ended RS tasks.
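A minimal sketch of this kind of pipeline, assuming a simple linear alignment layer and concatenation of visual tokens with instruction embeddings before the LLM decoder (all dimensions, shapes, and names here are illustrative assumptions, not the paper's actual configuration):

```python
import torch
import torch.nn as nn

class AlignmentLayer(nn.Module):
    """Hypothetical sketch: projects RS visual features into the
    LLM token-embedding space via a single linear map."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vis_feats)

# Aligned visual tokens are concatenated with embedded instruction tokens
# to form the input sequence for an LLM-based decoder.
vis = torch.randn(2, 196, 1024)      # RS image patch features (assumed shape)
instr = torch.randn(2, 32, 4096)     # embedded task-specific instructions (assumed shape)
aligned = AlignmentLayer(1024, 4096)(vis)
llm_input = torch.cat([aligned, instr], dim=1)   # (2, 196 + 32, 4096)
```

The decoder itself (omitted) would consume `llm_input` autoregressively to generate answers for open-ended RS tasks.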
1 code implementation • 13 Dec 2023 • Yang Zhan, Yuan Yuan, Zhitong Xiong
To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization.
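One plausible way to fuse text cues carrying appearance and geometry information with visual tokens, as in transformer-based grounding networks, is cross-attention from image tokens to text embeddings. The sketch below is a generic illustration under assumed shapes and a standard 7-parameter 3D box head, not the Mono3DVG-TR architecture itself:

```python
import torch
import torch.nn as nn

# Image queries attend to the referring expression's text embeddings,
# injecting language cues into the visual representation.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
img_tokens = torch.randn(2, 100, 256)  # visual tokens from a transformer encoder (assumed)
txt_tokens = torch.randn(2, 20, 256)   # text embeddings of the query sentence (assumed)
fused, _ = attn(query=img_tokens, key=txt_tokens, value=txt_tokens)

# A small head could then regress a 3D box from the fused features;
# (x, y, z, w, h, l, yaw) is one common parameterization, assumed here.
head = nn.Linear(256, 7)
boxes = head(fused.mean(dim=1))        # one box per image in the batch
```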
1 code implementation • 24 Aug 2023 • Yuan Yuan, Yang Zhan, Zhitong Xiong
To address this issue, we investigate parameter-efficient transfer learning (PETL) to transfer visual-language knowledge from the natural-image domain to the RS domain effectively and efficiently on the image-text retrieval task.
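The core idea of PETL methods such as bottleneck adapters is to freeze the pretrained backbone and train only a small number of inserted parameters. A minimal sketch, assuming a residual bottleneck adapter (the dimensions and the stand-in backbone block are illustrative, not the paper's design):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Hypothetical bottleneck adapter: down-project, nonlinearity,
    up-project, with a residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual update

backbone = nn.Linear(512, 512)      # stand-in for one frozen pretrained block
for p in backbone.parameters():
    p.requires_grad = False          # pretrained weights stay fixed
adapter = Adapter(512)               # only adapter parameters are trained

x = torch.randn(4, 512)
out = adapter(backbone(x))
trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
```

Here `trainable` is a small fraction of `frozen`, which is what makes the transfer parameter-efficient.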
Ranked #3 on Cross-Modal Retrieval on RSICD
1 code implementation • 23 Oct 2022 • Yang Zhan, Zhitong Xiong, Yuan Yuan
However, object-level visual grounding on RS images remains underexplored.