TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
3D Reconstruction	DTU	MVSFormer	Acc	0.327	# 8
3D Reconstruction	DTU	MVSFormer	Overall	0.289	# 2
3D Reconstruction	DTU	MVSFormer	Comp	0.251	# 1
Point Clouds	Tanks and Temples	MVSFormer	Mean F1 (Intermediate)	66.37	# 2
Point Clouds	Tanks and Temples	MVSFormer	Mean F1 (Advanced)	40.87	# 3
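
The three DTU rows are related: on DTU, the Overall score is conventionally the arithmetic mean of accuracy (Acc) and completeness (Comp), both of which are distances where lower is better. The short check below reproduces the Overall value from the other two rows. The function name `dtu_overall` is hypothetical, and the unit (millimetres) is an assumption, since the table does not state it.

```python
# Values copied from the DTU rows of the table above.
# Assumption: distances in millimetres, lower is better.
acc = 0.327   # accuracy
comp = 0.251  # completeness


def dtu_overall(accuracy: float, completeness: float) -> float:
    """Return the mean of accuracy and completeness (hypothetical helper)."""
    return (accuracy + completeness) / 2.0


overall = dtu_overall(acc, comp)
print(f"Overall = {overall:.3f}")  # prints "Overall = 0.289"
```

The result matches the Overall row, so that row carries no information beyond Acc and Comp.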

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mvsformer-learning-robust-image/3d-reconstruction-on-dtu)](https://paperswithcode.com/sota/3d-reconstruction-on-dtu?p=mvsformer-learning-robust-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mvsformer-learning-robust-image/point-clouds-on-tanks-and-temples)](https://paperswithcode.com/sota/point-clouds-on-tanks-and-temples?p=mvsformer-learning-robust-image)`

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

4 Aug 2022 · Chenjie Cao, Xinlin Ren, Yanwei Fu

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) produce poor feature representations for reflective and texture-less areas, which limits the generalization of MVS. Even FPNs combined with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. We therefore ask whether ViTs can facilitate feature learning in MVS. In this paper, we propose a pre-trained ViT-enhanced MVS network called MVSFormer, which learns more reliable feature representations by benefiting from the informative priors of the ViT. The fine-tuned MVSFormer, built on hierarchical ViTs with efficient attention mechanisms, achieves a prominent improvement over FPNs. In addition, we propose an alternative MVSFormer with frozen ViT weights, which largely reduces the training cost while remaining competitive, its performance strengthened by the attention map from self-distillation pre-training. MVSFormer generalizes to various input resolutions through efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification-based and regression-based MVS methods, and propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. In particular, MVSFormer ranks Top-1 on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard.
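
To make the classification-versus-regression distinction concrete, here is a minimal sketch of how a temperature can interpolate between the two ways of reading a depth out of per-hypothesis scores. It is a generic illustration, not the paper's implementation: the function name `depth_from_logits`, the example logits, and the depth hypotheses are all assumptions, and the abstract does not specify how MVSFormer chooses or schedules the temperature. With temperature 1 the softmax expectation behaves like regression (soft-argmax); as the temperature approaches 0 the distribution sharpens and the expectation approaches the depth of the highest-scoring hypothesis, as in classification (argmax).

```python
import math


def depth_from_logits(logits, depth_hypotheses, temperature=1.0):
    """Expected depth under a temperature-scaled softmax (illustrative only).

    logits: per-hypothesis scores for one pixel (hypothetical values).
    depth_hypotheses: candidate depths, same length as logits.
    temperature: > 0; smaller values sharpen the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if len(logits) != len(depth_hypotheses):
        raise ValueError("logits and depth_hypotheses must have equal length")

    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * d for p, d in zip(probs, depth_hypotheses))


# Hypothetical example: four depth candidates for a single pixel.
depths = [1.0, 2.0, 3.0, 4.0]
logits = [0.1, 2.0, 1.8, 0.2]

soft = depth_from_logits(logits, depths, temperature=1.0)    # regression-like
sharp = depth_from_logits(logits, depths, temperature=0.01)  # classification-like
print(soft, sharp)
```

In this example the two best hypotheses have similar scores, so the regression-like estimate falls between their depths, while the low-temperature estimate lands almost exactly on the best-scoring depth.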

Code

ewrfcas/mvsformer official

170

Tasks

3D Reconstruction

Point Clouds

Representation Learning

Datasets

DTU

ETH3D

BlendedMVS

Tanks and Temples

Results from the Paper

Ranked #2 on 3D Reconstruction on DTU

