Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on enlarging the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieved first place on the highly competitive ADE20K test server leaderboard on the day of submission.
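The core idea above — encoding an image as a sequence of patch tokens rather than a feature map — can be illustrated with a minimal sketch. This is not the authors' code; the patch size (16), image size, and the toy linear projection below are illustrative assumptions in the spirit of the paper's ViT-style encoder input.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into an (N, patch*patch*C) token sequence,
    where N = (H/patch) * (W/patch). Each patch is flattened row-major."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    # Reshape into a grid of patches, then flatten each patch into one token.
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    tokens = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens

rng = np.random.default_rng(0)
img = rng.standard_normal((512, 512, 3))
seq = patchify(img)                              # (1024, 768): 32x32 patches
proj = rng.standard_normal((768, 1024))          # toy patch-embedding matrix
embedded = seq @ proj                            # token embeddings fed to the transformer
print(seq.shape, embedded.shape)                 # (1024, 768) (1024, 1024)
```

Because the transformer encoder never downsamples this sequence, every layer attends over all 1024 tokens, which is the "global context modeled in every layer" property the abstract refers to; the decoder only has to reshape and upsample the token grid back to pixel resolution.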

CVPR 2021

Results from the Paper


Ranked #2 on Semantic Segmentation on FoodSeg103 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | SETR-MLA (160k, MS) | Validation mIoU | 50.28 | #109 | |
| Semantic Segmentation | Cityscapes test | SETR-PUP++ | Mean IoU (class) | 81.64% | #36 | |
| Semantic Segmentation | Cityscapes val | SETR-PUP (80k, MS) | mIoU | 82.15 | #33 | |
| Semantic Segmentation | DADA-seg | SETR (MLA, Transformer-Large) | mIoU | 30.4 | #5 | |
| Semantic Segmentation | DADA-seg | SETR (PUP, Transformer-Large) | mIoU | 31.8 | #4 | |
| Semantic Segmentation | DensePASS | SETR (MLA, Transformer-L) | mIoU | 35.6% | #19 | |
| Semantic Segmentation | DensePASS | SETR (PUP, Transformer-L) | mIoU | 35.7% | #18 | |
| Semantic Segmentation | FoodSeg103 | SeTR-MLA (ViT-16/B) | mIoU | 45.1 | #2 | Yes |
| Semantic Segmentation | PASCAL Context | SETR-MLA (16, 80k, MS) | mIoU | 55.83 | #24 | |
| Medical Image Segmentation | Synapse multi-organ CT | SETR | Avg DSC | 79.60 | #11 | |
| Semantic Segmentation | UrbanLF | SETR (ViT-Large) | mIoU (Real) | 77.74 | #6 | |
| Semantic Segmentation | UrbanLF | SETR (ViT-Large) | mIoU (Syn) | 77.69 | #9 | |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Semantic Segmentation | FoodSeg103 | SeTR-Naive (ViT-16/B) | mIoU | 41.3 | #5 | |

Methods