Dilated Neighborhood Attention Transformer

29 Sep 2022 · Ali Hassani, Humphrey Shi

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
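For intuition, below is a minimal NumPy sketch of the dilated neighborhood attention pattern in 1D: each query attends to a fixed-size neighborhood of keys sampled at a dilation stride, so the span of the receptive field grows with the dilation rate while the per-query cost stays the same as dense NA. This is an illustrative, single-head toy version under stated assumptions, not the paper's 2D formulation or its optimized NATTEN kernels; the function name, the naive gather loop, and the border handling (shifting the window to stay inside the sequence) are choices made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_neighborhood_attention_1d(q, k, v, kernel_size=7, dilation=2):
    """Naive single-head 1D dilated neighborhood attention (illustrative only).

    q, k, v: arrays of shape (seq_len, dim).
    Each query attends to `kernel_size` keys sampled every `dilation`
    positions around it; with dilation=1 this reduces to plain NA.
    The window is shifted (not zero-padded) at the sequence borders.
    """
    n, dim = q.shape
    half = kernel_size // 2
    span = dilation * (kernel_size - 1)          # total extent of the dilated window
    assert n > span, "sequence must be longer than the dilated window"
    out = np.zeros_like(v)
    for i in range(n):
        # Center the dilated window on i, then shift it to stay inside [0, n).
        start = min(max(i - dilation * half, 0), n - 1 - span)
        idx = start + dilation * np.arange(kernel_size)
        attn = softmax(q[i] @ k[idx].T / np.sqrt(dim))
        out[i] = attn @ v[idx]
    return out

# Tiny usage example with random features.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16)).astype(np.float32)
y = dilated_neighborhood_attention_1d(x, x, x, kernel_size=7, dilation=2)
print(y.shape)  # (32, 16)
```

In the paper's hierarchical model, NA layers (dilation 1) and DiNA layers (dilation > 1) are interleaved, so local detail and sparse global context are captured at the same cost per layer.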

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DiNAT-Mini (UperNet) | Validation mIoU | 47.2 | #157 |
| Semantic Segmentation | ADE20K | DiNAT-L (Mask2Former) | Validation mIoU | 58.1 | #19 |
| Semantic Segmentation | ADE20K | DiNAT_s-Large (UperNet) | Validation mIoU | 54.6 | #53 |
| Semantic Segmentation | ADE20K | DiNAT-Large (UperNet) | Validation mIoU | 54.9 | #47 |
| Semantic Segmentation | ADE20K | DiNAT-Base (UperNet) | Validation mIoU | 50.4 | #106 |
| Semantic Segmentation | ADE20K | DiNAT-Small (UperNet) | Validation mIoU | 49.9 | #116 |
| Semantic Segmentation | ADE20K | DiNAT-Tiny (UperNet) | Validation mIoU | 48.8 | #135 |
| Instance Segmentation | ADE20K val | DiNAT-L (Mask2Former, single-scale) | AP | 35.4 | #8 |
| | | | APS | 16.3 | #4 |
| | | | APM | 39.0 | #5 |
| | | | APL | 55.5 | #4 |
| Panoptic Segmentation | ADE20K val | DiNAT-L (Mask2Former, 640x640) | PQ | 49.4 | #13 |
| | | | AP | 35.0 | #10 |
| | | | mIoU | 56.3 | #11 |
| Semantic Segmentation | ADE20K val | DiNAT-L (Mask2Former) | mIoU | 58.1 | #15 |
| Panoptic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | PQ | 67.2 | #11 |
| | | | mIoU | 83.4 | #8 |
| | | | AP | 44.5 | #8 |
| Instance Segmentation | Cityscapes val | DiNAT-L (single-scale, Mask2Former) | mask AP | 45.1 | #7 |
| | | | AP50 | 72.6 | #3 |
| Semantic Segmentation | Cityscapes val | DiNAT-L (Mask2Former) | mIoU | 84.5 | #14 |
| Instance Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | mask AP | 50.8 | #20 |
| | | | AP50 | 75.0 | #4 |
| Panoptic Segmentation | COCO minival | DiNAT-L (single-scale, Mask2Former) | PQ | 58.5 | #4 |
| | | | PQth | 64.9 | #3 |
| | | | PQst | 48.8 | #2 |
| | | | AP | 49.2 | #4 |
| | | | mIoU | 68.3 | #2 |
| Image Classification | ImageNet | DiNAT_s-Large (224x224; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 86.5% | #136 |
| | | | GFLOPs | 34.5 | #400 |
| Image Classification | ImageNet | DiNAT-Large (11x11 kernel; 384 res; pretrained on ImageNet-22K @ 224) | Top-1 Accuracy | 87.5% | #87 |
| | | | Number of params | 200M | #903 |
| | | | GFLOPs | 92.4 | #444 |
| Image Classification | ImageNet | DiNAT_s-Large (384 res; pretrained on ImageNet-22K @ 224) | Top-1 Accuracy | 87.4% | #94 |
| | | | Number of params | 197M | #899 |
| | | | GFLOPs | 101.5 | #448 |
| Image Classification | ImageNet | DiNAT-Mini | Top-1 Accuracy | 81.8% | #554 |
| | | | Number of params | 20M | #538 |
| | | | GFLOPs | 2.7 | #167 |
| Image Classification | ImageNet | DiNAT-Base | Top-1 Accuracy | 84.4% | #300 |
| | | | Number of params | 90M | #849 |
| | | | GFLOPs | 13.7 | #330 |
| Image Classification | ImageNet | DiNAT-Small | Top-1 Accuracy | 83.8% | #359 |
| | | | Number of params | 51M | #731 |
| | | | GFLOPs | 7.8 | #261 |
| Image Classification | ImageNet | DiNAT-Tiny | Top-1 Accuracy | 82.7% | #466 |
| | | | Number of params | 28M | #631 |
| | | | GFLOPs | 4.3 | #202 |
| Image Classification | ImageNet | DiNAT-Large (384x384; pretrained on ImageNet-22K @ 224x224) | Top-1 Accuracy | 87.4% | #94 |
| | | | GFLOPs | 89.7 | #443 |