Published on Mon May 10 2021

Self-Supervised Learning with Swin Transformers

Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, Han Hu

Abstract

We are witnessing a modeling shift from CNNs to Transformers in computer vision. In this work, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach has essentially no new inventions: it combines MoCo v2 and BYOL, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation, namely 72.8% and 75.0% top-1 accuracy using DeiT-S and Swin-T, respectively, with 300-epoch training. The performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but with much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to a few recent approaches built on ViT/DeiT, which report only linear evaluation results on ImageNet-1K because ViT/DeiT has not been tamed for these dense prediction tasks. We hope our results can facilitate more comprehensive evaluation of self-supervised learning methods designed for Transformer architectures. Our code and models are available at https://github.com/SwinTransformer/Transformer-SSL and will be continually enriched.
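As context for how MoBY combines its two ingredients: MoCo v2 contributes the contrastive (InfoNCE) loss with a queue of negative keys, while BYOL contributes the asymmetric encoders, i.e. an online branch with an extra predictor head and a target branch updated as an exponential moving average of the online weights. The following is a minimal, hypothetical PyTorch sketch of these two pieces; function and variable names (contrastive_loss, momentum_update, queue) are illustrative and not taken from the official repository.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, queue, temperature=0.2):
    """MoCo-v2-style InfoNCE loss: the query q attracts its positive key k
    and repels the negative keys stored in the queue (shape: K x dim)."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(1)           # (N, 1) positive logits
    l_neg = torch.einsum("nc,kc->nk", q, queue.clone().detach())  # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(online, target, m=0.99):
    """BYOL-style exponential-moving-average update of the target (key) encoder."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1.0 - m)
```

Roughly speaking, each image yields two augmented views: the online branch (backbone, projector, predictor) produces queries, the momentum branch (backbone, projector) produces keys, the loss is symmetrized over the two view pairings, and the queue is refreshed with the latest keys; see the paper and repository for the exact formulation.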

Sat Dec 12 2020
Machine Learning
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
Pre-trained weights often boost a wide range of downstream tasks, including classification, detection, and segmentation. Recent studies suggest that pre-training benefits from gigantic model capacity. We examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
Mon Apr 05 2021
Computer Vision
An Empirical Study of Training Self-Supervised Vision Transformers
This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning.
Thu Apr 22 2021
Computer Vision
ImageNet-21K Pretraining for the Masses
ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. ImageNet-21K dataset is used less frequently due to its complexity. This paper aims to close this gap, and make high-quality efficient pretraining available for everyone.
Fri Jun 18 2021
Computer Vision
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation.
Thu Apr 08 2021
Computer Vision
SiT: Self-supervised vIsion Transformer
Self-supervised learning methods are gaining increasing traction in computer vision. This is due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
Thu Jun 11 2020
Computer Vision
Rethinking Pre-training and Self-training
He et al. show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Self-training shows positive improvements of +1.3 to +3.4 AP across all dataset sizes.
Thu Apr 29 2021
Computer Vision
Emerging Properties in Self-Supervised Vision Transformers
The study finds that self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets). It also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs.
Mon Jun 12 2017
NLP
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Wed Dec 23 2020
Computer Vision
Training data-efficient image transformers & distillation through attention
We produce a competitive convolution-free transformer by training on ImageNet only, using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet.
Thu Mar 25 2021
Computer Vision
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
This paper presents a new vision Transformer that can serve as a general-purpose backbone for computer vision. It is compatible with a broad range of vision tasks, including image classification and dense prediction. Its performance surpasses the previous state-of-the-art by a large margin. The code and models will be made publicly available.
Sat Jun 13 2020
Machine Learning
Bootstrap your own latent: A new approach to self-supervised Learning
Bootstrap Your Own Latent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other.
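As a rough illustration of how the online and target branches interact, here is a hedged sketch of BYOL's regression objective, assuming online_prediction comes from the online network's predictor and target_projection from the target network (itself an exponential moving average of the online weights); the names are illustrative, not from the official implementation.

```python
import torch.nn.functional as F

def byol_loss(online_prediction, target_projection):
    """Negative cosine similarity between L2-normalized vectors; the target
    side is detached so only the online network receives gradients."""
    p = F.normalize(online_prediction, dim=1)
    z = F.normalize(target_projection.detach(), dim=1)
    return 2.0 - 2.0 * (p * z).sum(dim=1).mean()
```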
Thu Oct 22 2020
Computer Vision
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks. ViT requires substantially fewer computational resources to train.
Tue Jun 15 2021
Computer Vision
BEiT: BERT Pre-Training of Image Transformers
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
Thu Jun 17 2021
Computer Vision
Efficient Self-supervised Vision Transformers for Representation Learning
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT). The code and models will be publicly available.
Thu Jun 17 2021
Computer Vision
XCiT: Cross-Covariance Image Transformers
The cross-covariance image transformer (XCiT) is built upon cross-covariance attention (XCA). It combines the accuracy of conventional transformers with the scalability of convolutional architectures. XCA has linear complexity and allows efficient processing of high-resolution images.
Thu Jun 10 2021
Computer Vision
MST: Masked Self-Supervised Transformer for Visual Representation
MST can explicitly capture the local context of an image while preserving the global semantic information. It achieves a top-1 accuracy of 76.9% with DeiT-S using only 100-epoch pre-training. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method.
Mon Jun 21 2021
Computer Vision
SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving
SODA10M contains 10 million unlabeled images and 20K images labeled with 6 representative object categories. The data and more up-to-date information have been released at https://soda-2d.io.