This article is part of a collection examining the internal workings of Vision Transformers in depth. Each of these articles is also available as a Jupyter Notebook with executable code. The other articles in the series are:
Table of Contents
As introduced in Attention Is All You Need¹, transformers are a type of machine learning model that uses attention as the primary learning mechanism. Transformers quickly became the state of the art for sequence-to-sequence tasks such as language translation.
An Image is Worth 16x16 Words² successfully modified the transformer put forth in [1] to solve image classification tasks, creating the Vision Transformer (ViT). The ViT is based on the same attention mechanism as the transformer in [1]. However, while transformers for NLP tasks consist of an encoder attention branch and a decoder attention branch, the ViT uses only an encoder. The output of the encoder is then passed to a neural network "head" that makes a prediction.
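To make the encoder-only structure concrete, here is a minimal PyTorch sketch of that flow. The class name, layer choices, and default hyperparameters are illustrative assumptions, not the code from the T2T-ViT repository.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Encoder-only transformer followed by a classification head (sketch)."""

    def __init__(self, token_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        # Stack of standard transformer encoder blocks (no decoder branch)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=token_dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(token_dim)
        self.head = nn.Linear(token_dim, num_classes)  # prediction "head"

    def forward(self, tokens):
        # tokens: (batch, num_tokens, token_dim), with a class token prepended
        for block in self.blocks:
            tokens = block(tokens)
        tokens = self.norm(tokens)
        # Classify from the class token's final representation
        return self.head(tokens[:, 0])
```

The key point the sketch illustrates is that the encoder output goes straight into a small head; there is no decoder or cross-attention as in the original NLP transformer.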
The drawback of the ViT as implemented in [2] is that its optimal performance requires pretraining on large datasets. The best models were pretrained on the proprietary JFT-300M dataset. Models pretrained on the smaller, open-source ImageNet-21k perform on par with the state-of-the-art convolutional ResNet models.
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet³ attempts to remove this pretraining requirement by introducing a novel pre-processing method that transforms an input image into a sequence of tokens. More about this method can be found here. For this article, we will focus on the ViT as implemented in [2].
This article follows the model structure outlined in An Image is Worth 16x16 Words². However, code from this paper is not publicly available. Code from the more recent Tokens-to-Token ViT³ is available on GitHub. The Tokens-to-Token ViT (T2T-ViT) model prepends a Tokens-to-Token (T2T) module to a vanilla ViT backbone. The code in this article is based on the ViT components in the Tokens-to-Token ViT³ GitHub code. Modifications made for this article include, but are not limited to, allowing non-square input images and removing dropout layers.
A diagram of the ViT model is shown below.
Image Tokenization
The first step of the ViT is to create tokens from the input image. Transformers operate on a sequence of tokens; in NLP, this is commonly a sentence of words. For computer vision, it is less clear how to segment the input into tokens.
The ViT converts an image to tokens such that each token represents a local area, or patch, of the image. The authors of [2] describe reshaping an image of height H, width W, and channels C into N tokens with patch size P:
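$$N = \frac{HW}{P^2}$$

Each token is then a flattened patch of length $P^2 \cdot C$. A minimal PyTorch sketch of this reshaping, assuming a channels-first image tensor and square patches, is shown below; the function name is an illustrative assumption rather than the repository's implementation.

```python
import torch

def image_to_tokens(image: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Reshape a (C, H, W) image into (N, P*P*C) tokens, where N = (H/P)*(W/P)."""
    C, H, W = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "H and W must be divisible by P"
    # Slice the image into non-overlapping P x P patches along height and width
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # patches: (C, H/P, W/P, P, P) -> (H/P * W/P, C*P*P)
    tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch_size * patch_size)
    return tokens

tokens = image_to_tokens(torch.randn(3, 224, 224), patch_size=16)
print(tokens.shape)  # torch.Size([196, 768]): N = 224*224 / 16^2 = 196 tokens of length 16*16*3
```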