Vision Transformers (ViTs) are transformer models applied to computer vision tasks, a domain traditionally dominated by convolutional neural networks. Instead of viewing an image as a 2D grid of pixels, a ViT treats it as a sequence of patches and applies self-attention to that sequence.
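As a minimal sketch of the patching step described above (function name and shapes are illustrative, not from any particular library), the code below splits an image into non-overlapping square patches and flattens each one into a token vector, which is what a ViT feeds into its linear projection and self-attention layers:

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. the token sequence a ViT would project into embeddings.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # Carve the grid: (gh, p, gw, p, c) -> (gh, gw, p, p, c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one token vector
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# With the common 224x224 RGB input and 16x16 patches,
# this yields a sequence of 14*14 = 196 tokens of dimension 768.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = image_to_patches(img, 16)
print(tokens.shape)  # (196, 768)
```

The 16×16 patch size here follows the configuration popularized by the original ViT paper; each flattened patch then passes through a learned linear layer before position embeddings and the transformer encoder are applied.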