In this video I implement a Vision Transformer (ViT) based on the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. The video covers (1) building a ViT model in PyTorch, (2) training the model on the MNIST dataset imported from torchvision, and (3) feeding test samples to the transformer and visualizing its attention responses.
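
For reference, here is a minimal sketch of the idea: split each 28x28 MNIST image into 7x7 patches, embed them, run a transformer encoder, and classify from a CLS token. The class name SimpleViT and the hyperparameters (embedding dim 64, 4 heads, 4 layers) are illustrative assumptions, not necessarily the exact values used in the video:

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

# MNIST images are 1x28x28; with a 7x7 patch size we get a 4x4 grid of 16 patches.
IMG_SIZE, PATCH_SIZE, EMBED_DIM, NUM_CLASSES = 28, 7, 64, 10
NUM_PATCHES = (IMG_SIZE // PATCH_SIZE) ** 2  # 16

class SimpleViT(nn.Module):
    def __init__(self):
        super().__init__()
        # A conv with stride == kernel size cuts the image into
        # non-overlapping patches and projects each one to EMBED_DIM.
        self.patch_embed = nn.Conv2d(1, EMBED_DIM, kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, EMBED_DIM))
        self.pos_embed = nn.Parameter(torch.zeros(1, NUM_PATCHES + 1, EMBED_DIM))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(EMBED_DIM, NUM_CLASSES)

    def forward(self, x):                          # x: (B, 1, 28, 28)
        x = self.patch_embed(x)                    # (B, EMBED_DIM, 4, 4)
        x = x.flatten(2).transpose(1, 2)           # (B, 16, EMBED_DIM)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                  # classify from the CLS token

# Training on MNIST, imported from torchvision.
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = SimpleViT()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:                      # one pass over the training data
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```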
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
- notes+code: https://mashaan14.github.io/YouTube-channel/vision_transformers/2023_11_29_VisionTransformer_MNIST
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
- website https://mashaan14.github.io/mashaan/
- github https://github.com/mashaan14
- X https://twitter.com/mashaan_14
- linkedin https://linkedin.com/in/mashaan
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Chapters:
0:00 start
0:15 acknowledgement
0:47 importing datasets
2:53 model parameters
3:57 converting an image to patches
5:23 class AttentionBlock
6:11 class VisionTransformer
7:32 model printout
8:15 training loop
10:21 inference
10:39 attention map for a test sample
14:43 plotting the attention map
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
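The last two chapters visualize what the CLS token attends to. As a standalone sketch of that step (not the video's exact code), the snippet below computes one attention head by hand, using random stand-in tokens and identity Q/K projections for simplicity, and plots the CLS token's attention over the 4x4 patch grid:

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for the embedded patch sequence of one test image: CLS token + 16
# patch tokens of dim 64. In practice, take these from the trained model.
tokens = torch.randn(1, 17, 64)

# One attention head computed by hand so the weights are easy to inspect
# (identity projections for Q and K, purely for illustration).
d = tokens.size(-1)
q = k = tokens
scores = q @ k.transpose(-2, -1) / d ** 0.5   # (1, 17, 17)
attn = scores.softmax(dim=-1)

# Row 0 is the CLS token; columns 1..16 are its attention to the 16 patches.
cls_attn = attn[0, 0, 1:].reshape(4, 4)

plt.imshow(cls_attn.detach().numpy(), cmap="viridis")
plt.title("CLS-token attention over the 4x4 patch grid")
plt.colorbar()
plt.show()
```
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬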
References:
- Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#attention #transformers #VisionTransformer #imageclassification #mnist #computervision #pytorch #DeepLearningTutorial #MachineLearningProject #AIResearch #CodingTutorial