Introduction to Vision Transformer (ViT) | An image is worth 16×16 words | Computer Vision Series
What do CNNs, GPT-2, and Vision Transformers have in common?
In this deep, visual, and intuitive lecture, we take you step-by-step from convolutional filters all the way to multi-head attention, and show you how Transformers — originally made for text — are now redefining how we process images.
We start with CNNs:
Why did they dominate for years? How do concepts like locality, weight sharing, and translation invariance actually work?
And why do these same properties cause problems — like when a car on top of a building still gets detected confidently as “just a car”?
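As a tiny preview (this is not from the lecture, and the layer sizes are arbitrary choices made only for illustration), here is a minimal PyTorch sketch of weight sharing and translation equivariance: a single 3×3 kernel is reused at every position of the image, and shifting the input simply shifts the feature map.

```python
import torch
import torch.nn as nn

# Weight sharing + locality: one 3x3 kernel (9 weights + 1 bias) is reused at
# every spatial position, so the parameter count does not grow with image size.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 10

# A fully connected layer on the same 32x32 image gives every pixel its own weights.
fc = nn.Linear(32 * 32, 32 * 32)
print(sum(p.numel() for p in fc.parameters()))     # 1,049,600

# Translation equivariance: shift the input and the feature map shifts with it.
x = torch.zeros(1, 1, 32, 32)
x[0, 0, 10, 10] = 1.0                              # a single bright pixel
x_shifted = torch.roll(x, shifts=(5, 5), dims=(2, 3))
y, y_shifted = conv(x), conv(x_shifted)
print(torch.allclose(torch.roll(y, shifts=(5, 5), dims=(2, 3)), y_shifted))  # True
```

That reuse is exactly why a CNN will happily report "car" wherever the car appears, even on top of a building where the context should matter.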
You’ll see how small receptive fields, inductive bias, and vanishing gradients made CNNs both powerful and limited. Then we shift to transformers — and this is where things get really interesting.
You’ll learn:
What the GPT-2 architecture looks like internally
How a sentence becomes tokens, each mapped to a 768-dimensional embedding
How position embedding helps transformers understand word order
What query, key, and value vectors do — and how they create the attention matrix
Why we need multi-head attention (12 heads in GPT-2)
What happens inside a transformer block, including Add & Norm, residual connections, and MLP layers
How attention prevents peeking into the future using triangular masking (see the short attention sketch after this list)
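If you like to see the shapes, here is a minimal, self-contained sketch of the causal multi-head attention described above. It assumes GPT-2 small's sizes (768-dimensional embeddings, 12 heads of 64 dimensions each) but uses random weights and a toy 8-token sequence; the final output projection and the Add & Norm / MLP parts of the block are omitted.

```python
import torch
import torch.nn.functional as F

d_model, n_heads = 768, 12              # GPT-2 small: 12 heads of size 64
d_head = d_model // n_heads
T = 8                                   # a toy sequence of 8 tokens

# Stand-in for token embeddings + position embeddings: each token is a 768-d vector.
x = torch.randn(T, d_model)

# Learned projections give every token a query, key, and value vector (random here).
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q = (x @ W_q).view(T, n_heads, d_head).transpose(0, 1)   # (12, 8, 64)
K = (x @ W_k).view(T, n_heads, d_head).transpose(0, 1)
V = (x @ W_v).view(T, n_heads, d_head).transpose(0, 1)

# Attention matrix: how much each token attends to every other token, per head.
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5         # (12, 8, 8)

# Triangular (causal) mask: token i may only look at tokens 0..i,
# which is how GPT-2 avoids peeking into the future.
causal = torch.tril(torch.ones(T, T)).bool()
scores = scores.masked_fill(~causal, float("-inf"))

weights = F.softmax(scores, dim=-1)                       # each row sums to 1
out = (weights @ V).transpose(0, 1).reshape(T, d_model)   # concatenate the 12 heads
print(out.shape)                                          # torch.Size([8, 768])
# (The real block also applies an output projection, Add & Norm, and an MLP.)
```

The Poloclub link at the bottom lets you explore the same attention matrices interactively.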
Then we connect all of this to Vision Transformers (ViT) — introduced in the paper “An Image is Worth 16×16 Words” by Google Brain.
🧠 You’ll understand:
How an image is split into patches (e.g., 16×16 pixels)
How each patch is flattened and projected to a 768-dimensional vector (see the patch-embedding sketch after this list)
Why we add a CLS token at the start
How position embeddings are added to patch embeddings
How ViT uses transformer blocks (with the exact same components as text models)
Why only the CLS token is used for final image classification
How to modify the last layer for 5-class or 1000-class problems
Why ViT generalizes better than CNNs when trained on large datasets
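And here is a rough sketch of the ViT front end, assuming the usual ViT-Base settings (224×224 input, 16×16 patches, 768-dimensional embeddings). PyTorch's stock TransformerEncoder stands in for the ViT blocks (the real model uses 12 pre-norm blocks with a 3072-wide MLP), and the last lines show how only the CLS position feeds a classification head that you resize for 5 or 1,000 classes.

```python
import torch
import torch.nn as nn

img_size, patch, d_model = 224, 16, 768
n_patches = (img_size // patch) ** 2          # 14 * 14 = 196
num_classes = 5                               # swap to 1000 for an ImageNet-style problem

image = torch.randn(1, 3, img_size, img_size)

# 1. Split into 16x16 patches and flatten each one: 196 vectors of length 16*16*3 = 768.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)  # (1, 196, 768)

# 2. Learned linear projection into the embedding space.
proj = nn.Linear(patch * patch * 3, d_model)
tokens = proj(patches)                                                 # (1, 196, 768)

# 3. Prepend a learnable CLS token and add learnable position embeddings
#    (zeros here just to keep the sketch short; both are trained in a real model).
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed             # (1, 197, 768)

# 4. Standard transformer blocks (PyTorch's stock encoder as a stand-in; ViT-Base
#    actually uses 12 pre-norm blocks), then classify from the CLS position only.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=2)
head = nn.Linear(d_model, num_classes)
logits = head(encoder(tokens)[:, 0])                                   # only token 0 (CLS)
print(logits.shape)                                                    # torch.Size([1, 5])
```

Changing num_classes is the only edit needed to move between a 5-class and a 1000-class problem.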
Throughout the lecture, we also refer to visualizations from poloclub.github.io, where you can interactively explore how transformer attention works.
📌 If you're a:
Machine learning student or researcher new to vision transformers
Developer trying to move from CNNs to ViTs
Curious mind trying to understand what really happens inside models like GPT and ViT
— then this lecture is for you.
The lecture itself contains no code today, just ideas, diagrams, equations, and clarity; the short snippets in this description are only small previews.
In the next lecture, we’ll implement Vision Transformers from scratch and compare them with CNNs on the same dataset.
Poloclub interactive transformer: https://poloclub.github.io/transformer-explainer/