THE FUTURE IS HERE

Introduction to Vision Transformer (ViT) | An image is worth 16×16 words | Computer Vision Series

What do CNNs, GPT-2, and Vision Transformers have in common?

In this deep, visual, and intuitive lecture, we take you step-by-step from convolutional filters all the way to multi-head attention, and show you how Transformers — originally made for text — are now redefining how we process images.

We start with CNNs:
Why did they dominate for years? How do concepts like locality, weight sharing, and translation invariance actually work?
And why do these same properties cause problems — like when a car on top of a building still gets detected confidently as “just a car”?
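The lecture itself stays diagram-only, but as a quick preview here is a minimal sketch of the two biases just mentioned (PyTorch is an assumption on our side, not something used in the lecture): locality, because a 3×3 kernel only sees a small neighborhood, and weight sharing, because the same kernel slides over every position, which makes the response translation-equivariant.

# A minimal sketch (assuming PyTorch) of locality and weight sharing in a conv layer.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                                     # a single "feature" at (2, 2)
x_shifted = torch.roll(x, shifts=(3, 3), dims=(2, 3))   # the same feature, moved to (5, 5)

y = conv(x)
y_shifted = conv(x_shifted)

# Because the kernel weights are shared across positions, shifting the input
# simply shifts the output: the response pattern is identical, just relocated.
print(torch.allclose(torch.roll(y, shifts=(3, 3), dims=(2, 3)), y_shifted))  # True

This is exactly why the car-on-a-building example goes wrong: the shared filter fires on the local car pattern wherever it appears, with no notion of the surrounding context.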

You’ll see how small receptive fields and built-in inductive biases gave CNNs their power, and how those same assumptions, together with vanishing gradients in deep networks, also limited them. Then we shift to transformers, and this is where things get really interesting.

You’ll learn:

What the GPT-2 architecture looks like internally

How a sentence becomes tokens, each mapped to a 768-dimensional embedding

How position embeddings help transformers understand word order

What query, key, and value vectors do — and how they create the attention matrix

Why we need multi-head attention (12 heads in GPT-2)

What happens inside a transformer block, including Add & Norm, residual connections, and MLP layers

How attention prevents peeking into the future using triangular masking (a short sketch after this list pulls these pieces together)
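To make those bullet points concrete, here is a rough, self-contained sketch, again assuming PyTorch: the 768-dimensional width and 12 heads follow GPT-2 small, and the token ids below are made up for illustration rather than produced by a real tokenizer.

# A minimal sketch (assuming PyTorch; sizes follow GPT-2 small: d_model=768, 12 heads)
# of token + position embeddings, Q/K/V projections, the attention matrix,
# and the triangular (causal) mask that blocks attention to future tokens.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, vocab_size, max_len = 768, 12, 50257, 1024
d_head = d_model // n_heads                      # 64 dimensions per head

tok_emb = nn.Embedding(vocab_size, d_model)      # token id -> 768-dim vector
pos_emb = nn.Embedding(max_len, d_model)         # position -> 768-dim vector
qkv_proj = nn.Linear(d_model, 3 * d_model)       # joint Q, K, V projection
out_proj = nn.Linear(d_model, d_model)

tokens = torch.tensor([[464, 2068, 7586, 21831, 318]])   # hypothetical token ids
B, T = tokens.shape

x = tok_emb(tokens) + pos_emb(torch.arange(T))            # (B, T, 768)

q, k, v = qkv_proj(x).split(d_model, dim=-1)              # each (B, T, 768)
# Reshape into 12 heads of 64 dims each: (B, heads, T, d_head)
q = q.view(B, T, n_heads, d_head).transpose(1, 2)
k = k.view(B, T, n_heads, d_head).transpose(1, 2)
v = v.view(B, T, n_heads, d_head).transpose(1, 2)

# Attention matrix: query-key similarities, shape (B, heads, T, T)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)

# Triangular mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float('-inf'))

weights = F.softmax(scores, dim=-1)                       # each row sums to 1
context = weights @ v                                     # (B, heads, T, d_head)
context = context.transpose(1, 2).reshape(B, T, d_model)  # concatenate the 12 heads
y = out_proj(context)                                     # back to (B, T, 768)
print(y.shape)                                            # torch.Size([1, 5, 768])

A full transformer block would wrap this attention step with the Add & Norm, residual connections, and MLP layers from the list above; the sketch only covers the attention path.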

Then we connect all of this to Vision Transformers (ViT) — introduced in the paper “An Image is Worth 16×16 Words” by Google Brain.

🧠 You’ll understand:

How an image is split into patches (e.g., 16×16)

How each patch is flattened and turned into a 768-dimensional vector

Why we add a CLS token at the start

How position embeddings are added to patch embeddings

How ViT uses transformer blocks (with the exact same components as text models)

Why only the CLS token is used for final image classification

How to modify the last layer for 5-class or 1000-class problems

Why ViT generalizes better than CNNs when trained on large datasets (a short sketch of the patch pipeline follows this list)
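As a preview of the ViT front end described above, here is a minimal sketch, again assuming PyTorch; the 224×224 input, 16×16 patches, and 768-dimensional embeddings follow ViT-Base, and the transformer blocks themselves are left out since they are identical to the text case.

# A minimal sketch (assuming PyTorch): split a 224x224 image into 16x16 patches,
# flatten and project each patch to 768 dims, prepend a learnable CLS token,
# add position embeddings, and classify from the CLS token only.
import torch
import torch.nn as nn

img_size, patch_size, d_model, num_classes = 224, 16, 768, 5   # 5-class example
num_patches = (img_size // patch_size) ** 2                    # 14 * 14 = 196
patch_dim = 3 * patch_size * patch_size                        # 3 * 16 * 16 = 768

patch_proj = nn.Linear(patch_dim, d_model)                     # flattened patch -> 768
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))           # learnable [CLS]
pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
head = nn.Linear(d_model, num_classes)            # set num_classes to 1000 for ImageNet

x = torch.randn(1, 3, img_size, img_size)                      # one RGB image

# Cut into non-overlapping 16x16 patches and flatten each one
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, patch_dim)

tokens = patch_proj(patches)                                   # (1, 196, 768)
tokens = torch.cat([cls_token, tokens], dim=1)                 # (1, 197, 768)
tokens = tokens + pos_emb                                      # add position embeddings

# ... transformer encoder blocks would run here, exactly as in the text model ...

logits = head(tokens[:, 0])      # classification uses only the CLS token
print(logits.shape)              # torch.Size([1, 5])

Swapping num_classes between 5 and 1000 is the whole "modify the last layer" step from the list: only the final linear head changes, everything before it stays the same.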

Throughout the lecture, we also refer to visualizations from poloclub.github.io, where you can interactively explore how transformer attention works.

📌 If you’re a:

Machine learning student or researcher new to vision transformers

Developer trying to move from CNNs to ViTs

Curious mind trying to understand what really happens inside models like GPT and ViT
— then this lecture is for you.

No code in the lecture itself, just ideas, diagrams, equations, and clarity; the short sketches above are only quick previews.
In the next lecture, we’ll implement Vision Transformers from scratch and compare them with CNNs on the same dataset.

Poloclub interactive transformer: https://poloclub.github.io/transformer-explainer/