1. Vision Transformer (ViT) Process
See how a Vision Transformer "reads" an image. It does not take in the picture as a single block of pixels; it splits the image into a grid of square patches and converts each patch into a vector.
[Figure: a 224x224-pixel image divided into 16 patches, numbered #1 to #16.]
Each patch is flattened into a vector (list of numbers) and fed into the model.
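
The patch-and-flatten step can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular ViT library implements it. The 224x224 image size and the 16 patches come from the text above; the 4x4 grid layout (56x56-pixel patches), the three RGB channels, the row-major patch order, and the helper names `make_dummy_image` and `image_to_patch_vectors` are assumptions made for the example.

```python
# Sketch only: IMAGE_SIZE comes from the text; GRID and CHANNELS are assumptions.
IMAGE_SIZE = 224                  # pixels per side, as stated above
GRID = 4                          # assumed 4x4 layout for the 16 patches
PATCH_SIZE = IMAGE_SIZE // GRID   # 56 pixels per side under that assumption
CHANNELS = 3                      # assumed RGB image


def make_dummy_image(size=IMAGE_SIZE, channels=CHANNELS):
    """Build a placeholder image as nested lists: image[row][col][channel]."""
    return [[[(r + c + ch) % 256 for ch in range(channels)]
             for c in range(size)]
            for r in range(size)]


def image_to_patch_vectors(image, patch_size=PATCH_SIZE):
    """Split the image into square patches (row-major) and flatten each one."""
    height, width = len(image), len(image[0])
    vectors = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vector = []
            # Walk the patch row by row and append every channel value.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    vector.extend(pixel)
            vectors.append(vector)
    return vectors


patch_vectors = image_to_patch_vectors(make_dummy_image())
print(len(patch_vectors), len(patch_vectors[0]))  # prints "16 9408"
```

Under these assumptions each patch becomes a list of 56 x 56 x 3 = 9408 numbers, and the image as a whole becomes a sequence of 16 such vectors, one per numbered patch.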