Multi-Modal Vision

How AI models 'see' and interpret visual information.

1. Vision Transformer (ViT) Process

Here is how an AI "reads" an image. It doesn't take in the whole picture as a single input; it splits the image into a grid of square patches and converts each patch into a vector.

Input image: 224x224 pixels, divided into 16 patches (#1 to #16).

Visual Token Stream

Each patch is flattened into a vector (list of numbers) and fed into the model.
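The flattening step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind any particular model: the function name `image_to_patch_vectors`, the 4x4 grid layout (16 patches of 56x56 pixels), the three color channels, and the random stand-in image are all assumptions made for the example.

```python
import numpy as np


def image_to_patch_vectors(image, grid=4):
    """Split an (H, W, C) image into grid x grid patches and flatten each one.

    Hypothetical helper for illustration; the 4x4 grid is an assumption.
    """
    h, w, c = image.shape
    assert h % grid == 0 and w % grid == 0, "image must divide evenly into patches"
    ph, pw = h // grid, w // grid  # patch height and width in pixels

    # Split rows and columns into (grid, patch) pairs, then bring the two
    # grid axes to the front: (grid, grid, ph, pw, c).
    patches = image.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)

    # One row per patch, in left-to-right, top-to-bottom order.
    return patches.reshape(grid * grid, ph * pw * c)


# Stand-in for a real 224x224 RGB image (random values, assumed 3 channels).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

tokens = image_to_patch_vectors(image)
print(tokens.shape)  # (16, 9408): 16 patches, each 56 * 56 * 3 numbers
```

In a full Vision Transformer, each of these flattened vectors is typically passed through a learned linear projection and combined with a position embedding before it reaches the model's attention layers; that step is omitted here.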
