Learn AI

AI Concepts Workshop

© 2026 Cloudy Software Ltd

Multi-Modal Vision

How AI models 'see' and interpret visual information.

1. Vision Transformer (ViT) Process

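Watch how an AI "reads" an image. It doesn't take in the whole picture as a single unit; the image is split into a grid of square patches, and each patch is converted into a vector. The animation visits the patches one at a time for illustration, whereas the model itself processes all of them together.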

[Interactive demo: a 224x224-pixel image split into 16 numbered patches, #1 to #16.]

Each patch is flattened into a vector (list of numbers) and fed into the model.
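
The flattening step can be sketched in a few lines of NumPy. This is a minimal illustration, not the workshop's own code: it assumes a 3-channel (RGB) image and a 4x4 grid of 56x56-pixel patches, which is inferred from the 16 patches shown on a 224x224 image. The function name `image_to_patch_vectors` is hypothetical.

```python
import numpy as np

def image_to_patch_vectors(image, patch_size):
    """Split an (H, W, C) image into square patches and flatten each one.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered left to right, top to bottom.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Carve the height and width axes into (grid index, pixel-within-patch).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two grid axes to the front: (rows, cols, p, p, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flat vector per patch.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed setup: random pixel values standing in for a real 224x224 RGB image.
image = np.random.default_rng(0).random((224, 224, 3))
vectors = image_to_patch_vectors(image, patch_size=56)
print(vectors.shape)  # (16, 9408): 16 patches, 56 * 56 * 3 numbers each
```

In an actual Vision Transformer, each flattened patch is then passed through a learned linear projection and combined with a position embedding before it reaches the transformer layers; the sketch stops at the raw flattened vectors described above.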
