Programming Practical 2: ViT, CLIP, VLM¶

This practical session explores the key building blocks behind modern vision-language models through three progressive notebooks. Each one builds on the concepts introduced in the previous one.

1. Vision Transformer (ViT)¶

We start with the Vision Transformer, the architecture that brought the transformer paradigm from NLP to computer vision. In this notebook you will load a pretrained ViT, visualize how an image is split into patch tokens, inspect attention maps, and run image classification inference.

Notebook:

Solution:

2. CLIP¶

A vanilla ViT can only classify images into a fixed set of categories. CLIP (Contrastive Language-Image Pretraining) overcomes this limitation by learning a shared embedding space for images and text. In this notebook you will see how CLIP aligns visual and textual representations, enabling zero-shot classification, image-text retrieval, and open-vocabulary recognition.

Notebook:

Solution:

3. Vision-Language Model (VLM)¶

With a vision encoder (ViT) and a way to bridge images and text (CLIP), the final step is to build a Vision-Language Model that can actually reason about images in natural language. In this notebook you will dissect a LLaVA-style VLM (SigLIP + MLP projector + Qwen2) and understand how each component contributes to the full pipeline, from pixels to conversational answers.