- CLIP — the classic image-text alignment model that uses contrastive learning to embed images and captions into a shared space.
- LLaVA — connects a frozen CLIP ViT vision encoder to a large language model (Vicuna, Llama 2) through a learned projection, enabling instruction following, dialogue, and rich vision-language reasoning.
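The contrastive objective behind CLIP can be sketched in a few lines of NumPy: each image in a batch is pulled toward its own caption and pushed away from every other caption, and symmetrically for captions. This is a minimal illustrative sketch of the symmetric InfoNCE loss over precomputed embeddings, not CLIP's actual implementation; the function name and the batch of random embeddings are hypothetical.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, sharpened by the temperature.
    logits = img_emb @ txt_emb.T / temperature

    def xent(l):
        # Cross-entropy with the matched pair (the diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs (identical embeddings) yield a much lower loss than mismatched ones, which is exactly the signal that shapes the shared embedding space.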
In this chapter
CLIP
Contrastive Language-Image Pretraining — the foundation of modern vision-language models.
CLIP Zero-Shot Classification
Hands-on notebook: zero-shot classification as a linear classifier whose weights come from text prompts.
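The "linear classifier whose weights come from text prompts" idea reduces to a small computation once the embeddings exist: stack the normalized per-class text embeddings (e.g. from prompts like "a photo of a {class}") into a weight matrix, take its product with the normalized image embedding, and softmax. A minimal NumPy sketch, assuming precomputed embeddings; the function name and temperature value are illustrative, not from the notebook:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Zero-shot classification: normalized text embeddings act as the
    weight matrix of a linear classifier over the normalized image feature."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    W = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    logits = W @ image_emb / temperature  # one cosine-similarity logit per class
    probs = np.exp(logits - logits.max())  # stable softmax
    return probs / probs.sum()
```

Swapping in a new label set only requires embedding new prompts; no retraining touches the image encoder, which is what makes the classifier "zero-shot".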
LLaVA
Large Language and Vision Assistant — instruction-tuned multimodal dialogue.
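The glue between LLaVA's two halves is a small trainable projection that maps vision-encoder patch features into the LLM's token-embedding space, so image patches can be prepended to the text tokens and attended over like ordinary tokens. The sketch below illustrates only that data flow with random stand-in values; all dimensions are hypothetical, and the single linear projection matches LLaVA 1.0 (later versions use a small MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not the real model's sizes).
n_patches, vision_dim, llm_dim, n_text_tokens = 16, 32, 64, 5

# Patch features from a frozen vision encoder (random stand-ins here).
patch_features = rng.normal(size=(n_patches, vision_dim))

# The trainable projection into the LLM's embedding space.
W_proj = rng.normal(size=(vision_dim, llm_dim)) * 0.02
image_tokens = patch_features @ W_proj

# Prepend the projected image tokens to the text token embeddings:
# the LLM then attends over image and text tokens alike.
text_tokens = rng.normal(size=(n_text_tokens, llm_dim))
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (21, 64)
```

Keeping the vision encoder frozen and training only the projection (and later the LLM) is what makes LLaVA's instruction tuning comparatively cheap.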

