This chapter covers multimodal AI systems that combine vision and language understanding. You will work through two architectures, each addressing limitations of the previous:
  • CLIP — the classic image-text alignment model that uses contrastive learning to embed images and captions into a shared space.
  • LLaVA — directly combines a vision encoder (CLIP ViT) with a large language model (Llama 2, Vicuna) for instruction-following, dialogue, and rich vision-language reasoning.
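The contrastive learning mentioned above can be sketched in a few lines. This is a minimal NumPy version of CLIP's symmetric InfoNCE objective: images and captions are embedded, normalized, and matched so that each image's highest-scoring caption is its own. The temperature 0.07 matches CLIP's reported initialization; batch size and embedding dimension are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_emb, text_emb: (N, D) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    diag = np.arange(len(logits))                  # matching pairs on the diagonal

    # Cross-entropy in both directions: image->text (rows) and text->image (cols)
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2
```

When matching pairs are already perfectly aligned and non-matching pairs are orthogonal, the loss approaches zero; for random embeddings it sits near log N.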

In this chapter

CLIP

Contrastive Language-Image Pretraining — the foundation of modern vision-language models.

CLIP Zero-Shot Classification

Hands-on notebook: zero-shot classification as a linear classifier whose weights come from text prompts.
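The notebook's central idea, a linear classifier whose weight matrix is built from text embeddings, can be sketched as follows. The feature vectors here are toy placeholders; a real pipeline would obtain them from CLIP's text encoder (over prompts like "a photo of a {label}") and image encoder.

```python
import numpy as np

def zero_shot_classify(image_feature, class_text_features, class_names):
    """Pick the class whose text embedding is most similar to the image.

    class_text_features: (C, D) array, one text embedding per class name.
    The normalized text embeddings act as the rows of a linear classifier's
    weight matrix; no training on the target classes is needed.
    """
    w = class_text_features / np.linalg.norm(
        class_text_features, axis=1, keepdims=True
    )
    x = image_feature / np.linalg.norm(image_feature)
    logits = w @ x  # one cosine similarity per class
    return class_names[int(np.argmax(logits))]

# Toy usage: two classes with hand-made 2-D "embeddings"
names = ["cat", "dog"]
text_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
print(zero_shot_classify(np.array([0.9, 0.1]), text_feats, names))  # cat
```

Swapping in new classes only requires encoding new prompts, which is what makes the classifier "zero-shot".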

LLaVA

Large Language and Vision Assistant — instruction-tuned multimodal dialogue.
Key references: (Lu et al., 2016; Johnson et al., 2016; Xu et al., 2015; Vinyals et al., 2016)
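LLaVA's wiring can be sketched with toy arrays: patch features from a frozen CLIP ViT are projected into the LLM's token-embedding space and prepended to the text tokens. The dimensions below mirror CLIP ViT-L/14 at 336 px feeding a 7B LLM, but `W_proj` is a random stand-in for the trained projection (the original LLaVA uses a single linear layer; LLaVA-1.5 uses a small MLP).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 576 vision patch tokens of dim 1024,
# projected into a 4096-dim LLM token space
num_patches, vision_dim, llm_dim = 576, 1024, 4096

vision_feats = rng.normal(size=(num_patches, vision_dim))  # frozen CLIP ViT output
W_proj = rng.normal(size=(vision_dim, llm_dim)) * 0.02     # trainable projection (stand-in)

image_tokens = vision_feats @ W_proj                       # (576, 4096)
text_tokens = rng.normal(size=(32, llm_dim))               # embedded instruction tokens

# The LLM consumes the concatenation: image tokens act as a visual
# prefix that the language model attends to while generating its reply
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 4096)
```

Only the projection (and, during fine-tuning, the LLM) is trained; the vision encoder stays frozen, which keeps the recipe cheap relative to training a multimodal model from scratch.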

References

  • Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., et al. (2016). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.
  • Lu, J., Xiong, C., Parikh, D., Socher, R. (2016). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
  • Vinyals, O., Toshev, A., Bengio, S., Erhan, D. (2016). Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.