LLaVA (Large Language and Vision Assistant) is a Visual Language Model (VLM) that answers questions and follows instructions grounded in images. It extends a pre-trained large language model with visual instruction tuning: a pre-trained vision encoder is connected to the language model, and the combined model is fine-tuned on multimodal instruction-following data so that it can interpret images and give contextually relevant answers. This makes LLaVA well suited to tasks such as image captioning, visual question answering, and other applications that require joint understanding of visual and textual information.
Visual Instruction Tuning — LLaVA

Visual Instruction Tuning (LLaVA)

Liu et al. — arXiv:2304.08485 (PDF)

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

Liu et al. — arXiv:2310.03744 — updated version of LLaVA with a higher-resolution CLIP vision encoder and an MLP vision–language connector

LLaVA Notebook

Open in Google Colab

LLaVA — Hugging Face Transformers

Model documentation and API reference on Hugging Face (a minimal inference sketch follows at the end of this list)

Key references: (Xu et al., 2015; Johnson et al., 2016; Lu et al., 2016; Vinyals et al., 2016; Anderson et al., 2017)
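
To make the Hugging Face Transformers integration linked above concrete, here is a minimal inference sketch. It assumes the llava-hf/llava-1.5-7b-hf checkpoint and a local image file; both are placeholders, and loading in float16 with device_map="auto" additionally requires the accelerate package.

# Minimal LLaVA inference sketch with Hugging Face Transformers.
# The checkpoint and image path below are assumptions, not fixed by this page.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("skateboard.jpg")  # placeholder image path
# LLaVA-1.5 prompt format: the <image> token marks where the image features are inserted.
prompt = "USER: <image>\nWhat is the man doing? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))

The same processor and model classes are reused for fine-tuning in the section below; only the data preparation changes.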

LLaVA fine-tuning

The original LLaVA paper introduced the LLaVA-Instruct-150K dataset to fine-tune the model on visual instruction following. It contains approximately 150K GPT-4-generated instruction–answer pairs grounded in COCO images, formatted as multi-turn conversations. Each sample pairs a COCO image with a conversation between a human and a GPT assistant:
{
  "image": "COCO_train2014_000000000009.jpg",
  "conversations": [
    {"from": "human", "value": "What is the man doing?"},
    {"from": "gpt", "value": "He is riding a skateboard."}
  ]
}
Dataset properties

Property    Value
Size        ~150K samples
Source      GPT-4-generated instructions and answers
Format      Multi-turn conversational
Images      COCO train2014

This dataset can be used with the Hugging Face Transformers LLaVA processor and model classes for supervised fine-tuning; a minimal data-preparation sketch follows the dataset link below.

LLaVA-Instruct-150K

Dataset on Hugging Face — liuhaotian/LLaVA-Instruct-150K
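
As referenced above, the following is a minimal sketch of turning one LLaVA-Instruct-150K record into supervised fine-tuning inputs. It assumes llava_instruct_150k.json and the COCO train2014 images have been downloaded locally and uses the llava-hf/llava-1.5-7b-hf processor; the file paths and checkpoint are placeholders, and a full training recipe would add prompt-token masking, batching, and a Trainer or optimizer loop.

# Sketch: convert one LLaVA-Instruct-150K record into fine-tuning inputs.
# Paths and the checkpoint name are assumptions; adjust to your local setup.
import json
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

with open("llava_instruct_150k.json") as f:
    records = json.load(f)

sample = records[0]
image = Image.open(f"train2014/{sample['image']}")  # COCO train2014 image directory

# Flatten the multi-turn conversation into a LLaVA-1.5 style chat string.
# The processor expects a single <image> token marking where image features go,
# so any placeholder already present in the raw text is stripped and re-inserted once.
text = ""
first_user_turn = True
for turn in sample["conversations"]:
    content = turn["value"].replace("<image>", "").strip()
    if turn["from"] == "human":
        if first_user_turn:
            content = "<image>\n" + content
            first_user_turn = False
        text += f"USER: {content} ASSISTANT:"
    else:
        text += f" {content} "

inputs = processor(images=image, text=text.strip(), return_tensors="pt")
# For causal-LM style supervised fine-tuning, labels mirror the input ids
# (a real recipe would mask the user/prompt tokens so loss is only on answers).
inputs["labels"] = inputs["input_ids"].clone()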

References

  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., et al. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
  • Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., et al. (2016). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.
  • Lu, J., Xiong, C., Parikh, D., Socher, R. (2016). Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning.
  • Vinyals, O., Toshev, A., Bengio, S., Erhan, D. (2016). Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge.
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.