LLaVA fine-tuning
The original LLaVA paper introduced the LLaVA-Instruct-150K dataset to fine-tune the model on visual instruction following. It contains approximately 150K GPT-4-generated instruction–answer pairs grounded in COCO images, formatted as multi-turn conversations in which a human asks questions about an image and a GPT assistant answers:

| Property | Value |
|---|---|
| Size | ~150K samples |
| Source | GPT-4 generated instructions and answers |
| Format | Multi-turn conversational |
| Images | COCO train2014 |
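
To make the conversational format concrete, here is a minimal sketch that loads the released JSON file and prints one sample. It assumes the file has already been downloaded locally (the `llava_instruct_150k.json` filename, e.g. from the liuhaotian/LLaVA-Instruct-150K dataset repository on the Hugging Face Hub) and that each record follows the published schema with an `image` field and a `conversations` list of `{"from", "value"}` turns; adjust the path and keys if your copy differs.

```python
import json
from collections import Counter

# Assumed local path to the downloaded dataset file (filename may differ).
DATA_PATH = "llava_instruct_150k.json"

with open(DATA_PATH, "r", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dicts, one per training sample

print(f"Loaded {len(samples)} samples")

# Inspect one sample: each entry references a COCO image and holds a list of
# conversation turns alternating between the human and the GPT assistant.
sample = samples[0]
print("image file:", sample["image"])
for turn in sample["conversations"]:
    speaker = turn["from"]   # "human" or "gpt"
    text = turn["value"]     # "<image>" marks where the image tokens are spliced in
    print(f"[{speaker}] {text[:80]}")

# Rough distribution of conversation lengths (in turns) across the dataset.
turn_counts = Counter(len(s["conversations"]) for s in samples)
print("turns per sample:", dict(sorted(turn_counts.items())[:5]))
```

During fine-tuning, each turn marked `human` becomes part of the prompt (with the image inserted at the `<image>` placeholder) and each `gpt` turn provides the target tokens the model is trained to generate.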


