# LLaVA

## Overview

From-scratch replication of LLaVA (Large Language and Vision Assistant). LLaVA connects a vision encoder to a language model decoder via a trainable projection layer, enabling visual instruction following. The projection maps image patch embeddings into the LLM’s token embedding space. Based on Visual Instruction Tuning (Liu et al., 2023).
## Architecture

- Vision encoder: CLIP ViT (image patch embeddings)
- Projection: Linear layer mapping image embeddings → LLM input space
- Language decoder: LLM (instruction-following)
- Two-stage training: (1) train the projection layer with the vision encoder and LLM frozen; (2) fine-tune the full model (sketched below)
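
The wiring and the stage split can be made concrete with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the repo's actual code: the class and attribute names (`LlavaStyleModel`, `projection`), the embedding dimensions, and the patch count are all illustrative.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Minimal skeleton: frozen vision encoder -> linear projection -> LLM decoder."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Stand-ins for the pretrained parts; in the real model these would be
        # a CLIP ViT and a pretrained causal LM (attribute names are
        # assumptions, not this repo's actual code).
        self.vision_encoder = nn.Identity()
        self.llm = nn.Identity()
        # The trainable bridge: maps patch embeddings into token space.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds, text_embeds):
        # patch_embeds: (batch, num_patches, vision_dim) from the ViT
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM embedding table
        feats = self.vision_encoder(patch_embeds)   # frozen in stage 1
        image_tokens = self.projection(feats)       # -> (batch, num_patches, llm_dim)
        # Prepend the projected image tokens to the text tokens and decode.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))

def set_stage(model: LlavaStyleModel, stage: int) -> None:
    """Stage 1: only the projection trains. Stage 2: fine-tune everything."""
    for p in model.parameters():
        p.requires_grad = (stage == 2)
    for p in model.projection.parameters():
        p.requires_grad = True

# Shape check with illustrative sizes (196 patches ~ a ViT-B/16 at 224 px).
model = LlavaStyleModel()
set_stage(model, stage=1)
out = model(torch.randn(2, 196, 768), torch.randn(2, 32, 4096))
print(out.shape)  # torch.Size([2, 228, 4096])
```
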
## Training

| Hyperparameter | Value |
|---|---|
| Dataset | Flickr8K (image-caption pairs cast as instruction-following data; see the sketch below) |
| Epochs | 5 |
| Hardware | T4 GPU |
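
Since Flickr8K provides captions rather than dialogues, each caption has to be wrapped in an instruction template. A plausible minimal formatting is sketched below; the prompt wording and the `<image>` placeholder are assumptions, not this repo's actual template.

```python
# Hypothetical wrapper turning a Flickr8K caption into an instruction pair;
# the prompt text and <image> placeholder are illustrative assumptions.
def caption_to_example(caption: str) -> dict:
    return {
        "prompt": "USER: <image>\nDescribe this image briefly.\nASSISTANT:",
        "response": caption.strip(),
    }

print(caption_to_example("A dog leaps over a fallen log."))
```
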
## Results

| Split | Loss |
|---|---|
| Train | 0.23 |
| Validation | 0.22 |
## Paper

[Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), Liu et al., 2023.