# LLaVA

## Overview

From-scratch replication of LLaVA (Large Language and Vision Assistant). LLaVA connects a vision encoder to a language model decoder via a trainable projection layer, enabling visual instruction following. The projection maps image patch embeddings into the LLM’s token embedding space. Based on Visual Instruction Tuning (Liu et al., 2023).
## Architecture

- Vision encoder: CLIP ViT (image patch embeddings)
- Projection: Linear layer mapping image embeddings → LLM input space
- Language decoder: LLM (instruction-following)
- Two-stage training: (1) train the projection layer with the vision encoder and LLM frozen; (2) fine-tune the full model (sketched below)
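
The wiring and the stage split can be made concrete with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the repo's actual code: the class and attribute names (`LlavaStyleModel`, `projection`), the embedding dimensions, and the patch count are all illustrative.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Minimal skeleton: frozen vision encoder -> linear projection -> LLM decoder."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Stand-ins for the pretrained parts; in the real model these would be
        # a CLIP ViT and a pretrained causal LM (attribute names are
        # assumptions, not this repo's actual code).
        self.vision_encoder = nn.Identity()
        self.llm = nn.Identity()
        # The trainable bridge: maps patch embeddings into token space.
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds, text_embeds):
        # patch_embeds: (batch, num_patches, vision_dim) from the ViT
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM embedding table
        feats = self.vision_encoder(patch_embeds)   # frozen in stage 1
        image_tokens = self.projection(feats)       # -> (batch, num_patches, llm_dim)
        # Prepend the projected image tokens to the text tokens and decode.
        return self.llm(torch.cat([image_tokens, text_embeds], dim=1))

def set_stage(model: LlavaStyleModel, stage: int) -> None:
    """Stage 1: only the projection trains. Stage 2: fine-tune everything."""
    for p in model.parameters():
        p.requires_grad = (stage == 2)
    for p in model.projection.parameters():
        p.requires_grad = True

# Shape check with illustrative sizes (196 patches ~ a ViT-B/16 at 224 px).
model = LlavaStyleModel()
set_stage(model, stage=1)
out = model(torch.randn(2, 196, 768), torch.randn(2, 32, 4096))
print(out.shape)  # torch.Size([2, 228, 4096])
```
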
## Training

| Hyperparameter | Value |
|---|---|
| Dataset | Flickr8K (image-caption pairs cast as instruction-following data; see the sketch below) |
| Epochs | 5 |
| Hardware | T4 GPU |
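
Since Flickr8K provides captions rather than dialogues, each caption has to be wrapped in an instruction template. A plausible minimal formatting is sketched below; the prompt wording and the `<image>` placeholder are assumptions, not this repo's actual template.

```python
# Hypothetical wrapper turning a Flickr8K caption into an instruction pair;
# the prompt text and <image> placeholder are illustrative assumptions.
def caption_to_example(caption: str) -> dict:
    return {
        "prompt": "USER: <image>\nDescribe this image briefly.\nASSISTANT:",
        "response": caption.strip(),
    }

print(caption_to_example("A dog leaps over a fallen log."))
```
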
## Results

| Split | Loss |
|---|---|
| Train | 0.23 |
| Validation | 0.22 |
## Paper

[Visual Instruction Tuning](https://arxiv.org/abs/2304.08485), Liu et al., 2023.