Stage 1: Feature Alignment Pre-training

- Data: 595K image-text pairs from CC3M (filtered)
- Only the projection layer $\mathbf{W}$ is trained; both the vision encoder and LLM are frozen
- Objective: Image captioning (generate the caption given the image)
- This stage teaches the projection layer to translate visual features into the LLM's