Stage 2: Visual Instruction Tuning

- Data: 158K multimodal instruction-following examples generated using GPT-4
- The projection layer and the LLM are trained; the vision encoder remains frozen
- Data includes conversations, detailed descriptions, and complex reasoning questions
- This stage teaches the model to follow multimodal instr
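
The split between trained and frozen components in this stage can be illustrated with a small, framework-free sketch. It is not the authors' training code: the `Param` class, the `apply_stage2_freeze` helper, and the parameter name prefixes (`vision_encoder.`, `projector.`, `llm.`) are all hypothetical names chosen for illustration. In a framework such as PyTorch, the same idea is usually expressed by toggling `requires_grad` on each parameter.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter with a trainable flag."""
    name: str
    requires_grad: bool = True


def apply_stage2_freeze(params, frozen_prefixes=("vision_encoder.",)):
    """Freeze parameters whose names start with a frozen prefix.

    Everything else (here: projection layer and LLM) stays trainable.
    Returns the names of the parameters that remain trainable.
    """
    prefixes = tuple(frozen_prefixes)
    for p in params:
        p.requires_grad = not p.name.startswith(prefixes)
    return [p.name for p in params if p.requires_grad]


# Assumed, illustrative parameter names; a real model would have many more.
params = [
    Param("vision_encoder.layer0.weight"),
    Param("projector.weight"),
    Param("llm.layer0.weight"),
]

trainable = apply_stage2_freeze(params)
print(trainable)  # ['projector.weight', 'llm.layer0.weight']
```

Only the parameters returned in `trainable` would be handed to the optimizer, so the vision encoder's weights stay unchanged while the projection layer and the LLM are updated.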