
5.1 Serving Engine Setup

Deploy the model using vLLM (recommended) or HuggingFace Text Generation Inference (TGI). Configure:

- Tensor parallelism (if multiple GPUs are available).
- Maximum model length.
- GPU memory utilization target (e.g., 90%).
- Maximum number of concurrent requests.

Verify that the model loads and generates output before serving traffic.
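The configuration points above map directly onto vLLM's server flags. A minimal sketch, assuming the recommended vLLM path; the model name and all values are illustrative, not prescriptions:

```shell
# Launch vLLM's OpenAI-compatible server (model name and values are
# illustrative assumptions -- tune them for your hardware and workload).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64

# Smoke test: confirm the model loads and generates a completion.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "Hello",
       "max_tokens": 16}'
```

Here `--tensor-parallel-size` shards the model across GPUs, `--max-model-len` caps the context length, `--gpu-memory-utilization` sets the 90% memory target, and `--max-num-seqs` bounds concurrent in-flight requests.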
