Built a VLM fine-tuning pipeline from scratch as a standalone service on GCP, opening up an entirely new product line. The service supports multiple frontier vision-language models with parameter-efficient fine-tuning techniques and diverse training modalities, including VQA, Chain-of-Thought reasoning, and video.
Highlights
▸Established the full service from scratch: CI/CD, GCP deployment, training initialiser, run manager integration
▸Supports Qwen2.5-VL, Qwen3-VL, NVILA, Cosmos-Reason1/2, and Kimi-VL
▸LoRA fine-tuning with configurable quantization, reducing GPU memory requirements for large-model training
▸Tensor parallelism for multi-GPU training of large models; OOM-resilient training loops
▸VQA training with schema design, data collator, and annotation kind handling
▸Chain-of-Thought (CoT) reasoning with structured evaluation returning phrase grounding indices
▸Freeform open-ended generation training for Cosmos-Reason2 and Qwen3-VL
▸Video training via PyAV ingestion with temporal expansion and evaluation preview
▸Intelliscribe caption generation microservice with JSON schema validation for structured outputs
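The LoRA bullet above can be illustrated with a minimal sketch of the low-rank update itself (shapes and names here are illustrative, not the service's actual code): a frozen weight W is adapted as W + (alpha/r)·B·A, so only r·(d_in + d_out) parameters train instead of d_in·d_out.

```python
# Minimal illustration of a LoRA update (illustrative shapes, not the service's code).
# A frozen weight W (d_out x d_in) is adapted as W_eff = W + (alpha / r) * B @ A,
# where B is (d_out x r) and A is (r x d_in); only A and B are trained.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x3 weight, rank-1 adapters.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]      # d_out x r
A = [[0.5, 0.5, 0.5]]   # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)

# Trainable params: r * (d_in + d_out) = 1 * (3 + 2) = 5, versus 6 for full
# fine-tuning; the saving grows quadratically at realistic layer sizes.
```

In practice the adapters would be attached via a library such as PEFT and combined with 4-bit quantization of the frozen base weights, which is what drives the memory reduction.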
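The OOM-resilient loop mentioned above can be sketched as a retry pattern: catch an out-of-memory error, shrink the batch, and retry. This is a sketch under assumptions, not the service's implementation; `OutOfMemory` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU.

```python
# Sketch of an OOM-resilient training step: on out-of-memory, halve the batch
# and retry. The real loop would catch torch.cuda.OutOfMemoryError and also
# call torch.cuda.empty_cache() before retrying.

class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error in this GPU-free sketch."""

def resilient_step(batch, step_fn, min_batch=1):
    """Run step_fn on batch, halving the batch on OOM until it fits."""
    while True:
        try:
            return step_fn(batch)
        except OutOfMemory:
            if len(batch) <= min_batch:
                raise  # cannot shrink further; surface the error
            batch = batch[: len(batch) // 2]

# Simulated step that only fits 2 samples at a time.
def fake_step(batch):
    if len(batch) > 2:
        raise OutOfMemory("simulated OOM")
    return sum(batch) / len(batch)  # stand-in for a loss value

loss = resilient_step([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], fake_step)
# The 8-sample batch is halved twice (8 -> 4 -> 2) before the step succeeds.
```

Dropping samples on retry trades a little data for forward progress; an alternative is gradient accumulation over the smaller sub-batches so no samples are lost.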
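For the video bullet, "temporal expansion" commonly means mapping a clip of any length onto a fixed number of frames: short clips repeat frames, long clips are subsampled. A minimal sketch of that index-selection step (the function name is illustrative; the service decodes the actual frames with PyAV):

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a clip.

    A common temporal-expansion pattern: each sample is taken from the
    centre of an equal-width window, so short clips repeat frames and
    long clips are subsampled, and every clip yields a fixed-size input.
    """
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    return [min(int(i * step + step / 2), total_frames - 1)
            for i in range(num_samples)]

long_clip = uniform_frame_indices(100, 4)   # subsampled: 4 of 100 frames
short_clip = uniform_frame_indices(2, 4)    # expanded: frames repeat
```

The selected indices would then be passed to the decoder, which seeks to and decodes only those frames rather than the whole stream.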
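The JSON-schema validation in the caption microservice can be sketched as follows. The schema fields here are hypothetical, not Intelliscribe's actual schema; the point is that model output is parsed and checked against a schema before being accepted as a structured result.

```python
import json

# Hypothetical caption schema (field names are illustrative): each field maps
# to the Python type the parsed JSON value must have.
CAPTION_SCHEMA = {
    "caption": str,
    "objects": list,
    "confidence": float,
}

def validate_caption(raw: str) -> dict:
    """Parse model output as JSON and check it against CAPTION_SCHEMA.

    Raises ValueError on malformed JSON, missing fields, or wrong types,
    so only well-formed structured captions reach downstream consumers.
    """
    data = json.loads(raw)
    for key, expected_type in CAPTION_SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing required field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} must be {expected_type.__name__}")
    return data

ok = validate_caption(
    '{"caption": "a dog on grass", "objects": ["dog", "grass"], "confidence": 0.92}'
)
```

A production service would typically express the schema in JSON Schema and validate with a library such as `jsonschema`, which adds nested-object and constraint checks beyond this top-level type check.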