Built a VLM fine-tuning pipeline from scratch as a standalone service on GCP, opening up an entirely new product line. The service supports multiple frontier vision-language models with parameter-efficient fine-tuning techniques and diverse training modalities, including VQA, Chain-of-Thought reasoning, and video.
Highlights
▸Established the full service from scratch: CI/CD, GCP deployment, training initialiser, run manager integration
▸Supports Qwen2.5-VL, Qwen3-VL, NVILA, Cosmos-Reason1/2, and Kimi-VL
▸LoRA fine-tuning with configurable quantization, reducing GPU memory requirements for large-model training
▸Tensor parallelism for multi-GPU training of large models; OOM-resilient training loops
▸VQA training with schema design, data collator, and annotation kind handling
▸Chain-of-Thought (CoT) reasoning with structured evaluation returning phrase grounding indices
▸Freeform open-ended generation training for Cosmos-Reason2 and Qwen3-VL
▸Video training via PyAV ingestion with temporal expansion and evaluation preview
▸Intelliscribe caption generation microservice with JSON schema validation for structured outputs
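The LoRA bullet above can be illustrated with a minimal sketch of the low-rank update itself (shapes and names here are illustrative, not the service's actual code): a frozen weight W is adapted as W + (alpha/r)·B·A, so only r·(d_in + d_out) parameters train instead of d_in·d_out.

```python
# Minimal illustration of a LoRA update (illustrative shapes, not the service's code).
# A frozen weight W (d_out x d_in) is adapted as W_eff = W + (alpha / r) * B @ A,
# where B is (d_out x r) and A is (r x d_in); only A and B are trained.

def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen 2x3 weight, rank-1 adapters.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]      # d_out x r
A = [[0.5, 0.5, 0.5]]   # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)

# Trainable params: r * (d_in + d_out) = 1 * (3 + 2) = 5, versus 6 for full
# fine-tuning; the saving grows quadratically at realistic layer sizes.
```

In practice the adapters would be attached via a library such as PEFT and combined with 4-bit quantization of the frozen base weights, which is what drives the memory reduction.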
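The OOM-resilient loop mentioned above can be sketched as a retry pattern: catch an out-of-memory error, shrink the batch, and retry. This is a sketch under assumptions, not the service's implementation; `OutOfMemory` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU.

```python
# Sketch of an OOM-resilient training step: on out-of-memory, halve the batch
# and retry. The real loop would catch torch.cuda.OutOfMemoryError and also
# call torch.cuda.empty_cache() before retrying.

class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error in this GPU-free sketch."""

def resilient_step(batch, step_fn, min_batch=1):
    """Run step_fn on batch, halving the batch on OOM until it fits."""
    while True:
        try:
            return step_fn(batch)
        except OutOfMemory:
            if len(batch) <= min_batch:
                raise  # cannot shrink further; surface the error
            batch = batch[: len(batch) // 2]

# Simulated step that only fits 2 samples at a time.
def fake_step(batch):
    if len(batch) > 2:
        raise OutOfMemory("simulated OOM")
    return sum(batch) / len(batch)  # stand-in for a loss value

loss = resilient_step([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], fake_step)
# The 8-sample batch is halved twice (8 -> 4 -> 2) before the step succeeds.
```

Dropping samples on retry trades a little data for forward progress; an alternative is gradient accumulation over the smaller sub-batches so no samples are lost.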
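For the video bullet, "temporal expansion" commonly means mapping a clip of any length onto a fixed number of frames: short clips repeat frames, long clips are subsampled. A minimal sketch of that index-selection step (the function name is illustrative; the service decodes the actual frames with PyAV):

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a clip.

    A common temporal-expansion pattern: each sample is taken from the
    centre of an equal-width window, so short clips repeat frames and
    long clips are subsampled, and every clip yields a fixed-size input.
    """
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    return [min(int(i * step + step / 2), total_frames - 1)
            for i in range(num_samples)]

long_clip = uniform_frame_indices(100, 4)   # subsampled: 4 of 100 frames
short_clip = uniform_frame_indices(2, 4)    # expanded: frames repeat
```

The selected indices would then be passed to the decoder, which seeks to and decodes only those frames rather than the whole stream.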
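The JSON-schema validation in the caption microservice can be sketched as follows. The schema fields here are hypothetical, not Intelliscribe's actual schema; the point is that model output is parsed and checked against a schema before being accepted as a structured result.

```python
import json

# Hypothetical caption schema (field names are illustrative): each field maps
# to the Python type the parsed JSON value must have.
CAPTION_SCHEMA = {
    "caption": str,
    "objects": list,
    "confidence": float,
}

def validate_caption(raw: str) -> dict:
    """Parse model output as JSON and check it against CAPTION_SCHEMA.

    Raises ValueError on malformed JSON, missing fields, or wrong types,
    so only well-formed structured captions reach downstream consumers.
    """
    data = json.loads(raw)
    for key, expected_type in CAPTION_SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing required field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} must be {expected_type.__name__}")
    return data

ok = validate_caption(
    '{"caption": "a dog on grass", "objects": ["dog", "grass"], "confidence": 0.92}'
)
```

A production service would typically express the schema in JSON Schema and validate with a library such as `jsonschema`, which adds nested-object and constraint checks beyond this top-level type check.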