Datature · Export: Sole Author · Inference: Co-Author

Model Export & Inference

Jul 2022 — Present

Authored the Model Export Service (Cloud Run) and co-authored the Model Hosting & Inference Service (GKE). Together, these let customers deploy trained models to edge devices, mobile platforms, and high-throughput production environments. Currently extending the inference service with RAG and tool-calling capabilities: retrieval-augmented generation and structured function dispatch for production VLM deployments.

Highlights

  • Cross-framework export to TFLite, CoreML, TensorRT, OpenVINO, ONNX, TensorFlow, and PyTorch
  • Float16/Int8 quantization — up to 75% model size reduction for edge and mobile deployment
  • Model pruning of up to 90% of parameters for storage-constrained and real-time applications
  • Export wrapped in multiprocessing with async apply and timeout — prevents Cloud Run request hangs under load
  • Triton Inference Server integration with Numba JIT optimisation for high-throughput production inference
  • Supports all onboarded architectures across image, video, and 3D volumetric inputs (NIfTI, DICOM)
  • Active learning pipeline with entropy-based metrics and bitmask annotation re-upload
  • RAG integration in the VLM inference layer, injecting retrieved document context at inference time in production deployments
  • Tool-calling dispatch at serving time — structured function invocation from VLM inference, enabling agentic workflows without leaving the inference layer
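
The export timeout pattern from the highlights (multiprocessing with async apply and a timeout) can be sketched roughly as below. This is a minimal illustration, not the service's actual code: `export_model` and `run_with_timeout` are hypothetical names, the 5-second default is arbitrary, and the `fork` start method assumes a Linux host such as a Cloud Run container.

```python
import multiprocessing as mp
import time


def export_model(model_id: str, target_format: str, delay_s: float = 0.0) -> str:
    """Hypothetical stand-in for a real export job; sleeps to simulate work."""
    time.sleep(delay_s)
    return f"{model_id}.{target_format}"


def run_with_timeout(func, args=(), timeout_s: float = 5.0):
    """Run func(*args) in a worker process and bound how long we wait.

    Returns (True, result) on success or (False, message) on timeout.
    Exceptions raised inside the worker propagate to the caller.
    """
    # Assumption: "fork" is available (POSIX only).
    ctx = mp.get_context("fork")
    pool = ctx.Pool(processes=1)
    try:
        async_result = pool.apply_async(func, args)
        # get() raises multiprocessing.TimeoutError if the job overruns.
        return True, async_result.get(timeout=timeout_s)
    except mp.TimeoutError:
        return False, f"export timed out after {timeout_s}s"
    finally:
        # terminate() kills a worker that is still running, so a stuck
        # export cannot keep the request handler blocked.
        pool.terminate()
        pool.join()
```

When run from a script, `run_with_timeout(export_model, ("model-a", "onnx"))` returns `(True, "model-a.onnx")`, while a job that outlasts `timeout_s` returns `(False, ...)` and its worker is killed. Running the export in a separate process rather than a thread is what makes the forced termination possible.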

Tech Stack

RAG · Tool-Calling · Triton · TensorRT · TFLite · CoreML · OpenVINO · ONNX · Numba · Cloud Run · GKE · Docker · CUDA
