Datature · Export: Sole Author · Inference: Co-Author

Model Export & Inference

Jul 2022 — Present

Authored the Model Export Service (Cloud Run) and co-authored the Model Hosting & Inference Service (GKE). Together these enable customers to deploy trained models to edge devices, mobile platforms, and high-throughput production environments. Currently extending the inference service with RAG (retrieval-augmented generation) and tool-calling (structured function dispatch) capabilities for production VLM deployments.

Highlights

▸ Cross-framework export to TFLite, CoreML, TensorRT, OpenVINO, ONNX, TensorFlow, and PyTorch
▸ Float16/Int8 quantization — up to 75% model size reduction for edge and mobile deployment
▸ Model pruning that removes up to 90% of parameters for storage-constrained and real-time applications
▸ Export wrapped in multiprocessing with async apply and timeout — prevents Cloud Run request hangs under load
▸ Triton Inference Server integration with Numba JIT optimisation for high-throughput production inference
▸ Supports all onboarded architectures across image, video, and 3D volumetric inputs (NIfTI, DICOM)
▸ Active learning pipeline with entropy-based metrics and bitmask annotation re-upload
▸ RAG integration in the VLM inference layer — document context injection for retrieval-augmented generation in production deployments
▸ Tool-calling dispatch at serving time — structured function invocation from VLM inference, enabling agentic workflows without leaving the inference layer
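
The timeout-guarded export mentioned in the highlights can be illustrated with a minimal sketch. This is not the service's actual code: the helper name `run_with_timeout`, the timeout values, and the choice of the "fork" start method (assumed because Cloud Run containers run Linux) are all hypothetical. It only shows the general pattern of submitting work with `apply_async` and bounding the wait with `get(timeout=...)`.

```python
import multiprocessing as mp
import time


def run_with_timeout(func, args=(), timeout_s=5.0):
    """Run func(*args) in a worker process; return (ok, result_or_message).

    Hypothetical helper for illustration, not the service's real API.
    """
    # Assumption: "fork" start method, available on Linux containers.
    ctx = mp.get_context("fork")
    pool = ctx.Pool(processes=1)
    try:
        async_result = pool.apply_async(func, args)
        try:
            # Block for at most timeout_s seconds waiting on the worker.
            return True, async_result.get(timeout=timeout_s)
        except mp.TimeoutError:
            return False, f"export exceeded {timeout_s}s"
    finally:
        # terminate() stops a still-running worker instead of waiting on it,
        # so a stuck job cannot hold the request open.
        pool.terminate()
        pool.join()


if __name__ == "__main__":
    # Standard-library callables stand in for a real export job.
    print(run_with_timeout(pow, (2, 10)))
    print(run_with_timeout(time.sleep, (10,), timeout_s=0.2))
```

Running the job in a separate process, rather than a thread, is what allows a stuck export to be killed outright once the deadline passes.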

Tech Stack

RAG · Tool-Calling · Triton · TensorRT · TFLite · CoreML · OpenVINO · ONNX · Numba · Cloud Run · GKE · Docker · CUDA
