Datature · Export: Sole Author · Inference: Co-Author
Model Export & Inference
Jul 2022 — Present
Authored the Model Export Service (Cloud Run) and co-authored the Model Hosting & Inference Service (GKE). Together, these let customers deploy trained models to edge devices, mobile platforms, and high-throughput production environments. Currently extending the inference service with retrieval-augmented generation (RAG) and tool calling (structured function dispatch) for production VLM deployments.
Highlights
▸Cross-framework export to TFLite, CoreML, TensorRT, OpenVINO, ONNX, TensorFlow, and PyTorch
▸Float16/Int8 quantization — up to 75% model size reduction for edge and mobile deployment
▸Model pruning of up to 90% of parameters for storage-constrained and real-time applications
▸Export runs in a multiprocessing worker via async apply with a timeout, which prevents Cloud Run request hangs under load
▸Triton Inference Server integration with Numba JIT optimization for high-throughput production inference
▸Supports all onboarded architectures across image, video, and 3D volumetric inputs (NIfTI, DICOM)
▸Active learning pipeline with entropy-based metrics and bitmask annotation re-upload
▸RAG integration in the VLM inference layer, with document context injection in production deployments
▸Tool-calling dispatch at serving time — structured function invocation from VLM inference, enabling agentic workflows without leaving the inference layer
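The export timeout highlight above can be illustrated with a minimal sketch. It assumes the pattern is Python's standard `multiprocessing.Pool.apply_async` followed by a bounded `get`; the actual service code is not shown here, and the names `run_with_timeout`, `ExportTimeout`, `timeout_s`, and `start_method` are hypothetical.

```python
import multiprocessing as mp
from typing import Any, Callable, Optional, Tuple


class ExportTimeout(Exception):
    """Raised when the worker does not finish within the time budget."""


def run_with_timeout(
    fn: Callable[..., Any],
    args: Tuple[Any, ...] = (),
    timeout_s: float = 300.0,
    start_method: Optional[str] = None,
) -> Any:
    """Run fn(*args) in a one-process pool; give up after timeout_s seconds.

    fn and args must be picklable. start_method=None keeps the platform
    default ("fork", "spawn", or "forkserver").
    """
    ctx = mp.get_context(start_method)
    # Leaving the `with` block calls pool.terminate(), which kills a worker
    # that is still running, so a stuck job cannot outlive the request.
    with ctx.Pool(processes=1) as pool:
        pending = pool.apply_async(fn, args)
        try:
            # Block for at most timeout_s seconds waiting for the result.
            return pending.get(timeout=timeout_s)
        except mp.TimeoutError as exc:
            raise ExportTimeout(f"no result after {timeout_s} s") from exc
```

A request handler could call `run_with_timeout(export_model, (model_path, "tflite"), timeout_s=120)`, where `export_model` and its arguments are placeholders, and turn `ExportTimeout` into an error response instead of leaving the request open. Running the job in a separate process, not a thread, is what allows a stuck conversion to be killed outright.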
Tech Stack
RAG · Tool-Calling · Triton · TensorRT · TFLite · CoreML · OpenVINO · ONNX · Numba · Cloud Run · GKE · Docker · CUDA