What I optimize for
- Latency: conversational and real-time workflows (STT/LLM/TTS) with pragmatic fallbacks.
- Cost: quantization, hybrid deterministic + ML strategies, and infra-aware model selection.
- Reliability: isolation boundaries, rate limiting, monitoring-friendly APIs, and safe failure modes.
- Scale: async pipelines and orchestration for high-volume OCR and web crawling workloads.
Representative impact
- OCR at scale: integrated production OCR to handle 10K+ documents/day.
- Efficiency: VLM quantization that cut memory use by ~4× and infra cost by ~40%.
- Voice systems: end-to-end pipelines with sub-3s average turn latency and a multi-backend TTS fallback strategy for uptime.
- Web-scale: detection across 2M+ sites, plus async crawling orchestration handling 1M+ requests/day.
Details and context are in the full CV and in the experience timeline on the main page.