Senior - AI Engineer, Inference
Information Technology Center
Hồ Chí Minh
26-ITC-0268
- We are seeking a Senior AI Engineer (Inference) to design, optimize, and scale high-performance AI inference systems in production;
- This role focuses on serving large-scale AI models (LLMs, VLMs, Computer Vision models) with high throughput, low latency, and strong reliability. You will work closely with AI researchers, backend engineers, and platform teams to deliver real-time AI capabilities powering millions of requests;
- You are expected to deeply understand model inference optimization, distributed systems, GPU utilization, and production-grade deployment.
Job Description
1. Inference System Design & Optimization
- Design and implement high-performance inference pipelines for LLM/VLM/Computer Vision models;
- Optimize latency, throughput, and memory usage (batching, caching, KV cache, quantization);
- Deploy and tune inference engines such as vLLM, Triton Inference Server, and SGLang.
2. Distributed & Scalable Serving
- Build scalable systems that handle millions of requests;
- Implement model sharding / tensor parallelism, multi-GPU / multi-node inference, and Kubernetes-based autoscaling strategies;
- Optimize GPU utilization on systems such as NVIDIA DGX A100.
3. Production Deployment & Reliability
- Build and maintain production APIs (OpenAI-compatible, REST/gRPC);
- Integrate observability tools: metrics (Prometheus) and logs (Grafana, VictoriaLogs).
Job Requirements
- 3+ years of experience in backend / AI systems;
- Bachelor’s degree in Computer Science, Engineering or a related field;
- Strong backend skills (Python) and experience building scalable APIs with streaming and request scheduling;
- Hands-on experience with LLM/VLM/Computer Vision inference in production and with GPU optimization;
- Experience with vLLM / SGLang / Triton Inference Server for building and optimizing high-throughput, low-latency LLM inference systems;
- Familiarity with cloud platforms (AWS, GCP, Azure);
- Hands-on experience with databases (e.g., PostgreSQL, Redis, MongoDB) for caching, logging, and real-time data serving in AI systems.
