Provide first and second‑line technical support to customers for AI Infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, emails, Slack, or other messaging platforms.
Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
Monitor system health and service‑level indicators using alerts and dashboards; respond to alerts 24x7 as scheduled.
Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.