Be at the forefront of infrastructure reliability as an AI Infrastructure Site Reliability Engineer. You will focus on system performance, security, and incident management to support our growing platform.
You'll collaborate with a small, passionate infrastructure team, working closely with DevOps and leadership to improve the reliability of AI systems. This hands-on role calls for a proactive approach to automating processes, improving observability, and keeping production services cost-efficient.
Key Responsibilities:
• Maintain platform uptime and availability
• Optimize and secure infrastructure
• Resolve scaling issues proactively
• Collaborate on troubleshooting with product engineers
• Build and maintain observability systems
Requirements:
• Proven experience in Site Reliability Engineering or a related field
• Familiarity with Elixir is desirable
• Experience operating Kubernetes clusters
• Competence with Terraform