Key Responsibilities

Lead reliability engineering projects, driving them from concept to completion.
Ensure system stability, scalability, and high availability across production environments.
Design, build, and maintain efficient and scalable cloud-based infrastructure and services.
Implement and enhance observability solutions for monitoring, alerting, and logging (Grafana, Splunk, Dynatrace).
Automate manual processes using scripting languages such as Python, Bash, or PowerShell.
Work with CI/CD and configuration management tools (Jenkins, GitLab, Ansible, or Chef).
Manage containerized and orchestrated environments using Docker and Kubernetes.
Drive incident response, and conduct root cause analyses and blameless postmortems.
Ensure services meet their SLOs and SLAs, tracking SLIs and error budgets while minimizing production downtime.
Provide on-call support and proactive troubleshooting for critical production systems.