Job Description

Key Responsibilities:

- System Reliability and Monitoring: Design and implement monitoring, alerting, and automation for S3 storage clusters to achieve 99.99%+ uptime. Use tools like Prometheus, Grafana, or Catchpoint to track performance metrics and capacity utilization and to detect anomalies.
- Capacity Planning and Scaling: Forecast storage needs based on data growth trends (e.g., fleet expansion exceeding 80 PB) and proactively scale S3 buckets, lifecycle policies, and multi-region replication to support capacities of 150 PB+.
- Incident Management: Lead on-call rotations, troubleshoot storage-related incidents (e.g., data access latency, replication failures), and perform root cause analysis using methodologies like blameless post-mortems.
- Automation and Infrastructure as Code: Develop and maintain automation scripts (e.g., using Terraform, Ansible, or Python) for provisioning, configuring, and managing S3 resources, including security policies, encryption, and access...