ROLESUMMARY The HPC Operations Support Engineer is responsible for providing day‑to‑day operational support for HPC environments, ensuring system stability, availability, and SLA adherence in production environments.
KEYRESPONSIBILITIES - Monitor HPC cluster health, job execution, and resource utilization.
- Provide L1/L2 user support for HPC incidents and service requests.
- Perform routine operational checks, maintenance, and housekeeping tasks.
- Support system upgrades, patches, and scheduled maintenance activities.
- Manage HPC user onboarding, access, and quota administration.
- Escalate complex issues to HPC Engineers as required.
- Maintain operational documentation, SOPs, and runbooks.
- Ensure adherence to SLAs and operational procedures.
TECHNICALSKILLS & TOOLS - Linux administration (basic to intermediate)
- Scheduler Operations: Slurm
- Monitor...