Site Reliability Engineer (NVIDIA and Cisco UCS infrastructure)

Responsibilities & Required Skills/Experience:

CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins), Terraform, Ansible, Python

Enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos

AI Infrastructure SRE Engineer responsible for

Technical knowledge of high-performance computing, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).

Own the availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
Drive capacity planning, performance analysis, instrumentation, and other non-functional system requirements.
Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
Deliver automation through CI/CD pipelines, chatbots, and similar channels.
Implement metrics-driven processes to ensure service quality targets are met.
