Site Reliability Engineer (NVIDIA and Cisco UCS infrastructure)
Responsibilities & Required Skills/Experience:
- NVIDIA (DGX) A100/H100/H200
- Cisco UCS-C885A
- Docker
- NVIDIA-certified professionals preferred
- Infrastructure knowledge of the technologies above
- DevOps automation: CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins), Terraform, Ansible, Python
- Enterprise-grade Kubernetes cluster (Red Hat OpenShift preferred) and/or Google Anthos
- Technical knowledge of high-performance computing, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS)
AI Infrastructure SRE responsibilities:
- Own the availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
- Deliver automation through CI/CD pipelines, chatbots, etc.
- Implement metrics-driven processes to ensure service quality targets are met.