Site Reliability Engineer (NVIDIA and Cisco UCS infrastructure)

  • Engineering
  • Shift
  • Applications have closed

Site Reliability Engineer

Responsibilities & Required Skills/Experience:

  1. NVIDIA (DGX) A100/H100/H200

  2. Cisco UCS-C885A

  3. Docker

  4. NVIDIA-certified professionals preferred

  5. Infrastructure knowledge of the above technologies

  6. DevOps Automation

     CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins), Terraform, Ansible, Python

  7. Enterprise-grade Kubernetes cluster (Red Hat OpenShift preferred) and/or Google Anthos

Technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System.

AI Infrastructure SRE responsible for:

  • Ensure the availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
  • Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
  • Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
  • Deliver automation through CI/CD pipelines, chatbots, etc.
  • Implement metrics-driven processes to ensure service quality targets are met.