AI Infrastructure Platform Engineer


TECHNOGEN, Inc. has been a proven leader in full IT services, software development, and solutions for 15 years.

TECHNOGEN is a Small, Woman-Owned Minority Business with GSA Advantage Certification. We have offices in VA and MD and offshore development centers in India. We have successfully executed 100 projects for clients ranging from small businesses and non-profits to Fortune 50 companies and federal, state, and local agencies.

Description:
Local candidates only.

In This Role, You Will

  • Lead complex infrastructure initiatives supporting Generative AI and Predictive AI platforms from design to production operations.
  • Serve as a technical lead for platforms supporting AI/ML model training, inference, and batch workloads.
  • Design, build, deploy, and operate OpenShift-based container platforms optimized for high-performance GPU workloads.
  • Build, support, and operate scalable GPU SuperPod architectures with large multi-node GPU clusters.
  • Own monitoring, alerting, and observability using Grafana, Splunk, and enterprise telemetry tools.
  • Define SLIs/SLOs and build actionable alerts to proactively detect performance, capacity, and resiliency risks.
  • Build AI- and agent-based automation tools for self-healing, scaling, diagnostics, and incident remediation.
  • Apply AIOps techniques to reduce alert fatigue and improve platform reliability.
  • Lead production incident analysis and ensure operational rigor and root-cause prevention.
  • Mentor engineers and influence stakeholders across a geographically distributed organization.

Required Qualifications

  • 5 years of infrastructure engineering experience.
  • 5 years of troubleshooting complex end-to-end architectures (including CI/CD pipelines).
  • 5 years of Linux systems experience.
  • 4 years supporting AI/ML platforms.
  • 4 years of Kubernetes / container platform experience including production support.

Desired Qualifications

  • Experience with Generative AI and Predictive AI platforms.
  • Hands-on GPU platform operations including scheduling, quota, and performance tuning.
  • Experience with OpenShift in GPU-enabled, multi-tenant environments.
  • Experience designing or operating GPU SuperPods.
  • Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
  • Experience building AI- or agent-driven automation tooling (AIOps).
  • Hands-on experience supporting AI/ML workloads on Google Cloud Platform and Azure, including GPU-backed services and managed AI infrastructure.
  • Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
  • Strong experience monitoring AI signals such as inference latency and GPU utilization.
  • Experience with BCP/DR, resiliency, and highly available architectures.

Job Expectations

  • Participation in a 24×7 on-call rotation.
  • Ownership of production stability, platform health, and customer outcomes.
  • Operation in regulated enterprise environments with a strong risk and control focus.

Additional Details

  • Skill Category : Regular
  • ICIMS RR ID # : NA
  • Client Name : Wells Fargo & Company
  • Engagement Type : T&M Competitive 1-3
  • Vertical : BFSI
  • Must-Have Primary Skill : Application Management-Development-Kubernetes
  • Primary Skill: Yrs Experience : Skilled (3 to 5 Years Experience)
  • RC- Domain : Cloud & Virtualization
  • RC- Subdomain : 11 – External Cloud Server Services, Cloud Integration, Citrix Services, VMware Technologies, IoT Cloud Platforms, Cloud Development and Migration, VMware
  • RC- Role : Developer
  • RC- Experience Level : III
  • RC- Geo Tier : TBD
  • RC- Exception Required : TBD
  • COVID-19 Vaccine Required? : (No Value)
  • Client-Accepted Visa Types : (No Value)