Cloud Engineer

Company Confidential
  • Consulting
  • FlexTime
  • FullTime
  • Seasonal

About Us

We are a technology staffing and services company that helps organizations design, build, and scale digital products and engineering capabilities. Our teams deliver end-to-end software development, engineering, and design services, and we provide flexible staffing solutions that augment internal teams with specialized talent quickly and reliably.

The Role

We are seeking an innovative and resilient Cloud Engineer to join our distributed engineering team. This role focuses on designing, building, deploying, and operating scalable AI/ML infrastructure that enables product teams to prototype, train, and serve models with reliability and efficiency. You’ll bridge data science, backend engineering, and platform operations to ensure robust, observable, and cost-effective AI systems in production.

What You’ll Do

  • Cloud Architecture & Infra Design: Design and implement scalable, secure cloud architectures for AI/ML workloads across multiple environments (dev, staging, prod). Architect data pipelines, model training fleets, and model serving endpoints, and write incident response playbooks.

  • Platform & Automation: Build reusable platform components (CI/CD for ML, feature stores, model registry, experiment tracking, reusable pipelines) and automate deployment, scaling, and self-healing of AI services.

  • Model Deployment & Operations: Provision GPU/CPU clusters; manage containerized services (Docker/Kubernetes); implement inference caching, autoscaling, and canary/blue-green deployment strategies; and monitor service health and model performance in production.

  • Observability & Governance: Instrument comprehensive monitoring, tracing, logging, and alerting; establish SLAs/SLOs for latency, availability, and model quality; implement cost controls and usage dashboards.

  • Collaboration & Delivery: Work closely with Data Scientists, ML Engineers, Backend Engineers, and DevOps in an Agile environment to operationalize experiments, standardize APIs, and maintain clear documentation.

  • Security & Compliance: Implement secure coding and deployment practices; manage IAM, encryption at rest and in transit, and secrets; and address compliance requirements for regulated data environments where applicable.

What We’re Looking For

  • Experience: 3 years in cloud engineering, DevOps, or MLOps with production-grade systems; experience supporting AI/ML workloads is a plus.
  • Education: Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, Mathematics, or a related field (or equivalent practical experience).
  • Cloud & Infra Proficiency: Strong hands-on experience with at least one major cloud provider (AWS, Azure, or GCP); familiarity with Kubernetes, containerization, and cloud-native services for compute, storage, and networking.
  • ML Infrastructure: Experience with ML lifecycle tooling (MLflow, Kubeflow, Weights & Biases, or equivalent) and feature stores/ML metadata management concepts; comfort with model serving frameworks and GPUs.
  • Automation & CI/CD: Proficient in CI/CD for data/ML workloads, IaC (Terraform, CloudFormation, ARM templates), Git workflows, and configuration management.
  • Programming & SRE Practices: Proficiency in Python or another language commonly used in MLOps; strong understanding of software engineering best practices (testing, code reviews, documentation).
  • Observability: Familiarity with monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, cloud logging/monitoring services); ability to define and track SLOs/SLIs.
  • Communication: Clear written and verbal communication; ability to translate technical concepts for non-technical stakeholders.
  • Remote/Collaboration: Comfortable working asynchronously in a distributed team; self-motivated and capable of prioritizing tasks in a dynamic environment.
  • Adaptability: Comfortable handling rapid changes in priorities, diagnosing issues across distributed systems, and turning incidents into learnings.

Bonus Points

  • ML/AI Platform Experience: Hands-on with ML model training pipelines, distributed training, or serving architectures; experience with RAG, vector databases, or LLM inference at scale.
  • GPU Management & Orchestration: Experience managing GPU clusters, job schedulers, and cost-optimized GPU usage.
  • Data Compliance: Familiarity with HIPAA, SOC 2, GDPR, or other regulatory frameworks; experience applying differential privacy or federated learning.
  • Industry Context: Experience deploying AI solutions in FinTech, Healthcare, E-commerce, or SaaS domains.
  • Certifications: Cloud provider certifications (e.g., AWS Certified DevOps Engineer - Professional, Microsoft Certified: DevOps Engineer Expert, Google Professional Cloud DevOps Engineer).

Compensation & Benefits

We believe in paying top-of-market rates for top-tier talent. The base salary range for this role is $145,000 to $200,000, with exact placement determined by your skills, years of experience, and interview performance.

Additional Benefits:

Equity: Competitive stock option package.

Remote Setup: Home office stipend to get your workspace set up perfectly.

Health: Comprehensive medical, dental, and vision insurance.

Time Off: Flexible PTO policy and company holidays.

Growth: Annual learning and development budget.

Retirement: 401(k) matching plan.