What You'll Own:
LLM Platform Architecture — Actively participate in the design and evolution of the core infrastructure platform supporting LLM training, fine-tuning, and inference workloads at scale, contributing architectural decisions that balance performance, cost, and reliability across the full platform lifecycle.
Kubernetes & Advanced Autoscaling — Own the design and implementation of sophisticated K8s autoscaling strategies (HPA, VPA, KEDA, Cluster Autoscaler) tailored to the highly variable and GPU-intensive demands of LLM workloads. Our current environment is EKS, though equivalent production Kubernetes experience on GKE, AKS, or on-prem is equally valued.
ML Workflow Orchestration — Participate in the engineering and optimization of ML pipeline infrastructure, contributing to best practices for pipeline design, resource allocation, and workflow reliability across LLM training and evaluation workloads. We currently use Flyte — experience with comparable platforms such as Kubeflow, Airflow, or Prefect translates well.
AI Developer Platform — Own the architecture and operations of the interactive compute environments that AI researchers and LLM engineers use to develop, experiment, and prototype. We run JupyterHub today, though experience with equivalent multi-user ML development platforms is directly applicable.
CI/CD & GitOps — Participate in the development and ongoing improvement of GitOps workflows and CI/CD pipelines, contributing to deployment best practices and enabling rapid, reliable delivery of platform changes. Our current implementation uses ArgoCD — strong experience with GitOps principles and comparable tooling is what matters.
Observability & Reliability — Contribute to the implementation of the full observability stack: designing dashboards, defining SLOs, building alerting frameworks, and ensuring deep visibility into LLM workload performance and platform health. We use the LGTM stack (Loki, Grafana, Tempo, Mimir); experience with Prometheus, OpenTelemetry, ELK, Datadog, or equivalent platforms is welcome.
Cloud Infrastructure — Participate in cloud infrastructure design across compute (including GPU instance families), storage, networking, and IAM, with a strong emphasis on cost optimization and operational excellence. Our primary cloud is AWS — candidates with strong GCP or Azure backgrounds who are prepared to work in AWS are encouraged to apply.
Security & Compliance — Engage actively in the vulnerability assessment and remediation program across all platform components, contributing to security standards and ensuring the LLM platform meets organizational and regulatory compliance requirements.
Collaborative Engineering — Participate in technical design reviews, contribute to roadmap discussions, and serve as a knowledgeable resource and collaborative partner across AIOps and MLOps disciplines.
Required Experience & Skills:
7+ years of experience in DevOps, Platform Engineering, MLOps, or a closely related infrastructure discipline.
Deep Kubernetes expertise — production experience operating Kubernetes at scale on any major managed platform (EKS, GKE, AKS) or on-premises, with advanced knowledge of scheduling, autoscaling, networking, RBAC, and cluster operations.
Cloud infrastructure proficiency — extensive experience designing and operating production workloads on at least one major cloud provider (AWS, GCP, or Azure), covering compute, storage, networking, and identity and access management.
MLOps / AI Infrastructure experience — demonstrated experience building and operating infrastructure that supports ML training, model serving, or LLM workloads, including GPU resource management and scheduling at scale.
CI/CD & GitOps — strong hands-on experience with GitOps principles and modern CI/CD pipeline design, using any mainstream tooling (ArgoCD, Flux, GitHub Actions, Tekton, or equivalent).
Observability Engineering — production experience designing and operating observability platforms, including metrics, logging, and distributed tracing, using any modern stack (Grafana/LGTM, Prometheus, Datadog, ELK, or equivalent).
Infrastructure as Code — strong proficiency with Terraform, Helm, or comparable IaC and configuration management tooling.
Programming & Scripting — solid coding ability in Python and/or Go, with experience writing automation, tooling, and infrastructure integrations.
Security Mindset — hands-on experience with vulnerability scanning, remediation workflows, and cloud security best practices, including RBAC hardening and secrets management.
Strongly Preferred:
Direct experience with Flyte or comparable ML workflow orchestration platforms (Kubeflow, Airflow, Prefect, Metaflow)
Experience operating JupyterHub or equivalent multi-user interactive compute platforms at scale
Familiarity with LLM-specific infrastructure — model serving frameworks (vLLM, Triton, TorchServe), GPU cluster management, large-scale distributed training setups
Hands-on experience with AWS (EKS, EC2 GPU families, S3, IAM, VPC) as our current primary cloud environment
Experience with FinOps practices — cloud cost attribution, rightsizing, and spot/preemptible instance strategies for ML workloads
Relevant certifications: CKA/CKS, AWS/GCP/Azure Solutions Architect or DevOps Engineer, or equivalent
Who You Are:
A systems thinker who understands how architectural decisions ripple across reliability, performance, cost, and security — regardless of which cloud or tooling stack those decisions are made within
Operationally minded — you build things to be observable, maintainable, and resilient from day one
Deeply curious about AI and LLMs — you understand why the infrastructure you build matters and stay current with how the AI landscape is evolving
Proactive and ownership-driven — you identify problems before they become incidents and drive solutions to completion
An effective collaborator and communicator who can translate complex infrastructure concepts for AI researchers, data scientists, and engineering leadership alike
Comfortable operating with autonomy in a fast-moving environment where priorities evolve alongside the AI landscape
Why This Role Stands Out:
LLM infrastructure is one of the most technically demanding and strategically important engineering domains in the industry today. As a senior member of our AIOps team, you will:
Directly shape the platform that enables LLM development and productionization — your contributions will have immediate, measurable impact
Work on genuinely hard infrastructure problems — GPU scheduling, large-scale distributed workloads, high-throughput model serving, and multi-tenant ML environments
Be positioned at the center of the AI infrastructure space, one of the fastest-growing and highest-demand engineering disciplines in the industry
Have a clear voice in technical direction — your experience and opinions on platform design are genuinely valued and actively sought
Bring your full experience to the table — whether you've built on AWS, GCP, Azure, or hybrid environments, your platform engineering expertise is what drives impact here
We’re committed to providing accommodations on request to candidates at every stage of the recruitment and selection process. For a confidential inquiry or to request an accommodation, please contact your recruiter or email [email protected].