[Remote] Executive Director, AI Infrastructure & Platform Engineering
Note: This is a remote position open to candidates in the USA. CVS Health is dedicated to shaping a more connected and compassionate health experience. The company is seeking an Executive Director, AI Infrastructure & Platform Engineering, to lead the development and operational excellence of its AI compute platform and to ensure high availability and reliability for frontier AI workloads.
Responsibilities
- Define and execute the long-range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success
- Recruit, hire, develop, and retain a high-performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps
- Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement
- Provide executive-level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives
- Own the physical layer of the AI compute environment — GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability
- Direct bare-metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure-as-code adoption, and availability baseline enforcement
- Govern high-performance network fabric operations — RoCE v2, spine-leaf topology, lossless Ethernet tuning, congestion management, and segmentation
- Establish and enforce operational baselines across every layer of the stack — hardware, fabric, platform, and workload — with deviations detected, escalated, and resolved within defined SLAs
- Direct Innovation POD strategy to develop self-healing and autonomous capabilities that prevent service degradation before it affects availability
- Build and sustain a high-performing 24/7 operations model — designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention
- Drive end-to-end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
- Oversee change management so every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
- Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization
- Empower the Security SRE Lead to maintain a world-class security posture across the infrastructure and platform layers, with robust compliance with frameworks including HIPAA and NIST AI RMF
- Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment
- Lead the operational transition from program-launch staffing to permanent CVS-owned operations — governing phased handoffs, competency validation, and milestone sign-offs to ensure minimal disruption to platform availability and business operations
- Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
- Own vendor relationships, contract performance, and accountability across the hardware, networking, platform, and managed-services stack
- Manage budget ownership for the AI infrastructure and platform engineering organization, including capital planning and operational expense governance
Skills
- 10+ years of engineering leadership experience, with substantial time directly owning physical infrastructure at data center scale — including hardware lifecycle, capacity planning, and facility coordination (power, cooling, rack-and-stack execution)
- Hands-on production ownership of bare-metal Kubernetes or OpenShift. Managed cloud services (EKS, GKE, AKS) alone do not substitute for the practitioner expertise this role requires
- Fluency with high-speed cluster fabrics — RoCE v2, InfiniBand, EVPN-VXLAN, or carrier-grade equivalent — and the operational discipline these fabrics require (PFC, ECN, lossless tuning, congestion management)
- 5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations, with measurable team health, retention, and performance outcomes
- Proven success establishing and enforcing operational baselines, SLO / SLI / error-budget frameworks, and observability-driven continuous improvement in physical-infrastructure-anchored environments
- Hardware lifecycle, vendor accountability, and facility coordination experience — including capacity planning, RMA management, and multi-vendor escalation
- Experience leading operational transitions or organizational build-outs at scale, with business continuity and minimal disruption as non-negotiables
- Executive-level stakeholder communication, vendor negotiation, and budget ownership
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or related technical field
- Hands-on experience with Cisco UCS, NVIDIA HGX / DGX / Blackwell systems, and VAST or comparable distributed NVMe storage
- Direct experience operating GPU clusters of 32 or more GPUs in production environments — including HPC, AI training, research computing, or comparable workloads
- NVIDIA AI Enterprise, NVIDIA Run:AI, NVIDIA Base Command Manager, or comparable GPU orchestration platform experience
- Healthcare or other regulated-industry background (HIPAA, NIST AI RMF, SOX, FedRAMP, ITAR)
- Chaos engineering and AI-driven operations experience — predictive alerting and automated remediation patterns
- Background in innovation programs, POD structures, or centers of excellence
Benefits
- This position is eligible for a CVS Health bonus, commission, or short-term incentive program in addition to base pay.
- This position also includes an award target in the company’s equity award program.
- This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families.
- The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.
Company Overview