[Remote] Distinguished Engineer, GPU Fleet Operations Automation
Note: This is a remote position open to candidates in the USA. NVIDIA is leading the industry in delivering accelerated computing in cloud and enterprise environments. As a technology leader, you will lead the development of the DGX Cloud strategy for GPU fleet lifecycle, health, observability and utilization monitoring, and remediation, collaborating with cross-functional teams to deliver high operational standards.
Responsibilities
- Architectural Work: Define and drive the technical implementation of the DGX Cloud operations practice for the GPU fleet lifecycle
- Collaborate Across Domains and Disciplines: Drive the technical strategy for, and awareness of, best practices and technical capabilities in DGX Cloud engineering practices
- Accelerate Integration: Guide technical delivery into DGX Cloud across all delivery environments: enterprise, public cloud, and high-security, isolated, sovereign environments
- Engage Stakeholders: Collaborate with customers, infrastructure providers, and partners to ensure NVIDIA’s solutions set the industry standard for operational excellence
- Full Software and System Lifecycle: Lead all technical aspects of planning and continuous evolution across a large technical scope, from ideation through architecture, design, development, deployment, operations, and full lifecycle management
Skills
- 15-18+ years overall in technical roles with a focus on operations and automation for cloud infrastructure, platforms, and applications
- 5-10+ years of lead experience
- BS/MS or higher, or equivalent experience, in systems/software engineering or a related engineering field
- Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, and higher-level abstractions (IaaS, Kubernetes, Slurm), as well as AI/ML platforms and applications
- Demonstrated success delivering high-impact, technically complex solutions that achieve high levels of transparency into resource utilization, performance, and operational insights
- Technical Leadership: Ability to synthesize multi-functional needs into architecture and design while guiding internal execution across complementary teams
- Communication and Partnership: Strong collaboration and influence skills, capable of leading engineering engagement, presenting with peers and partners, and working with high-performance accelerated computing customers
- Application of Artificial Intelligence: Real-world experience applying AI to component- and system-level issue identification and remediation
- Industry Expertise: Direct experience in designing, developing, delivering, and operating highly available, scaled-out systems in enterprise and cloud environments
- Engineering Enablement: Demonstrated history of creating scalable processes and extensible systems that facilitate operations at scale
- Open Source Collaboration: Familiarity with open source ecosystems and projects. Ability to collaborate and influence in open source project governance, and to represent the interests of NVIDIA customers and partners in technical alignment and direction
Benefits
- You will also be eligible for equity and benefits.