[Remote] Principal Site Reliability Engineer - AI Infrastructure Operations
Note: This is a remote position open to candidates in the USA. Nscale is a GPU cloud company focused on providing high-performance infrastructure for AI. They are seeking a Principal Site Reliability Engineer to lead reliability strategy and operational excellence for their AI Infrastructure Operations team, ensuring the scalability and reliability of their demanding AI platforms.
Responsibilities
- Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
- Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
- Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
- Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability
Skills
- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost
- Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
- Experience designing observability systems for high-cardinality, high-throughput environments
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
- A history of driving step-change improvements in reliability, scalability, or operational efficiency
Benefits
- Bonus
- Equity
- Commission programs
- Medical
- Dental
- Vision
- Flexible paid time off
- Parental leave
- Retirement plan participation
- Human-First Flexibility: We treat you as a human first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you wherever you work.
Company Overview