[Remote] Principal Site Reliability Engineer - AI Infrastructure Operations

Work from home | Full-time

Note: This is a remote position open to candidates in the USA. Nscale is a GPU cloud company focused on providing high-performance infrastructure for AI. It is seeking a Principal Site Reliability Engineer to lead reliability strategy and operational excellence for its AI Infrastructure Operations team, ensuring the scalability and reliability of its demanding AI platforms.

Responsibilities

Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Skills

10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
Deep expertise in Linux, networking, and distributed systems design at scale
Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
Proven ability to lead technical initiatives across teams without direct authority
Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost
Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
Experience designing observability systems for high-cardinality, high-throughput environments
Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
A history of driving step-change improvements in reliability, scalability, or operational efficiency

Benefits

Bonus
Equity
Commission programs
Medical
Dental
Vision
Flexible paid time off
Parental leave
Retirement plan participation
Human-First Flexibility: We treat you as a human first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Remote-First Team: Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you wherever you work.

Company Overview

Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.nscale.com.
