Job Description

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the exciting new products the company is bringing to market.

This is a rare opportunity to work at the intersection of infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

$300,000 gross per year
Equity

Job Tags

Permanent employment

Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA
