Member of Technical Staff (AI Infrastructure Engineer)

Perplexity · San Francisco; Palo Alto · onsite
Compensation
$220k–$405k base / year (USD)
We are looking for an AI Infra engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

RESPONSIBILITIES
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for distributed training of large language models
- Develop robust APIs and orchestration systems for both training pipelines and inference services
- Implement resource scheduling and job management systems across heterogeneous compute environments
- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
- Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
- Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
- Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

QUALIFICATIONS
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
- Experience deploying and managing distributed training systems at scale
- Deep understanding of container orchestration and distributed systems architecture
- High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)
- Experience managing GPU clusters and optimizing compute resource utilization

REQUIRED SKILLS
- Expert-level Kubernetes administration and YAML configuration management
- Pr