Expoint – all jobs in one place
Finding the best job has never been easier
Limitless High-tech career opportunities - Expoint

Red hat Senior Performance Resilience Engineer - LLM Inference 
United States, North Carolina, Raleigh 
915674689

Today

What you will do:

  • Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD

  • Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)

  • Build an automated harness (preferably extending krkn-chaos (https://github.com/krkn-chaos/krkn) ) to run controlled experiments with scoped blast radius, and evidence capture (logs, traces, metrics)

  • Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates

  • Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)

  • Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d

  • Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.

  • Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums

What you will bring:

  • 3+ years in reliability, and/or performance engineering on large-scale distributed systems

  • Expertise in systems‑level software design

  • Expertise with Kubernetes and modern LLM inference server stack (e.g., vLLM, TensorRT-LLM, TGI)

  • Observability & forensics skills with experience with Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, PyTorch Profiler; adept at converting raw signals into actionable narratives.

  • Fluency in Python (data & ML), strong Bash/Linux skills

  • Exceptional communication skills - able to translate raw data into customer value and executive narratives

  • Commitment to open‑source values and upstream collaboration

The following is considered a plus:

  • Master’s or PhD in Computer Science, AI, or a related field

  • History of upstream contributions and community leadership, public talks or blogs on resilience, or chaos engineering

  • Competitive benchmarking and failure characterization at scale.

The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave