Expoint – all jobs in one place

Red Hat Principal Software Engineer - Performance Scale Engineering
Israel, Center District, Raanana 
137665342

Today

What you will do:

  • Design performance test plans for new MLPerf Training and Inference suites (datacenter & edge), primarily covering LLMs.

  • Execute accuracy, performance and power submissions; triage regressions and drive improvements using LoadGen logs, traces, and other profiling tools.

  • Build and maintain highly optimized container-based harnesses (Podman/Kubernetes) that run benchmarks, dataset pre-processing steps, and compliance checks reproducibly in CI.

  • Profile kernels, GPU runtimes and distributed collectives; propose patches to the Red Hat AI software stack to remove bottlenecks revealed by MLPerf™ results.

  • Represent Red Hat in MLCommons™ working groups; upstream fixes and new benchmark proposals.

  • Present results and best practices at premier open source and industry conferences; author technical blogs and white papers that translate benchmark data into customer value.

What you will bring:

  • 5+ years of relevant industry experience in performance engineering or ML infrastructure

  • Hands-on experience with MLPerf (Training or Inference) harnesses

  • Strong Python & Bash proficiency plus one systems language (Go/Rust/C++)

  • Expert Linux skills (cgroups, scheduler, perf, NUMA, GPU drivers, etc.)

  • Experience with container orchestration (Kubernetes, OpenShift)

  • Performance-profiling literacy (nvprof, Nsight Systems, perf, eBPF)

  • Clear written & spoken English

The following will be considered a plus:

  • Master’s or PhD in Computer Science, AI, or a related field

  • Prior MLPerf submission ownership or MLCommons working-group membership

  • Deep knowledge of LLM inference runtimes

  • Performance patches contributed to PyTorch or TensorFlow