Expoint – all jobs in one place
The place where the best experts and companies meet

Performance Engineer jobs in United States, Illinois, Springfield

Realize your potential in the high-tech industry with Expoint! Search for job opportunities as a Performance Engineer in United States, Illinois, Springfield and join the thousands who have already found work at leading companies. Start your journey today and find your ideal career as a Performance Engineer with Expoint.
06.09.2025

Red Hat: Senior Performance Resilience Engineer - LLM Inference (United States, Illinois, Springfield)

Description:

What you will do:

  • Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD

  • Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)

  • Build an automated harness (preferably extending krkn-chaos, https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)

  • Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates

  • Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)

  • Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d

  • Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling

  • Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums

What you will bring:

  • 3+ years in reliability and/or performance engineering on large-scale distributed systems

  • Expertise in systems‑level software design

  • Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)

  • Observability and forensics skills, with experience in Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives

  • Fluency in Python (data & ML), strong Bash/Linux skills

  • Exceptional communication skills - able to translate raw data into customer value and executive narratives

  • Commitment to open‑source values and upstream collaboration

The following is considered a plus:

  • Master’s or PhD in Computer Science, AI, or a related field

  • History of upstream contributions and community leadership, or public talks or blogs on resilience or chaos engineering

  • Competitive benchmarking and failure characterization at scale

The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

Plan your next career move in the high-tech industry with Expoint! Our platform offers a wide range of Performance Engineer positions in the United States, Illinois, Springfield area and gives you access to the best companies in the field. Whether you are looking for a new challenge or a change of scenery, Expoint makes it easy to find the perfect job match. With our easy-to-use search engine, you can quickly find job opportunities and connect with leading companies. Sign up today and take the next step in your high-tech career with Expoint.