What you will do:
Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)
Build an automated harness (preferably extending krkn-chaos: https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)
Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates
Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d
Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling
Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums
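To make the go/no-go gating idea above concrete, here is a minimal sketch of how a CI resilience gate might evaluate post-experiment indicators against thresholds. The indicator names and threshold values are hypothetical illustrations, not part of any existing Red Hat, vLLM, or krkn-chaos tooling:

```python
# Minimal sketch of a resilience go/no-go gate for a CI pipeline.
# Indicator names and thresholds below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    observed: float
    threshold: float
    higher_is_worse: bool = True  # most resilience indicators degrade upward

def gate(indicators):
    """Return (go, failures): go is True only if every indicator passes."""
    failures = []
    for ind in indicators:
        bad = (ind.observed > ind.threshold) if ind.higher_is_worse \
              else (ind.observed < ind.threshold)
        if bad:
            failures.append(ind.name)
    return (not failures, failures)

# Example: evaluate results captured after a fault-injection run.
results = [
    Indicator("p99_latency_regression_pct", observed=3.5, threshold=5.0),
    Indicator("recovery_time_s", observed=42.0, threshold=60.0),
    Indicator("requests_dropped_during_fault", observed=12, threshold=0),
]
go, failures = gate(results)
print("GO" if go else f"NO-GO: {failures}")
```

In a pipeline, a NO-GO result would fail the job, blocking the release alongside any performance gates.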
What you will bring:
3+ years in reliability and/or performance engineering on large-scale distributed systems
Expertise in systems‑level software design
Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
Observability and forensics skills: experience with Prometheus/Grafana, OpenTelemetry tracing, eBPF/bpftrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives
Fluency in Python (data & ML), strong Bash/Linux skills
Exceptional communication skills - able to translate raw data into customer value and executive narratives
Commitment to open‑source values and upstream collaboration
The following is considered a plus:
Master’s or PhD in Computer Science, AI, or a related field
History of upstream contributions and community leadership; public talks or blogs on resilience or chaos engineering
Competitive benchmarking and failure characterization at scale
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave
About The Job
This position requires regular on-site work with clients across North America, so willingness to travel to customer locations 30-40 weeks per year is required. Applicants must reside in close proximity to a primary airport.
What You Will Do
• Implement automated, containerized cloud application platform solutions with a focus on infrastructure concerns, including networking, storage, virtualization, security, logging, monitoring, high availability, and system resilience
• Learn new technologies quickly, including container orchestration, container registries, container build strategies, cloud storage, and software-defined networks
• Travel frequently to work alongside leading financial services, retail, telecommunication, and institutional customers
After joining Red Hat, you will go through an intensive training program on Kubernetes and OpenShift technologies and related DevOps and GitOps topics. Here's how your skill set will evolve and what you'll learn during your first year in the role:
• Understanding of how to build production-ready container and virtualization platforms, integrated with existing enterprise systems
• Knowledge of how to deploy source code into running, scalable containers and virtual machines in automated fashion at enterprise scale
• Practical experience with our offerings
Within 6 months, be ready to implement a routine container platform project by attaining the following:
• Successful, collaborative delivery of customer requirements using Red Hat OpenShift
• Knowledge of how a customer use case can be developed into a project plan and how those requirements align with Red Hat’s technologies
• Understanding of how Red Hat’s technologies can transform software delivery (DevOps/GitOps) practices at large organizations
Within 12 months, begin to demonstrate technical leadership in container platforms by accomplishing the following:
• Successfully implementing complex, large-scale container platform solutions in challenging customer environments
• Helping other peers learn DevOps/GitOps paired with container technologies
• Contributing lessons learned, best practices, and how-tos to our internal and external communities of practice
• Applying new technologies, frameworks, or methodologies to container platforms
What You Will Bring
• Experience leading successful modern cloud platform consulting engagements
• Broad and deep technical experience with VMware ESXi, including vCenter and VM lifecycle operations using VMware tools
• Knowledge of the logging and alerting functions for SRE operations available from VMware, and how they integrate with non-VMware enterprise systems such as Splunk
• Expert-level knowledge of VMware foundational networking technologies (standard and distributed vSwitches, NSX)
• Expert-level knowledge of VMware foundational storage technologies (datastores, vSAN, vVols)
• Knowledge of common add-ons and third-party tools, including the Aria/vRealize Suite, SRM, backup software, and performance tools
• Experience with technologies including OpenStack, Red Hat Virtualization, Microsoft Hyper-V, Amazon Web Services, and Microsoft Azure a plus
• Experience across one or more vertical industry areas
• Demonstrated track record of working in a strategic advisory role to senior IT and business executives
• Applied knowledge and experience working in agile, scrum, and DevOps teams
• Excellent written and verbal communication and presentation skills
• Willingness to travel to customer locations about 30-40 weeks per year on average across North America
• Degree in computer science or a technical discipline
The salary range for this position is $111,260.00 - $183,580.00. Actual offer will be based on your qualifications.