What you will do:
Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)
Build an automated harness (preferably by extending krkn-chaos: https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)
Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates
Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d
Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums
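The go/no-go resilience gates described above could take many forms; as one hypothetical sketch (the metric names, thresholds, and helper below are illustrative, not part of any existing vLLM or krkn-chaos API), a gate might compare post-fault metrics against steady-state baselines:

```python
from dataclasses import dataclass

@dataclass
class GateThresholds:
    """Illustrative resilience thresholds; real values would come from SLOs."""
    max_p99_latency_ratio: float = 1.5   # post-fault p99 vs. steady-state p99
    max_error_rate: float = 0.01         # fraction of failed requests during fault
    max_recovery_seconds: float = 120.0  # time to return to steady state

def evaluate_gate(steady_p99: float, fault_p99: float,
                  error_rate: float, recovery_seconds: float,
                  t: GateThresholds = GateThresholds()) -> tuple[bool, list[str]]:
    """Return (go, reasons); go is False if any threshold is violated."""
    reasons = []
    if fault_p99 > steady_p99 * t.max_p99_latency_ratio:
        reasons.append(f"p99 degraded {fault_p99 / steady_p99:.2f}x "
                       f"(limit {t.max_p99_latency_ratio}x)")
    if error_rate > t.max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} exceeds {t.max_error_rate:.2%}")
    if recovery_seconds > t.max_recovery_seconds:
        reasons.append(f"recovery took {recovery_seconds:.0f}s "
                       f"(limit {t.max_recovery_seconds:.0f}s)")
    return (not reasons, reasons)
```

A CI job would run the fault experiment, collect these numbers from its metrics store, and fail the pipeline whenever the gate returns False.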
What you will bring:
3+ years in reliability and/or performance engineering on large-scale distributed systems
Expertise in systems‑level software design
Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
Observability and forensics skills: experience with Prometheus/Grafana, OpenTelemetry tracing, eBPF/bpftrace/perf, Nsight Systems, and the PyTorch Profiler; adept at converting raw signals into actionable narratives
Fluency in Python (data & ML), strong Bash/Linux skills
Exceptional communication skills - able to translate raw data into customer value and executive narratives
Commitment to open‑source values and upstream collaboration
The following is considered a plus:
Master’s or PhD in Computer Science, AI, or a related field
History of upstream contributions and community leadership; public talks or blogs on resilience or chaos engineering
Competitive benchmarking and failure characterization at scale
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave
What you will do:
Strategic Enablement Leadership: Drive the development and execution of technical enablement plans for key partners, ensuring alignment with Red Hat’s business priorities and regional growth objectives.
Solution Design & Innovation: Architect and deliver customized, scalable solutions that address complex business and technical challenges across hybrid cloud environments.
Cross-Functional Coordination: Lead collaboration across Red Hat’s technical ecosystem — including product specialists, consulting, and support — to accelerate customer adoption and ensure long-term success.
Technical Mentorship: Coach and develop junior architects and partner engineers, fostering a culture of excellence, innovation, and best practices within the team.
Hands-On Evangelism: Facilitate technical workshops, RHUGs, community events, proof-of-concepts, and joint innovation initiatives with partners and customers to showcase the value of Red Hat’s technologies.
The salary range for this position is $202,380.00 - $323,780.00 (inclusive of base pay + target incentive compensation). Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave
What you will do:
Provide an exceptional customer experience through professional communication, applying product knowledge and deep troubleshooting to take direct action in cluster environments and resolve issues.
Contribute to global initiatives and projects that reduce customer effort and improve tooling, designing and writing automation software to increase efficiency.
Act as the direct contact and advisor for customer inquiries and issues with their Cloud Services through our Customer Portal, conference calls, and remote access.
Proactively analyze cluster status, identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions.
Record customer interactions, including investigation and troubleshooting steps, documenting diagnoses and resolutions to create reusable solutions for future incidents.
Create and maintain knowledge articles aligned with the KCS (Knowledge-Centered Service) methodology.
Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services.
Manage incident and issue workloads to ensure that all customer issues are handled and resolved in a timely manner.
Maintain a strong work ethic, work effectively as part of a team, and stay focused on customers and resolving their issues.
Be available to perform weekend shift duties on a rotational schedule.
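The proactive single-point-of-failure analysis above lends itself to automation. As a hypothetical sketch (the `nodes` field and the helper itself are illustrative; the other field names mirror Kubernetes Deployment objects as returned by `oc get deploy -o json`):

```python
def find_single_points_of_failure(deployments: list[dict]) -> list[str]:
    """Flag deployments that would go fully down if one pod or node is lost.

    Each dict carries metadata.name and spec.replicas (as in a Kubernetes
    Deployment), plus an illustrative `nodes` set: the nodes its pods run on.
    """
    risky = []
    for d in deployments:
        name = d["metadata"]["name"]
        replicas = d["spec"].get("replicas", 1)
        nodes = d.get("nodes", set())
        if replicas < 2:
            risky.append(f"{name}: only {replicas} replica(s)")
        elif len(nodes) < 2:
            risky.append(f"{name}: all replicas on one node")
    return risky
```

A real check would also consider PodDisruptionBudgets, anti-affinity rules, and zone spread, but the shape of the analysis is the same: enumerate workloads, then flag any whose availability hinges on a single pod or node.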
What you will bring:
5+ years of experience in a customer-facing technical support or solutions engineering role.
Proven experience in Infrastructure Implementation, Deployment, Administration, and Production Support of container technologies and orchestration platforms (e.g., CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).
Experience with developer workflows, Continuous Integration (e.g., Jenkins), and Continuous Deployment paradigms.
Exceptional technical, analytical, and troubleshooting skills, using tools such as curl, strace, oc (kubectl), and Wireshark to investigate issues and form precise remediation plans across networking, system performance, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.
Experience working with tools surrounding the Kubernetes ecosystem such as Prometheus, Grafana, FluentD, etc.
Experience working with configuration management and automation tools (e.g., Ansible, Terraform) and monitoring tools (e.g., Splunk).
Proficient scripting and automation skills (e.g., Python, Bash, Go) for converting manual and maintenance tasks into fully orchestrated automation are a plus.
Ability to operate in complex, highly secure, and highly available environments and interact with Site Reliability Engineering (SRE) domain experts maintaining those environments.
Familiarity with established ITIL practices such as Incident, Change, Problem, and Release Management.
Excellent English communication skills (written and verbal) and interpersonal skills, with a desire to mentor other members of the support team and share technical knowledge in a helpful and timely fashion.
Experience logging issues and working with issue tracking tools such as Jira.
Ability to work effectively as part of an agile team, actively communicate status, and complete deliverables on schedule with a strong sense of initiative and ownership.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Ability to work effectively and collaborate within a geographically distributed, global team.
The salary range for this position is $84,400.00 - $134,970.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave
About the Job
In this role, you will work with a diverse team of highly motivated engineers on designing, implementing, and integrating AI Core Platform capabilities, and contribute directly to upstream communities such as PyTorch, FSDP, vLLM, and Triton.
You will work closely with product engineering groups within Red Hat focused on integrating and delivering enterprise-ready software that’s hardened, tested, and securely distributed with our AI/MLOps platforms.
What you will do
Develop and maintain a high-quality, high-performing AI Core platform open source upstream stack enabling Red Hat's AI/MLOps platform offerings
Maintain CI/CD build pipelines for container images that allow faster, more secure, reliable, and frequent releases
Contribute directly to upstream runtime communities such as PyTorch, FSDP, vLLM, and Triton
Consistently participate in community meetings and take on leadership opportunities in foundation/project governance topics
Share upstream contributions at events and conferences and via technical blogs and publications
Coordinate and communicate with various Red Hat product and open source stakeholders
Apply a growth mindset by staying up to date on the latest advancements in AI frameworks, hardware accelerators, and ML techniques
What you will bring
Highly experienced with programming in Python and PyTorch
Experience with hardware accelerators (e.g., GPUs, FPGAs) for AI workloads
Experience with Python packaging, such as publishing libraries to PyPI
Development experience with C++ and CUDA APIs is a big plus
Solid understanding of the fundamentals of model training and inferencing architectures
Experience with Git, shell scripting, and related technologies
Experience with the development of containerized applications in Kubernetes
Experience with cloud computing using at least one of the following cloud infrastructures: AWS, GCP, Azure, or IBM Cloud
Ability to work across a large distributed hybrid engineering team
Experience with open-source development is a plus
The salary range for this position is $170,600.00 - $281,370.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave