

What you will do:
Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)
Build an automated harness (preferably extending krkn-chaos, https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)
Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates
Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d
Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums
What you will bring:
3+ years in reliability and/or performance engineering on large-scale distributed systems
Expertise in systems‑level software design
Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
Observability and forensics skills, with experience in Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives.
Fluency in Python (data & ML), strong Bash/Linux skills
Exceptional communication skills - able to translate raw data into customer value and executive narratives
Commitment to open‑source values and upstream collaboration
The following is considered a plus:
Master’s or PhD in Computer Science, AI, or a related field
History of upstream contributions and community leadership, or public talks or blogs on resilience or chaos engineering
Competitive benchmarking and failure characterization at scale.
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

What you will do:
Provide an exceptional customer experience by communicating professionally and applying product knowledge and deep troubleshooting skills to take direct action in cluster environments and resolve issues.
Contribute to global initiatives and projects to constantly reduce customer effort, improve tooling, and design and write automation software to improve efficiency.
Act as the direct contact and advisor for customer inquiries and issues with their Cloud Services through our Customer Portal, conference calls, and remote access.
Proactively analyze cluster status, identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions.
Record customer interactions, including investigation, troubleshooting, and resolution, so that documented diagnostic steps become reusable solutions for future incidents.
Create and maintain knowledge articles aligned with the KCS (Knowledge-Centered Service) methodology.
Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services.
Manage incident and issue workloads to ensure that all customer issues are handled and resolved in a timely manner.
Maintain a strong work ethic, able to work effectively as part of a team, and focus on customers and resolving their issues.
Be available to perform weekend shift duties on a rotational schedule.
What you will bring:
5+ years of experience in a customer-facing technical support or solutions engineering role.
Proven experience in Infrastructure Implementation, Deployment, Administration, and Production Support of container technologies and orchestration platforms (e.g., CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).
Experience with developer workflows, Continuous Integration (e.g., Jenkins), and Continuous Deployment paradigms.
Exceptional technical, analytical, and troubleshooting skills using tools like curl, strace, oc (kubectl), and Wireshark analysis to investigate and form precise action plans for issue remediation with components such as networking, system performance issues, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.
Experience working with tools surrounding the Kubernetes ecosystem such as Prometheus, Grafana, FluentD, etc.
Experience working with configuration management tools (e.g., Ansible, Terraform) and monitoring and automation tools (e.g., Ansible, Splunk).
Proficient scripting and automation skills (e.g., Python, Bash, Go) to convert manual and maintenance functions into fully orchestrated automation are a plus.
Ability to operate in complex, highly secure, and highly available environments and interact with Site Reliability Engineering (SRE) domain experts maintaining those environments.
Familiarity with established ITIL practices such as Incident, Change, Problem, and Release Management.
Excellent English communication skills (written and verbal) and interpersonal skills, with a desire to mentor other members of the support team and share technical knowledge in a helpful and timely fashion.
Experience logging issues and working with issue tracking tools such as Jira.
Ability to work effectively as part of an agile team, actively communicate status, and complete deliverables on schedule with a strong sense of initiative and ownership.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Ability to work effectively and collaborate within a geographically distributed, global team.
The salary range for this position is $84,400.00 - $134,970.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

About the job:
Red Hat is hiring a Specialist Solution Architect (OpenShift) for Financial Services North America. The Specialist will drive Sales of the OpenShift Platform. In this role, you will be the catalyst creating opportunities, solving problems, and establishing working relationships with customers in our key FSI enterprise accounts and service provider partners. You’ll need to possess excellent communication and people skills balanced with technical expertise, passion for open source, and a thorough understanding of business and IT challenges encountered in financial services. Through a series of structured in-person interactions, the Specialist Solution Architect will win the trust and confidence of customer engineering, development, and operations teams by aligning their requirements and use cases with the functional capabilities of the OpenShift Platform.
What you will do:
Act as a technical advisor, guiding customers from presales to post-sales implementation, ensuring successful deployments.
Lead technical validation through demos, workshops, and pilot projects to align customer needs with Ansible capabilities.
Develop reusable solution frameworks and content to empower sales teams and standardize customer outcomes.
Collaborate with product teams to enhance customer experience and advocate for customer needs internally.
Assist your team with responding to RFPs for customer success.
What you will bring:
Technical Skills:
Expertise in the OpenShift Platform (certifications preferred) or Kubernetes. Strong hands-on skills.
5+ years in OpenShift or Kubernetes; 5-10 years in architecture/development/consulting roles.
Proficiency in Linux and DevOps methodologies.
Proficiency in configuring accelerators such as NVIDIA, AMD or Intel GPUs.
Business Skills:
Ability to engage engineers and architects, address enterprise IT challenges, and propose cross-platform solutions.
Experience building relationships across large IT organizations and managing end-to-end proof-of-concept processes.
Preferred Qualifications:
Red Hat certifications (Red Hat Certified OpenShift Architect, RHCE) and a degree in Computer Science/Engineering.
Thought leadership through industry contributions (whitepapers, conferences, etc.) and staying updated on Kubernetes and AI trends.
The salary range for this position is $177,540.00 - $283,950.00 (inclusive of base pay + target incentive compensation). Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

What you'll do:
Discover and analyze customers' business and technical problems, and design the appropriate Red Hat solution that aligns with their needs.
Provide technical leadership to Red Hat account teams and customers through sales presentations, product demonstrations, workshops, and proofs of concept.
Utilize your technical expertise to assist sales teams in answering customer questions.
Form strategic business and technical relationships within each customer organization to identify new opportunities.
Own and attain a bookings quota by utilizing your expertise to contribute to the success of our Account Teams in closing sales opportunities.
Share your expertise by contributing to documentation (e.g., wikis, quick-start guides, blog posts, and white papers), developing reusable assets (e.g., demonstrations, workshops), and presenting at conferences and industry events.
Collaborate broadly with Red Hat Engineering and Business Units, Professional Services, Sales teams, and partners to ensure excellent customer experience with our offerings and solutions.
What you'll bring:
Experience in sales engineering, consulting, IT architecture, or equivalent in supporting large organizations.
Excellent communication, presentation, documentation, and problem-solving skills.
Positive team player attitude and a strong desire to make our customers successful.
Ability to understand customers' challenges, requirements, and technical issues and collaborate to address those with the appropriate technologies and solutions.
Knowledge of Linux or UNIX operating systems, preferably Red Hat Enterprise Linux (RHEL), or a RHEL derivative, as well as related technologies.
Knowledge of virtualization technologies for application deployment and operations like VMware vSphere, KVM, etc.
Experience with Kubernetes deployments in large organizations.
Knowledge of cloud Platform-as-a-Service (PaaS) technologies like Red Hat OpenShift or related products and solutions.
Knowledge of DevOps from the technology point of view.
Familiarity with the design and operation of medium to large-scale enterprise and cloud infrastructures, databases, and application systems with high availability requirements. Extensive experience with the application life cycle in traditional environments.
Willingness to travel within the region.
One or more of the following is a plus:
Experience with automation frameworks and tools like Puppet, Ansible, or Terraform.
Experience with management, logging, and observability products and solutions.
Experience or certification with any public cloud provider.
Any Red Hat certification.
The salary range for this position is $202,380.00 - $323,780.00 (inclusive of base pay + target incentive compensation). Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

1. Build the Red Hat Value Practice
Collaboratively define the vision of customer value across the entire Red Hat customer lifecycle and the value community
Build and execute strategies to establish a comprehensive value community partnering with internal teams (BD, professional services etc.)
Establish and evolve a Red Hat value framework and deliverables to help sell the value across the portfolio and scale across customer segments
Create a knowledge base including a library of reusable value assets, including value maps, benefit models, benchmarking, templates, tools etc.
2. CXO Customer Advisory on ROI and Business Value Achievement
Directly engage with customer executive teams, helping articulate the strategic and financial impact of Red Hat’s digital transformation message to accelerate Sales cycles
Develop C-level account strategies, ROI investment justifications, deal structures, commercial proposals, and value realization analysis
Serve as a key source of market insights into how our customers view the economic benefit of using Red Hat relative to on-prem or competition
Identify, lead and contribute to the creation of thought leadership (best practices, white papers, workshops, etc.) for pipeline building
3. Sales and Customer Success Counsel
Act as a trusted advisor to regional Sales management by providing guidance on account strategies and helping prioritize Sales pursuits
Manage relationships with regional Sales and Customer Success leadership to build effective enterprise Sales best practices
Enable Sales teams to elevate their executive discussions by facilitating value-focused discovery and delivering business value impact analysis
What You'll Bring:
8+ years of detailed business case development, ROI / TCO financial modelling and executive storytelling experience
Strong experience in Management Consulting or Investment Banking
Strong skills in developing executive narrative slideware, at times with input from multiple internal stakeholders
Experience engaging with C-level executives and complex internal stakeholder groups
Strong program management and communication skills
Presentation skills including public speaking, meeting facilitation, and whiteboarding
Ability to accommodate limited business travel
Would be a Plus:
Familiarity with selling complex solutions or transformation programs in the cloud infrastructure or platform space
Prior experience with people leadership
Bachelor’s degree; Master’s degree in business administration
The salary range for this position is $260,330.00 - $429,590.00 (inclusive of base pay + target incentive compensation). Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

Primary Job Responsibilities
Serve as a subject matter expert to the SA/SSA community in respective geography, ensuring that architectural patterns are reused in order to create Sales, Support, and Engineering efficiencies.
Understand both how to define an architecture and how to code it as a reusable GitOps pattern, using tools like Helm or Ansible.
Be able to debug code written by other engineers and, when needed, propose improvements (in the form of pull requests).
Serve as a key facilitator uniting sellers with Open Source community practices, enabling the sharing of that knowledge inside and outside of our company.
Participate in Executive Briefing Centre (EBC) sessions to capture future level requirements and report them in a consistent manner through the Field Feedback Forum.
Review all technical architecture proposals in the region before the Solution Architect or SSA sends them to customers.
Understand and articulate the business requirements of customers and partners and their drivers for working with Red Hat.
Required Skills
Experience with cloud-native architecture, microservices, container orchestration, and hybrid cloud platforms
Must have deep experience with application networking and security, and a good understanding of concepts and tools such as service mesh, mTLS, container networking interface, SPIFFE/SPIRE
Ability to speak, write, and share insights on behalf of Red Hat in a variety of scenarios, including both internal and external audiences (i.e., should be able to serve as a media spokesperson for their geography).
Strong ability to define and describe complex domain architecture from a logical, functional, and technical perspective.
Ability to implement the deployment of those architectures following DevOps and GitOps patterns.
Contribute code/configurations in at least two languages such as Python, Go, Helm, Ansible, Rust, or Bash.
Experience Required
Strong knowledge of Red Hat’s product portfolio from their domain or specialty perspective.
Demonstrated success with previous customer implementations; ability to engage and expand product cross-sell with customers over time.
Demonstrated capability in reuse of existing patterns (avoiding bespoke solutions for each customer) and a healthy modularization practice which enables reusability.
The salary range for this position is $189,600.00 - $312,730.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave