Expoint – all jobs in one place
Finding the best job has never been easier

Technical Support Engineer – Identity Management jobs at Red Hat in United States, Salt Lake City

Discover your perfect match with Expoint. Search for job opportunities as a Technical Support Engineer – Identity Management in United States, Salt Lake City and join the network of leading companies in the high tech industry, like Red Hat. Sign up now and find your dream job with Expoint.
11 jobs found
06.09.2025

Red Hat Senior Performance Resilience Engineer - LLM Inference United States, New York, City of Albany

Limitless High-tech career opportunities - Expoint
Description:

What you will do:

  • Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD

  • Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)

  • Build an automated harness (preferably extending krkn-chaos, https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)

  • Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates

  • Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)

  • Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d

  • Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.

  • Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums

What you will bring:

  • 3+ years in reliability and/or performance engineering on large-scale distributed systems

  • Expertise in systems‑level software design

  • Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)

  • Observability and forensics skills, with experience in Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives.

  • Fluency in Python (data & ML), strong Bash/Linux skills

  • Exceptional communication skills - able to translate raw data into customer value and executive narratives

  • Commitment to open‑source values and upstream collaboration

The following is considered a plus:

  • Master’s or PhD in Computer Science, AI, or a related field

  • History of upstream contributions and community leadership, or public talks or blogs on resilience or chaos engineering

  • Competitive benchmarking and failure characterization at scale.

The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

04.07.2025

Red Hat Partner Technical Account Manager - OpenShift Container Plat... United States, Missouri, Jefferson City

Description:

About The Job

You will forge relationships with your partners, develop a deep technical understanding of their Red Hat implementation, share technical best practices, and act as the point of contact for any major incidents, managing the partner’s expectations and communications through resolution of such incidents. You will tailor support for each partner, work closely with the extended virtual account team, and advocate on their behalf. At the same time, you'll work closely with our engineering, R&D, product management, and technical support teams to debug, test, and resolve issues. As a PTAM, you will be supported in your career with continuous learning offerings, certification support, and challenging growth opportunities.

What You Will Do

• Develop relationships with key business and IT stakeholders and become an expert on a partner’s solutions by understanding their top business goals and priorities

• Perform technical reviews and share knowledge to proactively identify and prevent issues

• Support technology partners implementing automated and containerized cloud application platform solutions

• Learn new technologies quickly, including topics like container orchestration, container registries, container build strategies, and microservices on container platforms

• Establish and maintain parity with Red Hat cloud technologies strategy

• Engage product engineering teams to help develop solution patterns, based on partner engagements, as well as personal experience, that drive platform adoption

• Communicate how specific Red Hat cloud solutions and our cloud roadmap align to partner use cases

• Forewarn partners of technology changes or potential disruptions to their service and advise on mitigation strategies

• Provide advice and guidance to partners about current and future Red Hat products

• Identify training opportunities and work with our learning and enablement teams to provide targeted training to partner support personnel

• Troubleshoot technical issues and drive issue escalation with Red Hat, partner and customer teams

• Complete analysis and present periodic reviews of operational performance to leadership

• Manage partner support cases and maintain clear and concise case documentation

• Create partner engagement plans and keep documentation relevant to a partner's solution updated

• Manage and grow partner relationships by delivering attentive, relationship-based support

• Build a sense of trust with partners and serve as their advocate within Red Hat

• Contribute internally to the Red Hat team, share knowledge and best practices with team members, contribute to internal projects and initiatives, and serve as a Subject Matter Expert (SME) and mentor for specific technical or process areas

• Travel, as necessary, to visit partners and attend events

What You Will Bring

• Hands-on experience with operating Kubernetes or Kubernetes-based platforms like Red Hat OpenShift Container Platform.

• Expertise with containers and container management

• 3+ years of Linux or UNIX system administration experience

• Experience with cloud or server virtualization

• Experience with Linux, preferably Red Hat Enterprise Linux (RHEL) or a derivative

• Ability to manage and grow existing enterprise partner relationships by delivering proactive, relationship-based support

• Outstanding verbal and written communication skills; ability to convey complex information to partners clearly and concisely

• Competent comprehension of enterprise architecture and strategic business drivers

• Ability to manage multiple issues and projects with an eye for detail

• Direct experience with a variety of technology partners

• Experience with training and presentation delivery

The following are considered a plus

• Experience in a support, operations, development, engineering, or quality assurance organization

• Red Hat Certified Engineer (RHCE) or equivalent experience

• Red Hat Certified Specialist in OpenShift Administration or equivalent experience

• Experience with Amazon Web Services, Azure, or Google Cloud

• Bachelor's degree in a technology-related discipline, preferably computer science or engineering

• Experience working in DevOps environments

• Prior experience in a technical leadership or mentorship role

• Technical knowledge of the Linux kernel and Linux file system

• Expertise with enterprise cloud solutions such as Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), and Software-as-a-Service (SaaS)

• Expertise with cloud management (Red Hat CloudForms, CloudFormation, Terraform, etc.) and IT automation (Red Hat Ansible)

• Software engineering background; experience with RPM-based Linux and Java technologies

• Experience containerizing applications for deployment in cloud environments

• Good comprehension of continuous integration (CI) and continuous delivery (CD) concepts

• Familiarity with source code management tools like Git or SVN

• Knowledge of OpenShift Container Storage and OpenShift Data Foundation

• Experience in software-defined storage technologies like Ceph, Gluster, or other enterprise storage platforms

• Experience in storage configuration, deployment, and administration

The salary range for this position is $94,550.00 - $151,170.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

28.06.2025

Red Hat Senior Technical Support Engineer United States, Utah, Salt Lake City

Description:

What you will do:

  • Provide an exceptional customer experience by communicating professionally and applying product knowledge and deep troubleshooting to perform direct actions in cluster environments to resolve various issues.

  • Contribute to global initiatives and projects to constantly reduce customer effort, improve tooling, and design and write automation software to improve efficiency.

  • Act as the direct contact and advisor for customer inquiries and issues with their Cloud Services through our Customer Portal, conference calls, and remote access.

  • Proactively analyze cluster status, identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions.

  • Record customer interactions including investigation, troubleshooting, and resolution of issues, to document diagnostic steps and issue resolution to create reusable solutions for future incidents.

  • Create and maintain knowledge articles aligned with the KCS (Knowledge-Centered Service) methodology.

  • Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services.

  • Manage incident and issue workloads to ensure that all customer issues are handled and resolved in a timely manner.

  • Maintain a strong work ethic, work effectively as part of a team, and focus on customers and resolving their issues.

  • Be available to perform weekend shift duties on a rotational schedule.

What you will bring:

  • 5+ years of experience in a customer-facing technical support or solutions engineering role.

  • Proven experience in Infrastructure Implementation, Deployment, Administration, and Production Support of container technologies and orchestration platforms (e.g., CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).

  • Experience with developer workflows, Continuous Integration (e.g., Jenkins), and Continuous Deployment paradigms.

  • Exceptional technical, analytical, and troubleshooting skills using tools like curl, strace, oc (kubectl), and Wireshark analysis to investigate and form precise action plans for issue remediation with components such as networking, system performance issues, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.

  • Experience working with tools surrounding the Kubernetes ecosystem such as Prometheus, Grafana, and Fluentd.

  • Experience working with configuration management tools (e.g., Ansible, Terraform) and monitoring and automation tools (e.g., Ansible, Splunk).

  • Proficiency in scripting and automation (e.g., Python, Bash, Go) to convert manual and maintenance functions into fully orchestrated automation is a plus.

  • Ability to operate in complex, highly secure, and highly available environments and interact with Site Reliability Engineering (SRE) domain experts maintaining those environments.

  • Familiarity with established ITIL practices such as Incident, Change, Problem, and Release Management.

  • Excellent English communication skills (written and verbal) and interpersonal skills, with a desire to mentor other members of the support team and share technical knowledge in a helpful and timely fashion.

  • Experience logging issues and working with issue tracking tools such as Jira.

  • Ability to work effectively as part of an agile team, actively communicate status, and complete deliverables on schedule with a strong sense of initiative and ownership.

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

  • Ability to work effectively and collaborate within a geographically distributed, global team.

The salary range for this position is $84,400.00 - $134,970.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

28.06.2025

Red Hat Technical Account Manager - Ansible Automation Platform United States, Utah, Salt Lake City

Description:

What You Will Do

• Support enterprise customers implementing Red Hat Ansible Automation Platform solutions

• Serve as the primary customer advocate within Red Hat, facilitating communication and collaboration across teams

• Deliver Red Hat portfolio roadmap updates and assist customers with product upgrades and implementation

• Rapidly learn and stay current with new technologies, including container orchestration, registries, build strategies, microservices, and automation environments

• Specialize in Ansible Automation Platform, providing expertise on its implementation and use

• Perform technical reviews to proactively identify and prevent issues, sharing knowledge across teams

• Gain a comprehensive understanding of the customer's technical infrastructures, environments, hardware, and product usage

• Investigate and respond to support requests via various channels, including online, phone, video call, chat, etc.

• Provide strategic advice and guidance on current and future Red Hat products and solutions

• Manage customer cases, maintaining clear and concise documentation

• Collaborate with engineering, R&D, product management, and technical support teams

• Create and maintain technical documentation for issue resolution and knowledge sharing

• Manage and grow customer relationships through attentive, relationship-based support

• Visit customer sites as needed and ensure exceptional service experience

What You Will Bring

• Experience in a technical support, software development or engineering, or quality assurance organization

• Extensive technical knowledge of Red Hat Ansible Automation Platform and similar automation technologies, including Chef, Puppet, SaltStack, etc.; broad knowledge of automation practices and principles

• Experience with configuration management, application deployment, and infrastructure orchestration technologies

• Ability to manage and grow existing customer relationships by delivering proactive, relationship-based support

• Outstanding verbal and written communication skills

• Ability to convey complex information to customers clearly and concisely

• Ability to manage multiple issues and projects

• Bachelor's degree in a technology-related discipline is preferred

• Residence within the U.S. Central or Eastern Time Zone

• Software engineering background; experience with RPM-based Linux technologies

• Experience with Linux system administration, preferably Red Hat Enterprise Linux (RHEL) or a derivative

• Experience working in DevOps environments preferred

• Experience with container technologies such as Docker, Podman, and Kubernetes preferred

• Experience deploying applications in cloud environments and developing containerized applications a plus

• Good comprehension of continuous integration (CI) and continuous delivery (CD) concepts preferred

• Familiarity with source code management tools like Git or Apache Subversion (SVN) a plus

The salary range for this position is $94,550.00 - $151,170.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

28.06.2025

Red Hat Senior Technical Support Engineer United States, Nevada, Carson City

Description:

What you will do:

  • Provide an exceptional customer experience by communicating professionally and applying product knowledge and deep troubleshooting to perform direct actions in cluster environments to resolve various issues.

  • Contribute to global initiatives and projects to constantly reduce customer effort, improve tooling, and design and write automation software to improve efficiency.

  • Act as the direct contact and advisor for customer inquiries and issues with their Cloud Services through our Customer Portal, conference calls, and remote access.

  • Proactively analyze cluster status, identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions.

  • Record customer interactions including investigation, troubleshooting, and resolution of issues, to document diagnostic steps and issue resolution to create reusable solutions for future incidents.

  • Create and maintain knowledge articles aligned with the KCS (Knowledge-Centered Service) methodology.

  • Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services.

  • Manage incident and issue workloads to ensure that all customer issues are handled and resolved in a timely manner.

  • Maintain a strong work ethic, work effectively as part of a team, and focus on customers and resolving their issues.

  • Be available to perform weekend shift duties on a rotational schedule.

What you will bring:

  • 5+ years of experience in a customer-facing technical support or solutions engineering role.

  • Proven experience in Infrastructure Implementation, Deployment, Administration, and Production Support of container technologies and orchestration platforms (e.g., CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).

  • Experience with developer workflows, Continuous Integration (e.g., Jenkins), and Continuous Deployment paradigms.

  • Exceptional technical, analytical, and troubleshooting skills using tools like curl, strace, oc (kubectl), and Wireshark analysis to investigate and form precise action plans for issue remediation with components such as networking, system performance issues, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.

  • Experience working with tools surrounding the Kubernetes ecosystem such as Prometheus, Grafana, and Fluentd.

  • Experience working with configuration management tools (e.g., Ansible, Terraform) and monitoring and automation tools (e.g., Ansible, Splunk).

  • Proficiency in scripting and automation (e.g., Python, Bash, Go) to convert manual and maintenance functions into fully orchestrated automation is a plus.

  • Ability to operate in complex, highly secure, and highly available environments and interact with Site Reliability Engineering (SRE) domain experts maintaining those environments.

  • Familiarity with established ITIL practices such as Incident, Change, Problem, and Release Management.

  • Excellent English communication skills (written and verbal) and interpersonal skills, with a desire to mentor other members of the support team and share technical knowledge in a helpful and timely fashion.

  • Experience logging issues and working with issue tracking tools such as Jira.

  • Ability to work effectively as part of an agile team, actively communicate status, and complete deliverables on schedule with a strong sense of initiative and ownership.

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

  • Ability to work effectively and collaborate within a geographically distributed, global team.

The salary range for this position is $84,400.00 - $134,970.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

28.06.2025

Red Hat Partner Technical Account Manager - OpenShift Container Plat... United States, Nevada, Carson City

Description:

About The Job

You will forge relationships with your partners, develop a deep technical understanding of their Red Hat implementation, share technical best practices, and act as the point of contact for any major incidents, managing the partner’s expectations and communications through resolution of such incidents. You will tailor support for each partner, work closely with the extended virtual account team, and advocate on their behalf. At the same time, you'll work closely with our engineering, R&D, product management, and technical support teams to debug, test, and resolve issues. As a PTAM, you will be supported in your career with continuous learning offerings, certification support, and challenging growth opportunities.

What You Will Do

• Develop relationships with key business and IT stakeholders and become an expert on a partner’s solutions by understanding their top business goals and priorities

• Perform technical reviews and share knowledge to proactively identify and prevent issues

• Support technology partners implementing automated and containerized cloud application platform solutions

• Learn new technologies quickly, including topics like container orchestration, container registries, container build strategies, and microservices on container platforms

• Establish and maintain parity with Red Hat cloud technologies strategy

• Engage product engineering teams to help develop solution patterns, based on partner engagements, as well as personal experience, that drive platform adoption

• Communicate how specific Red Hat cloud solutions and our cloud roadmap align to partner use cases

• Forewarn partners of technology changes or potential disruptions to their service and advise on mitigation strategies

• Provide advice and guidance to partners about current and future Red Hat products

• Identify training opportunities and work with our learning and enablement teams to provide targeted training to partner support personnel

• Troubleshoot technical issues and drive issue escalation with Red Hat, partner and customer teams

• Complete analysis and present periodic reviews of operational performance to leadership

• Manage partner support cases and maintain clear and concise case documentation

• Create partner engagement plans and keep documentation relevant to a partner's solution updated

• Manage and grow partner relationships by delivering attentive, relationship-based support

• Build a sense of trust with partners and serve as their advocate within Red Hat

• Contribute internally to the Red Hat team, share knowledge and best practices with team members, contribute to internal projects and initiatives, and serve as a Subject Matter Expert (SME) and mentor for specific technical or process areas

• Travel, as necessary, to visit partners and attend events

What You Will Bring

• Hands-on experience with operating Kubernetes or Kubernetes-based platforms like Red Hat OpenShift Container Platform.

• Expertise with containers and container management

• 3+ years of Linux or UNIX system administration experience

• Experience with cloud or server virtualization

• Experience with Linux, preferably Red Hat Enterprise Linux (RHEL) or a derivative

• Ability to manage and grow existing enterprise partner relationships by delivering proactive, relationship-based support

• Outstanding verbal and written communication skills; ability to convey complex information to partners clearly and concisely

• Competent comprehension of enterprise architecture and strategic business drivers

• Ability to manage multiple issues and projects with an eye for detail

• Direct experience with a variety of technology partners

• Experience with training and presentation delivery

The following are considered a plus

• Experience in a support, operations, development, engineering, or quality assurance organization

• Red Hat Certified Engineer (RHCE) or equivalent experience

• Red Hat Certified Specialist in OpenShift Administration or equivalent experience

• Experience with Amazon Web Services, Azure, or Google Cloud

• Bachelor's degree in a technology-related discipline, preferably computer science or engineering

• Experience working in DevOps environments

• Prior experience in a technical leadership or mentorship role

• Technical knowledge of the Linux kernel and Linux file system

• Expertise with enterprise cloud solutions such as Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), and Software-as-a-Service (SaaS)

• Expertise with cloud management (Red Hat CloudForms, CloudFormation, Terraform, etc.) and IT automation (Red Hat Ansible)

• Software engineering background; experience with RPM-based Linux and Java technologies

• Experience containerizing applications for deployment in cloud environments

• Good comprehension of continuous integration (CI) and continuous delivery (CD) concepts

• Familiarity with source code management tools like Git or SVN

• Knowledge of OpenShift Container Storage and OpenShift Data Foundation

• Experience in software-defined storage technologies like Ceph, Gluster, or other enterprise storage platforms

• Experience in storage configuration, deployment, and administration

The salary range for this position is $94,550.00 - $151,170.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

28.06.2025

Red Hat Technical Account Manager - Ansible Automation Platform United States, Nevada, Carson City

Description:

What You Will Do

• Support enterprise customers implementing Red Hat Ansible Automation Platform solutions

• Serve as the primary customer advocate within Red Hat, facilitating communication and collaboration across teams

• Deliver Red Hat portfolio roadmap updates and assist customers with product upgrades and implementation

• Rapidly learn and stay current with new technologies, including container orchestration, registries, build strategies, microservices, and automation environments

• Specialize in Ansible Automation Platform, providing expertise on its implementation and use

• Perform technical reviews to proactively identify and prevent issues, sharing knowledge across teams

• Gain a comprehensive understanding of the customer's technical infrastructures, environments, hardware, and product usage

• Investigate and respond to support requests via various channels, including online, phone, video call, chat, etc.

• Provide strategic advice and guidance on current and future Red Hat products and solutions

• Manage customer cases, maintaining clear and concise documentation

• Collaborate with engineering, R&D, product management, and technical support teams

• Create and maintain technical documentation for issue resolution and knowledge sharing

• Manage and grow customer relationships through attentive, relationship-based support

• Visit customer sites as needed and ensure exceptional service experience

What You Will Bring

• Experience in a technical support, software development or engineering, or quality assurance organization

• Extensive technical knowledge of Red Hat Ansible Automation Platform and similar automation technologies, including Chef, Puppet, SaltStack, etc.; broad knowledge of automation practices and principles

• Experience with configuration management, application deployment, and infrastructure orchestration technologies

• Ability to manage and grow existing customer relationships by delivering proactive, relationship-based support

• Outstanding verbal and written communication skills

• Ability to convey complex information to customers clearly and concisely

• Ability to manage multiple issues and projects

• Bachelor's degree in a technology-related discipline is preferred

• Residence within the U.S. Central or Eastern Time Zone

• Software engineering background; experience with RPM-based Linux technologies

• Experience with Linux system administration, preferably Red Hat Enterprise Linux (RHEL) or a derivative

• Experience working in DevOps environments preferred

• Experience with container technologies such as Docker, Podman, and Kubernetes preferred

• Experience deploying applications in cloud environments and developing containerized applications a plus

• Good comprehension of continuous integration (CI) and continuous delivery (CD) concepts preferred

• Familiarity with source code management tools like Git or Apache Subversion (SVN) a plus

The salary range for this position is $94,550.00 - $151,170.00. Actual offer will be based on your qualifications.

Pay Transparency

● Comprehensive medical, dental, and vision coverage

● Flexible Spending Account - healthcare and dependent care

● Health Savings Account - high deductible medical plan

● Retirement 401(k) with employer match

● Paid time off and holidays

● Paid parental leave plans for all new parents

● Leave benefits including disability, paid family medical leave, and paid military leave

Find your dream job in the high tech industry with Expoint. With our platform you can easily search for Technical Support Engineer – Identity Management opportunities at Red Hat in United States, Salt Lake City. Whether you're seeking a new challenge or looking to work with a specific organization in a specific role, Expoint makes it easy to find your perfect job match. Connect with top companies in your desired area and advance your career in the high tech field. Sign up today and take the next step in your career journey with Expoint.