

What you will do:
Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)
Build an automated harness (preferably extending krkn-chaos, https://github.com/krkn-chaos/krkn) to run controlled experiments with a scoped blast radius and evidence capture (logs, traces, metrics)
Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates
Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d
Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums
What you will bring:
3+ years in reliability and/or performance engineering on large-scale distributed systems
Expertise in systems‑level software design
Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
Observability and forensics skills, with experience in Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives.
Fluency in Python (data & ML), strong Bash/Linux skills
Exceptional communication skills - able to translate raw data into customer value and executive narratives
Commitment to open‑source values and upstream collaboration
The following is considered a plus:
Master’s or PhD in Computer Science, AI, or a related field
History of upstream contributions and community leadership, or public talks or blogs on resilience or chaos engineering
Competitive benchmarking and failure characterization at scale.
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

About The Job
We’re not just looking for candidates who meet all the requirements, we’re looking for people who are excited about working with us and growing their career at Red Hat. We want to be transparent about what would make you most successful in the role but if you are excited by reading the job description and feel like you are right for the role, we encourage you to apply. This position could lead to regular on-site work with clients across North America, so a willingness to travel to customer locations up to 30-40 weeks per year is required. Applicants must reside within close proximity to a primary airport.
What You Will Do
•Assist in supporting customers in building enterprise technology infrastructures that are scalable, optimally managed, and adaptable to technological improvements using Red Hat technological solutions
•Focus on customer IT Automation and Enterprise Cloud Infrastructure solutions through deep technical hands-on work in these fields
•Continuously learn, grow, and adapt to new skills and technologies
•Work alongside leading financial services, retail, telecommunication, and institutional customers, through virtual and on-site collaboration
After joining Red Hat, you will go through an intensive customized training program on Red Hat technologies and Consulting solutions. Here's how your skill set will evolve and what you'll learn during your first year in the role:
•A baseline understanding of how to build technical solutions, integrated with existing enterprise systems, with technical guidance
•Learn technologies and consulting skills to enhance your abilities through enablements designed and taught by Red Hat experts and Red Hat certification
•Gain exposure and collaboration within Red Hat Services & the larger organization through everyday networking and community events
Within 3 months, be ready to deliver a project by attaining the following:
•Knowledge of how a customer use case can be developed into a project plan and how those requirements align with Red Hat’s technologies
•Continue expanding your knowledge and network, both internal and external, through enablement, communities, customers, and meetups
Within 6 months, begin to demonstrate technical leadership by accomplishing the following:
•Successfully implementing enterprise solutions in customer environments as part of a delivery team
•Engage and share with our internal and external communities of practice on lessons learned, best practices, and how-tos
What You Will Bring
•Experience with delivering a technical implementation as part of a project or team
•Capable of contributing to technical projects through sustained teamwork and collaboration, ensuring the development of practical solutions.
•Ability to be well-organized in a fast-paced, ever-changing environment
•Ability to interact directly with customers across roles and organizations and clearly communicate technical and non-technical concepts
•Demonstrates ability to adapt quickly to new and unknown situations, ranging from managing deliverables to learning new technologies.
•Practical experience with at least one coding or scripting language. Examples include but are not limited to Java, Python, C++, YAML, Bash, JavaScript, React, etc.
•Familiarity with backend software development methodologies, frameworks, and development principles, including Agile, Code Management (Git), Software Development Life Cycle, etc.
•Interest in diving deep into backend software development, IT automation, cloud infrastructure, CI/CD, DevOps, and Artificial Intelligence
•Knowledge of and some experience with at least one Red Hat technology such as Red Hat Enterprise Linux, Red Hat OpenShift, or Red Hat Ansible is a plus
•Prior experience working in a customer-facing role is preferred
•Familiarity with open source software and open source as a business model is a plus
•Knowledge of Red Hat's product portfolio and subscription business model is a plus
The salary range for this position is $75,320.00 - $120,480.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

About The Job
You will forge relationships with your partners, develop a deep technical understanding of their Red Hat implementation, share technical best practices, and act as point of contact for any major incidents, managing the partner’s expectations and communications through resolution of such incidents. You will tailor support for each partner, work closely with the extended virtual account team and advocate on their behalf. At the same time, you'll work closely with our engineering, R&D, product management, and technical support teams to debug, test, and resolve issues. As a PTAM, you will be supported in your career with continuous learning offerings, certification support, and challenging growth opportunities.
What You Will Do
• Develop relationships with key business and IT stakeholders and become an expert on a partner’s solutions by understanding their top business goals and priorities
• Perform technical reviews and share knowledge to proactively identify and prevent issues
• Support technology partners implementing automated and containerized cloud application platform solutions
• Learn new technologies quickly, including topics like container orchestration, container registries, container build strategies, and microservices on container platforms
• Establish and maintain parity with Red Hat cloud technologies strategy
• Engage product engineering teams to help develop solution patterns, based on partner engagements, as well as personal experience, that drive platform adoption
• Communicate how specific Red Hat cloud solutions and our cloud roadmap align to partner use cases
• Forewarn partners of technology changes or potential disruptions to their service and advise on mitigation strategies
• Provide advice and guidance to partners about current and future Red Hat products
• Identify training opportunities and work with our learning and enablement teams to provide targeted training to partner support personnel
• Troubleshoot technical issues and drive issue escalation with Red Hat, partner and customer teams
• Complete analysis and present periodic reviews of operational performance to leadership
• Manage partner support cases and maintain clear and concise case documentation
• Create partner engagement plans and keep documentation relevant to a partner's solution updated
• Manage and grow partner relationships by delivering attentive, relationship-based support
• Build a sense of trust with partners and serve as their advocate within Red Hat
• Contribute internally to the Red Hat team, share knowledge and best practices with team members, contribute to internal projects and initiatives, and serve as a Subject Matter Expert (SME) and mentor for specific technical or process areas
• Travel, as necessary, to visit partners and attend events
What You Will Bring
• Hands-on experience with operating Kubernetes or Kubernetes-based platforms like Red Hat OpenShift Container Platform.
• Expertise with containers and container management
• 3+ years of Linux or UNIX system administration experience
• Experience with cloud or server virtualization
• Experience with Linux, preferably Red Hat Enterprise Linux (RHEL) or a derivative
• Ability to manage and grow existing enterprise partner relationships by delivering proactive, relationship-based support
• Outstanding verbal and written communication skills; ability to convey complex information to partners clearly and concisely
• Competent comprehension of enterprise architecture and strategic business drivers
• Ability to manage multiple issues and projects with an eye for detail
• Direct experience with a variety of technology partners
• Experience with training and presentation delivery
The following are considered a plus
• Experience in a support, operations, development, engineering, or quality assurance organization
• Red Hat Certified Engineer (RHCE) or equivalent experience
• Red Hat Certified Specialist in OpenShift Administration or equivalent experience
• Experience with Amazon Web Services, Azure, or Google Cloud
• Bachelor's degree in a technology-related discipline, preferably computer science or engineering
• Experience working in DevOps environments
• Prior experience in a technical leadership or mentorship role
• Technical knowledge of the Linux kernel and Linux file system
• Expertise with enterprise cloud solutions such as Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), and Software-as-a-Service (SaaS)
• Expertise with cloud management (Red Hat CloudForms, AWS CloudFormation, Terraform, etc.) and IT automation (Red Hat Ansible)
• Software engineering background; experience with RPM-based Linux and Java technologies
• Experience containerizing applications for deployment in cloud environments
• Good comprehension of continuous integration (CI) and continuous delivery (CD) concepts
• Familiarity with source code management tools like Git or SVN
• Knowledge of OpenShift Container Storage and OpenShift Data Foundation
• Experience in software-defined storage technologies like Ceph, Gluster, or other enterprise storage platforms
• Experience in storage configuration, deployment, and administration
The salary range for this position is $94,550.00 - $151,170.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

What you will do:
Provide an exceptional customer experience by communicating professionally and applying product knowledge and deep troubleshooting to perform direct actions in cluster environments and resolve a variety of issues.
Contribute to global initiatives and projects to constantly reduce customer effort, improve tooling, and design and write automation software to improve efficiency.
Act as the direct contact and advisor for customer inquiries and issues with their Cloud Services through our Customer Portal, conference calls, and remote access.
Proactively analyze cluster status, identify single points of failure and other high-risk architecture issues; propose and implement more resilient resolutions.
Record customer interactions including investigation, troubleshooting, and resolution of issues, to document diagnostic steps and issue resolution to create reusable solutions for future incidents.
Create and maintain knowledge articles aligned with the KCS (Knowledge-Centered Service) methodology.
Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services.
Manage incident and issue workloads to ensure that all customer issues are handled and resolved in a timely manner.
Maintain a strong work ethic, work effectively as part of a team, and focus on customers and resolving their issues.
Be available to perform weekend shift duties on a rotational schedule.
What you will bring:
5+ years of experience in a customer-facing technical support or solutions engineering role.
Proven experience in Infrastructure Implementation, Deployment, Administration, and Production Support of container technologies and orchestration platforms (e.g., CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).
Experience with developer workflows, Continuous Integration (e.g., Jenkins), and Continuous Deployment paradigms.
Exceptional technical, analytical, and troubleshooting skills using tools like curl, strace, oc (kubectl), and Wireshark analysis to investigate and form precise action plans for issue remediation with components such as networking, system performance issues, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.
Experience working with tools surrounding the Kubernetes ecosystem such as Prometheus, Grafana, FluentD, etc.
Experience working with configuration management tools (e.g., Ansible, Terraform) and monitoring and automation tools (e.g., Ansible, Splunk).
Proficient scripting and automation skills (e.g., Python, Bash, Go) to convert manual and maintenance functions into fully orchestrated automation is a plus.
Ability to operate in complex, highly secure, and highly available environments and interact with Site Reliability Engineering (SRE) domain experts maintaining those environments.
Familiarity with established ITIL practices such as Incident, Change, Problem, and Release Management.
Excellent English communication skills (written and verbal) and interpersonal skills, with a desire to mentor other members of the support team and share technical knowledge in a helpful and timely fashion.
Experience logging issues and working with issue tracking tools such as Jira.
Ability to work effectively as part of an agile team, actively communicate status, and complete deliverables on schedule with a strong sense of initiative and ownership.
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
Ability to work effectively and collaborate within a geographically distributed, global team.
The salary range for this position is $84,400.00 - $134,970.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

About the job:
Red Hat is hiring a Specialist Solution Architect (OpenShift) for Financial Services North America. The Specialist will drive sales of the OpenShift Platform. In this role, you will be the catalyst creating opportunities, solving problems, and establishing working relationships with customers in our key FSI enterprise accounts and service provider partners. You’ll need to possess excellent communication and people skills balanced with technical expertise, passion for open source, and a thorough understanding of business and IT challenges encountered in financial services. Through a series of structured in-person interactions, the Specialist Solution Architect will win the trust and confidence of customer engineering, development, and operations teams by aligning their requirements and use cases with the functional capabilities of the OpenShift Platform.
What you will do:
Act as a technical advisor, guiding customers from presales to post-sales implementation, ensuring successful deployments.
Lead technical validation through demos, workshops, and pilot projects to align customer needs with OpenShift capabilities.
Develop reusable solution frameworks and content to empower sales teams and standardize customer outcomes.
Collaborate with product teams to enhance customer experience and advocate for customer needs internally.
Assist your team with responding to RFPs for customer success
What you will bring:
Technical Skills:
Expertise in the OpenShift Platform (certifications preferred) or Kubernetes. Strong hands-on skills.
5+ years in OpenShift or Kubernetes; 5-10 years in architecture/development/consulting roles.
Proficiency in Linux and DevOps methodologies.
Proficiency in configuring accelerators such as NVIDIA, AMD or Intel GPUs.
Business Skills:
Ability to engage engineers and architects, address enterprise IT challenges, and propose cross-platform solutions.
Experience building relationships across large IT organizations and managing end-to-end proof-of-concept processes.
Preferred Qualifications:
Red Hat certifications (Red Hat Certified OpenShift Architect, RHCE) and a degree in Computer Science/Engineering.
Thought leadership through industry contributions (whitepapers, conferences, etc.) and staying updated on Kubernetes and AI trends.
The salary range for this position is $177,540.00 - $283,950.00 (inclusive of base pay + target incentive compensation). Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave