

You will collaborate closely with researchers to design and scale agents - enabling them to reason, plan, call tools and code just like human engineers. You will work on building and maintaining the core infrastructure for deploying and running these agents in production, powering all our agentic tools and applications and ensuring their seamless and efficient performance. If you're passionate about the latest research and cutting-edge technologies shaping generative AI, this role and team offer an exciting opportunity to be at the forefront of innovation.
What you'll be doing:
Design, develop, and improve scalable infrastructure to support the next generation of AI applications, including copilots and agentic tools.
Drive improvements in architecture, performance, and reliability, enabling teams to deploy LLMs and advanced agent frameworks at scale.
Collaborate across hardware, software, and research teams, mentoring and supporting peers while encouraging best engineering practices and a culture of technical excellence.
Stay informed of the latest advancements in AI infrastructure and contribute to continuous innovation across the organization.
What we need to see:
Master's or PhD in Computer Science or a related field, or equivalent experience, with a minimum of 5 years in large-scale distributed systems or AI infrastructure.
Advanced expertise in Python (required), strong experience with JavaScript, and deep knowledge of software engineering principles, OOP/functional programming, and writing high-performance, maintainable code.
Demonstrated expertise in building scalable microservices and web apps with SQL and NoSQL databases (especially MongoDB and Redis) in production, using containers, Kubernetes, and CI/CD.
Solid experience with distributed messaging systems (e.g., Kafka), and integrating event-driven or decoupled architectures into robust enterprise solutions.
Practical experience integrating and fine-tuning LLMs or agent frameworks (e.g., LangChain, LangGraph, AutoGen, OpenAI Functions, RAG, vector databases, prompt engineering); a minimal agent-loop sketch follows this list.
Demonstrated end-to-end ownership of engineering solutions, from architecture and development to deployment, integration, and ongoing operations/support.
Excellent communication skills and a collaborative, proactive approach.
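To make the agent-framework expectations above concrete, here is a minimal, framework-agnostic sketch of a tool-calling agent loop. The `call_llm` stub, the tool names, and the JSON action format are illustrative assumptions rather than a prescribed design; a production system would use a real LLM client and a framework such as LangChain or LangGraph.

```python
import json

# --- Hypothetical tools the agent may call (illustrative only) ---
def search_docs(query: str) -> str:
    """Stand-in for a retrieval step (e.g., a vector-database lookup)."""
    return f"Top result for '{query}': ..."

def run_code(snippet: str) -> str:
    """Stand-in for a sandboxed code-execution tool."""
    return f"Executed {len(snippet)} characters of code."

TOOLS = {"search_docs": search_docs, "run_code": run_code}

def call_llm(messages: list[dict]) -> str:
    """Stub for an LLM call; a real system would call a model API here.
    Returns a JSON 'action' the loop below can dispatch."""
    return json.dumps({"tool": "search_docs",
                       "args": {"query": messages[-1]["content"]},
                       "final": "Here is what I found."})

def agent_loop(user_request: str, max_steps: int = 3) -> str:
    """Minimal reason-act loop: ask the model, dispatch a tool, feed the
    observation back, and stop once the model produces a final answer."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        action = json.loads(call_llm(messages))
        tool = TOOLS.get(action.get("tool"))
        if tool is None:                      # no tool requested -> done
            return action.get("final", "")
        observation = tool(**action["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": observation})
        if action.get("final"):               # model already has an answer
            return action["final"]
    return "Stopped after reaching the step limit."

if __name__ == "__main__":
    print(agent_loop("How do we deploy agents to production?"))
```

The same structure extends naturally to RAG: the retrieval tool would query a vector database, and its observation would be injected into the model's context before the next call.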
You will also be eligible for equity and benefits.

What you'll be doing:
Working with NVIDIA AI Native customers on data center GPU server and networking infrastructure deployments.
Guiding customer discussions on network topologies, compute/storage, and supporting the bring-up of server/network/cluster deployments.
Identifying new project opportunities for NVIDIA products and technology solutions in data center and AI applications.
Conducting regular technical meetings with customers as a trusted advisor, discussing product roadmaps, cluster debugging, and new technology introductions.
Building custom demonstrations and proofs of concept to address critical business needs.
Analyzing and debugging compute/network performance issues.
What we need to see:
BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or related fields, or equivalent experience.
5+ years of experience in Solution Engineering or similar roles.
System-level understanding of server architecture, NICs, Linux, system software, and kernel drivers.
Practical knowledge of networking - switching & routing for Ethernet/InfiniBand, and data center infrastructure (power/cooling).
Familiarity with DevOps/MLOps technologies such as Docker/containers and Kubernetes.
Effective time management and ability to balance multiple tasks.
Excellent communication skills for articulating ideas and code clearly through documents and presentations.
Ways to stand out from the crowd:
External customer-facing skills and experience.
Experience with the bring-up and deployment of large clusters.
Proficiency in systems engineering, coding, and debugging, including C/C++, Linux kernel, and drivers.
Hands-on experience with NVIDIA systems/SDKs (e.g., CUDA), NVIDIA networking technologies (e.g., DPUs, RoCE, InfiniBand), and/or ARM CPU solutions.
Familiarity with virtualization technology concepts.
You will also be eligible for equity and benefits.

As part of the NVIDIA Solutions Architecture team, you will navigate uncharted waters and gray space to drive successful market adoption by balancing strategic alignment, data-driven analysis, and tactical execution across engineering, product, and sales teams. You will serve as a critical liaison between product strategy and large-scale customer deployment.
What you’ll be doing:
Lead end-to-end execution for key hyperscaler customers to bring NVIDIA data center products (e.g., GB200) to market rapidly and at scale.
Partner with the hyperscaler Product Customer Lead to understand strategy, define metrics, and ensure alignment.
Data-Driven Execution: Collect, maintain, and analyze complex data trends to assess the product's market health, identify themes, challenges, and opportunities, and guide the customer to resolution of technical roadblocks.
Problem Solving & Navigation: Navigate complex issues effectively, acting as a pragmatic leader who balances short-term unblocking with long-term process and product improvements.
Executive Communication: Deliver concise, direct executive-level updates and regular status communications to multi-functional leadership on priorities, progress, and vital actions.
Process Improvement: Integrate insights from deployment challenges and customer feedback into future developments for processes and products through close partnership with Product and Engineering teams.
What we need to see:
BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Physics, or other Engineering fields or equivalent experience.
8+ years of combined experience in Solutions Architecture, Technical Program Management, Product Management, System Reliability Engineering, or other complex multi-functional roles.
Proven track record of leading and influencing without direct authority across technical and business functions.
Proven analytical skills, with experience establishing benchmarks, collecting and analyzing intricate data, and distilling it into strategic themes, action items, and executive summaries.
Skilled in reviewing logs and deployment data, and aiding customers in resolving technical concerns (e.g., identifying performance issues associated with AI/ML and system architecture).
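As a small illustration of the log and deployment-data review described in the bullet above, the following sketch aggregates hypothetical cluster telemetry and flags clusters whose recent error rate is trending high. The file name, column names, and threshold are assumptions made for the example, not an NVIDIA tool or data schema.

```python
import pandas as pd

# Hypothetical telemetry export: one row per cluster per day.
# Assumed columns: date, cluster, jobs_run, failed_jobs
df = pd.read_csv("deployment_telemetry.csv", parse_dates=["date"])
df["error_rate"] = df["failed_jobs"] / df["jobs_run"]

# 7-day rolling error rate per cluster to smooth out daily noise.
df = df.sort_values(["cluster", "date"])
df["rolling_error_rate"] = (
    df.groupby("cluster")["error_rate"]
      .transform(lambda s: s.rolling(window=7, min_periods=3).mean())
)

# Flag clusters whose latest rolling error rate exceeds a working threshold.
THRESHOLD = 0.02  # 2% -- an illustrative cutoff, tuned per product in practice
latest = df.groupby("cluster").tail(1)
flagged = latest[latest["rolling_error_rate"] > THRESHOLD]

print(flagged[["cluster", "date", "rolling_error_rate"]]
      .sort_values("rolling_error_rate", ascending=False)
      .to_string(index=False))
```

In practice the output of a script like this would feed the themes, action items, and executive summaries described above, rather than being consumed raw.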
Ways to stand out from the crowd:
Lead multi-functional teams and influence stakeholders to address challenges in customer data center deployments, ensuring cluster health and performance at scale.
Established track record of driving a product from the pilot phase to at-scale deployment in a data center environment.
Hands-on experience with NVIDIA hardware (e.g., H100, GB200) and software libraries, with an understanding of performance tuning and error diagnostics.
Knowledge of DevOps/MLOps technologies such as Docker/containers and Kubernetes, and their relationship to data center deployments.
Demonstrated ability to align on, adopt, and share insights across internal teams (e.g., collaborating with other program leads).
You will also be eligible for equity and benefits.

What you'll be doing:
Path-find technical innovations in Quantum Error Correction and Fault Tolerance, working with multi-functional teams in Product, Engineering, and Applied Research
Develop novel approaches to quantum error correction codes and their logical operations, including methods for implementation and logical operation synthesis (an illustrative decoding sketch follows this list)
Research and co-design improved methods to achieve fault tolerance, such as techniques for logical operations, concatenation, synthesis, distillation, cultivation, or others
Collaborate with internal teams and external partners on developing technology components to enable a fault-tolerant software stack integrated with quantum hardware
Adopt a culture of collaboration, rapid innovation, technical depth, and creative problem solving
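As a minimal, concrete touchstone for the error-correction work described above, here is a classical simulation of the three-qubit bit-flip repetition code with majority-vote decoding. Real QEC research of course targets far richer codes (surface codes, distillation, cultivation), so this is purely illustrative.

```python
import random

def simulate_bit_flip_code(p: float, trials: int = 100_000) -> float:
    """Estimate the logical error rate of the 3-qubit repetition code
    under independent bit-flip noise of probability p, using
    majority-vote decoding. Expected result ~ 3p^2 - 2p^3."""
    failures = 0
    for _ in range(trials):
        logical = random.randint(0, 1)
        qubits = [logical] * 3                                # encode b -> (b, b, b)
        qubits = [q ^ (random.random() < p) for q in qubits]  # bit-flip channel
        # Parity checks, shown only to mirror how stabilizer measurements
        # localize the error without reading out the logical bit directly.
        s1, s2 = qubits[0] ^ qubits[1], qubits[1] ^ qubits[2]
        decoded = int(sum(qubits) >= 2)                       # majority vote
        failures += (decoded != logical)
    return failures / trials

if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10):
        print(f"p={p:.2f}  logical error ~ {simulate_bit_flip_code(p):.4f} "
              f"(theory {3*p**2 - 2*p**3:.4f})")
```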
What we need to see:
Master's degree in Physics, Computer Science, Chemistry, Applied Mathematics, or a related engineering field, or equivalent experience (Ph.D. preferred)
Extensive background in Quantum Information Science with 12+ overall years of experience in the Quantum Computing industry
A demonstrated ability to deliver high impact value in quantum error correction and fault tolerance
Ways to stand out from the crowd:
Hands-on experience in scientific computing, high-performance computing, applied machine learning, or deep learning
Experience with co-design of quantum error correction with quantum hardware or quantum applications
Experience with CUDA and NVIDIA GPUs
Passion to drive technology innovations into NVIDIA software and hardware products to support Quantum Computing
You will also be eligible for equity and benefits.

What you will be doing:
The team provides its services 24/7 in a follow-the-sun model that spans continents. You will report directly to a manager in the United States.
Some CIS shifts require working either a Saturday or a Sunday each week. Hours may include an early or late start (10 hours per day, 4 days per week) to ensure that the US and India teams together provide 24/7 coverage.
Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration, network administration, and security incident monitoring to drive our actions.
CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure.
Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients feel we value their time and effort. You may also perform other tasks that help us provide extraordinary service levels for our customers.
What we need to see:
Highly motivated with strong communication skills, you have the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
5+ years of experience administering large-scale production systems. 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python.
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment; a minimal queue-monitoring sketch follows this list.
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure. Strong cross-team collaboration, documentation, and mentoring skills.
Experience improving processes for automation, reliability, and operational excellence.
Expertise using monitoring tools and problem ticketing systems. Strong problem-solving, analytical, and troubleshooting abilities.
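To illustrate the Slurm-oriented automation referenced in the workload-manager bullet above, here is a small monitoring sketch that summarizes job states from `squeue`. The alert threshold and the idea of feeding the counts into a ticketing or alerting system are assumptions for the example, not an existing runbook.

```python
import subprocess
from collections import Counter

def job_state_counts() -> Counter:
    """Return a count of Slurm job states (RUNNING, PENDING, FAILED, ...).
    Uses `squeue -h -o %T`: -h drops the header, %T prints the job state."""
    out = subprocess.run(
        ["squeue", "-h", "-o", "%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(state for state in out.split() if state)

def main() -> None:
    counts = job_state_counts()
    total = sum(counts.values())
    print(f"{total} jobs:", dict(counts))

    # Illustrative alert rule: a large pending backlog may indicate
    # down/drained nodes or a scheduling problem worth a runbook entry.
    pending = counts.get("PENDING", 0)
    if total and pending / total > 0.5:   # assumed threshold
        print("ALERT: more than half of queued jobs are pending")

if __name__ == "__main__":
    main()
```

A script like this would typically run on a schedule and push its counts into the monitoring and ticketing tools described above, rather than printing to stdout.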
Ways to Stand Out from the Crowd:
Advanced hands-on experience with Kubernetes, Slurm, and large-scale cluster management.
Familiarity with GPU hardware and high-performance computing environments.
Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA). Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on-prem expertise.
You will also be eligible for equity and benefits.

What you'll be doing:
Define and drive architecture for complex, high-volume GPU products, ensuring they meet our ambitious performance and scalability goals.
Perform and guide power and performance evaluation, trade-off assessments, and architectural modeling to identify optimal chip, package and system construction.
Lead improvements in architecture, methodology and tools to improve the scalability of our system, collaborating closely with cross-functional engineering teams.
Specify and optimize SoC subsystems such as memory architecture, test infrastructure and power management.
Collaborate with RTL, verification, physical design, firmware, and software teams to successfully implement and integrate system components.
Produce high-quality technical documentation of SoC architecture, specifications, and development trade-offs.
Provide technical leadership and mentorship to junior architects and engineers, encouraging a culture of excellence and innovation.
What we need to see:
Over 15 years in SoC architecture development or similar technical leadership roles.
Proven track record of defining and delivering multiple high-volume SoC systems (such as CPU, GPU, modem, networking or similar).
Proficiency in power/performance evaluation and architectural modeling in high-level programming languages; a first-order modeling sketch follows this list.
Strong understanding of SoC system fundamentals, including memory hierarchy, coherency, clocking, power domains, boot and reset, test, and debug methodologies.
Hands-on experience with silicon bring-up, debug and tuning.
Excellent interpersonal, leadership, and collaboration skills, with the ability to influence across organizations.
Outstanding documentation, written, and verbal communication skills.
Master's degree (or equivalent experience) in a relevant subject area: Computer Science, Electrical Engineering or Computer Engineering
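As a minimal sketch of the first-order power/performance trade-off modeling referenced above, the following uses the textbook dynamic-power relation P ≈ α·C·V²·f, with performance taken as proportional to frequency. The operating points and constants are illustrative assumptions; a real architectural model would add leakage, memory behavior, and workload effects.

```python
from dataclasses import dataclass

@dataclass
class OperatingPoint:
    name: str
    voltage: float    # volts
    freq_ghz: float   # clock frequency

# First-order dynamic power: P = alpha * C * V^2 * f (leakage ignored here).
ALPHA_C = 2.0  # illustrative combined activity-factor * capacitance constant

def dynamic_power_w(op: OperatingPoint) -> float:
    return ALPHA_C * op.voltage ** 2 * op.freq_ghz

def perf_per_watt(op: OperatingPoint) -> float:
    # Performance modeled as proportional to frequency (compute-bound case),
    # so perf/W reduces to 1 / (ALPHA_C * V^2): lower voltage wins efficiency.
    return op.freq_ghz / dynamic_power_w(op)

if __name__ == "__main__":
    points = [
        OperatingPoint("low",  voltage=0.70, freq_ghz=1.2),
        OperatingPoint("mid",  voltage=0.85, freq_ghz=1.8),
        OperatingPoint("high", voltage=1.00, freq_ghz=2.4),
    ]
    for op in points:
        print(f"{op.name:>4}: {dynamic_power_w(op):5.2f} W, "
              f"perf/W = {perf_per_watt(op):.3f}")
```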
Ways to stand out from the crowd:
Experience with GPU or AI accelerator architecture, including platform aspects, off-chip I/O technologies, and networked multi-GPU systems
Knowledgeable in modern packaging technologies, and their costs and benefits
Knowledgeable in AI workload characteristics
Outstanding analytical and problem-solving skills with a focus on optimizing performance, power, area, and complexity.
You will also be eligible for equity and benefits.

What you'll be doing:
Act as the subject matter expert (SME) for material management processes supporting data center infrastructure hardware across its full lifecycle.
Be responsible for the planning and execution of operational hardware sparing strategies to ensure availability and minimal downtime.
Own the end-of-life (EOL) management process for infrastructure hardware, including decommission planning and material disposition.
Ensure inventory accuracy through ongoing audits, reconciliation processes, and alignment with data center operational needs.
Apply ABC inventory classification methodology to prioritize and optimize stock levels based on usage, cost, and criticality; a minimal classification sketch follows this list.
Maintain and improve material planning models to support forecasting and capacity planning initiatives.
Analyze data trends to drive continuous improvements in inventory optimization, cost control, and operational efficiency.
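As a minimal illustration of the ABC classification mentioned above, the sketch below ranks parts by annual consumption value and assigns classes at the conventional ~80%/15%/5% cumulative-value cutoffs. The part numbers, usage figures, and cutoffs are illustrative assumptions and would be tuned to real spares data.

```python
# part -> (annual usage in units, unit cost in USD); illustrative data only
parts = {
    "NIC-400G":  (120, 1800.0),
    "PSU-3kW":   (300,  450.0),
    "FAN-TRAY":  (900,   60.0),
    "DIMM-64G":  (250,  320.0),
    "CABLE-DAC": (2000,  35.0),
}

# Annual consumption value = usage * unit cost, ranked high to low.
valued = sorted(
    ((name, usage * cost) for name, (usage, cost) in parts.items()),
    key=lambda item: item[1],
    reverse=True,
)
total_value = sum(value for _, value in valued)

# Conventional cutoffs: A ~ top 80% of value, B ~ next 15%, C ~ remainder.
classes, cumulative = {}, 0.0
for name, value in valued:
    cumulative += value
    share = cumulative / total_value
    classes[name] = "A" if share <= 0.80 else ("B" if share <= 0.95 else "C")

for name, value in valued:
    print(f"{name:10s} ${value:>10,.0f}  class {classes[name]}")
```

Class A parts would then get the tightest sparing and audit cadence, while class C parts can tolerate looser stock controls.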
What we need to see:
12+ years of experience in material management, inventory operations, or hardware lifecycle support within a data center infrastructure, manufacturing, or supply chain environment.
Solid grasp of data center hardware components (servers, networking, storage, etc.) and their lifecycle (deployment, sparing, EOL).
Demonstrable experience with inventory control practices, including ABC classification, stock audits, and accuracy initiatives.
Excellent organizational and documentation skills; attention to detail is a must.
Bachelor’s degree in Supply Chain Management, Operations, Logistics, Information Technology, or related field; or equivalent experience.
You will also be eligible for equity and benefits.