

NVIDIA is seeking a Sr. Systems Software Engineer for the Apache Spark Acceleration group. Over the past five years, GPU-accelerated data processing has moved from proof of concept to production deployments. Many enterprises now recognize the need for accelerated computing to handle their large data processing workloads. Multi-node GPU deployments will reduce cloud computing costs and lower the latency of batch ETL workloads.
At NVIDIA, we have invested in accelerating Apache Spark by providing an open source plugin for it. Apache Spark is the most popular data processing engine in data centers. We strive to accelerate Spark applications on GPUs without any code changes. We are passionate about working on hard problems that have an impact. You will need strong programming skills and a deep understanding of C++ software development. You will work with a team that uses open source libraries like RAPIDS to accelerate reading, writing, and batch data operations in Spark.
What you'll be doing:
Develop CUDA/C++ libraries to accelerate DataFrames and I/O operations on common file formats such as Parquet, ORC and JSON
Collaborate with distributed systems teams to craft solutions to distributed processing challenges at large scale
Work with open source communities to enhance libraries like RAPIDS, CCCL and UCX through technical discussion and code contributions
Provide recommendations and feedback to teams regarding decisions surrounding topics such as infrastructure, continuous integration and testing strategy
Build, test and optimize CUDA/C++ libraries across different platforms
What we need to see:
BS, MS, or PhD in Computer Science, Computer Engineering, or closely related field (or equivalent experience)
12+ years of work experience in software development
Outstanding technical skills in designing and implementing high-quality distributed systems
Excellent programming skills in C++, Java, and/or Scala
Ability to work with teams across organizational boundaries and geographies
Highly motivated with strong interpersonal skills
OS kernel dev experience is a strong plus
You will also be eligible for equity and .
Additional jobs that may interest you

What you’ll be doing:
As a Software Development Engineer in Test, you will take part in technical design and implementation of tests for NVIDIA software products with the goal to identify defects early in the software development lifecycle. You will also build tools that accelerate execution workflows for the organization. In this role you can expect to:
Design and implement automated tests incorporating AI technologies for NVIDIA's device driver software and SDKs on various Windows and Linux operating systems.
Develop automated end-to-end tests for NVIDIA device drivers and SDKs on the Windows platform. Execute manual and automated tests, and identify and report defects. Measure code coverage, then analyze and drive code coverage improvements.
Develop applications and tools that bring data driven insights to development and test workflows.
Build tools, utilities, and frameworks in Python/C/C++ that help automate and optimize testing workflows in the GPU domain.
Write maintainable, reliable, and well detailed code. Debug issues to identify the root cause. Provide peer code reviews including feedback on performance, scalability, and correctness
Optimally estimate and prioritize tasks in order to create a realistic delivery schedule and work on challenging technical and process issues.
Generate and test compatibility across a range of products and interfaces.
Work closely with leadership to report progress by generating effective and impactful reports
What we need to see:
B.E./B.Tech degree in Computer Science/IT/Electronics Engineering with strong academics, or equivalent experience
5+ years of programming experience in Python/C/C++ with experience in applying Object-Oriented Programming concepts.
Hands-on knowledge of developing Python scripts with application development concepts like dictionaries, tuples, RegEx, PIP etc.
Good experience using AI development tools for test plan creation, test case development, and test case automation
Experience with testing RESTful APIs and the ability to conduct performance and load testing to ensure the application can handle high traffic and usage.
Experience working with databases and storage technologies like SQL and Elasticsearch
Good understanding of OS fundamentals, PC Hardware and troubleshooting.
Skilled at debugging issues, with experience using debugging tools like WinDbg/gdb
The ability to collaborate with multiple development teams to gain knowledge and improve test code coverage
Excellent written, verbal, analytical, and problem-solving skills, and the ability to work with a team of engineers in a fast-paced environment
Ways to stand out from the crowd:
Prior project experience with building ML and DL based applications would be a plus
Good understanding of testing fundamentals
Good problem solving skills (solid logic to apply in isolation and regression of issues found).
You will also be eligible for equity and .

What you'll be doing:
Lead documentation planning and prioritization sessions with cross-functional partners, embedding documentation requirements into Product Sprint Goal PRDs from day one
Manage documentation workflow using Kanban, maintaining clear ownership, dependencies, and status visibility while tracking delivery through sprint cycles and product releases
Champion Context Kits (structured prompts and guidelines) that help distributed teams build quality documentation with AI assistance, streamlining reporting and revealing operational insights
Report on critical metrics including coverage, cycle time, sprint predictability, and developer satisfaction, identifying blockers and resolving delivery impediments
Work closely with Technical Program Managers to integrate documentation checkpoints into release trains, facilitating backlog refinement, stand-ups, and retrospectives
What we need to see:
Bachelor's degree (or equivalent experience) with 8+ years in program management, technical operations, or agile delivery, with strong proficiency in Jira, Confluence, and agile tracking tools
Proven track record coordinating work across matrixed organizations with clear communication style—leading effective meetings, writing streamlined updates, and aligning collaborators
Active AI tool user (ChatGPT, Claude, Copilot, or similar) who demonstrates data-driven decision-making and can influence without authority across Product, Engineering, Marketing, and Customer Success
Ways to stand out from the crowd:
Experience coordinating developer documentation in platform or SaaS companies, working alongside Technical Program Managers in complex product organizations
Hands-on experience with Agile/Scrum/Kanban, continuous delivery, and docs-as-code workflows (Git, Markdown, static site generators)
Demonstrated process improvements that measurably boosted team efficiency, with knowledge of developer platforms, SDKs, APIs, or simulation technologies
Background in gaming, graphics, AI, or high-performance computing with proven AI workflow optimization—status reports, meeting summaries, workflow analysis, documentation reviews
Developed tailored GPTs, prompt libraries, Context Kits, or reusable templates that optimized team efficiency and content quality
You will also be eligible for equity and .

Are you a rare mix of technical depth, ecosystem savvy, and
What You’ll Be Doing:
Engage and support ISVs, system integrators, and manufacturers using AI to transform industrial operations.
Partner with developers to help them integrate NVIDIA’s latest vision AI technologies into scalable industrial solutions.
Collaborate with product, engineering, and marketing teams to amplify developer enablement and ecosystem growth.
Drive early adoption of NVIDIA Metropolis and related SDKs, ensuring partner success through hands-on guidance and technical onboarding.
Identify and elevate lighthouse partners demonstrating best-in-class industrial AI use cases.
What We Need To See:
8 years of proven experience in a technical or developer-facing role, ideally within AI, industrial automation, or OT systems integration.
Bachelor’s or advanced degree in computer science, engineering, or related field, or equivalent experience.
Proven success building and supporting developer ecosystems or partner networks.
Strong technical understanding of AI, machine learning, video analytics, or related technologies.
Excellent communication and relationship-building skills, with the ability to convey complex technical concepts clearly across technical and business audiences.
Ability to collaborate multi-functionally to accelerate adoption and scale developer success.
Ways To Stand Out From The Crowd:
Experience applying AI in manufacturing or industrial automation, including computer vision, robotics, or digital-twin workflows.
Experience in manufacturing operations or adjacent industrial fields, with a deep understanding of the unique challenges, risk tolerance, and change-management dynamics that shape technology adoption in these traditionally conservative industries.
Hands-on familiarity with AI for computer vision, robotics, or NVIDIA platforms (Metropolis, Omniverse, CUDA-X)
Proven success enabling ISVs, manufacturers, or system integrators in sectors such as manufacturing, logistics, or energy, guiding them from pilot to scaled deployment.
Passion for helping developers and operators bridge the gap between innovation and production reality through intelligent, AI-powered systems.
You will also be eligible for equity and .

What you'll be doing:
Lead, mentor, and scale a high-performing engineering team focused on deep learning inference and GPU-accelerated software.
Drive the strategy, roadmap, and execution of NVIDIA’s inference frameworks engineering, focusing on SGLang.
Partner with internal compiler, libraries, and research teams to deliver end-to-end optimized inference pipelines across NVIDIA accelerators.
Oversee performance tuning, profiling, and optimization of large-scale models for LLM, multimodal, and generative AI applications.
Guide engineers in adopting best practices for CUDA, Triton, CUTLASS, and multi-GPU communications (NIXL, NCCL, NVSHMEM).
Represent the team in roadmap and planning discussions, ensuring alignment with NVIDIA’s broader AI and software strategies.
Foster a culture of technical excellence, open collaboration, and continuous innovation.
What we need to see:
MS, PhD, or equivalent experience in Computer Science, Electrical/Computer Engineering, or a related field.
6+ years of software development experience, including 3+ years in technical leadership or engineering management.
Strong background in C/C++ software design and development; proficiency in Python is a plus.
Hands-on experience with GPU programming (CUDA, Triton, CUTLASS) and performance optimization.
Proven record of deploying or optimizing deep learning models in production environments.
Experience leading teams using Agile or collaborative software development practices.
Ways to stand out from the crowd:
Significant open-source contributions to deep learning or inference frameworks such as PyTorch, vLLM, SGLang, Triton, or TensorRT-LLM.
Deep understanding of multi-GPU communications (NIXL, NCCL, NVSHMEM) and distributed inference architectures.
Expertise in performance modeling, profiling, and system-level optimization across CPU and GPU platforms.
Proven ability to mentor engineers, guide architectural decisions, and deliver complex projects with measurable impact.
Publications, patents, or talks on LLM serving, model optimization, or GPU performance engineering.
You will also be eligible for equity and .

What you’ll be doing:
Architect and Build Scalable Systems: Drive the design and implementation of the AON profiling service's core systems. You'll master inter-process communication (IPC), memory management, and building low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
Elevate Software Engineering Excellence: Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Our commitment to code quality and robust testing ensures a reliable profiling service.
Lead, Mentor, and Innovate: Guide and mentor engineers, provide impactful code reviews, and shape technical roadmaps. Proactively identify complex technical issues within the AON project, break them down, and craft innovative solutions. Your problem-solving prowess will be crucial for AON's success with ML workloads.
Architect and Build High-Performance Platforms: Transform user needs into clear requirements and design documents. Explore diverse approaches to problems, making well-reasoned recommendations. Lead end-to-end feature development—from planning and prototyping to implementation, testing, and customer evaluation. This involves hands-on development across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
Collaborate Across Boundaries: Partner effectively with diverse internal and external teams. Exceptional communication and collaboration skills are key to integrating AON seamlessly into the broader profiling and ML ecosystem.
What we need to see:
BS or MS degree or equivalent experience in Computer Engineering, Computer Science, or related degree.
6+ years of meaningful software development experience in C, C++, and Python
6+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.
Strong interpersonal, verbal, and written communication, demonstrating the ability to build cross-organizational partnerships and lead technical teams through complex challenges.
Profiling & Performance Tools Expert: Extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.
GPU & CUDA Proficiency: In-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.
ML Ecosystem & Performance Analysis: Familiarity with ML frameworks such as PyTorch and JAX, and knowledge of performance analysis for AI training/inference applications.
Large-Scale System Development & Debugging: Experience developing and debugging across complex multi-layered software systems, including user mode and kernel drivers, with a proven ability to contribute to and extend substantial codebases (100s of millions of lines).
Proficiency in Designing APIs and Interfaces for Profiling Tools: Experience designing robust, flexible APIs and interfaces that enable seamless integration of profiling tools with various frameworks and custom code.
Mastery of Problem Simplification: A history of breaking down ill-defined problems in complex technical domains, designing effective solutions, and leading teams to implement them.
Ways to stand out from the crowd:
Pioneering Low-Overhead Profiling Systems: A track record of designing and implementing profiling systems with minimal performance impact on target workloads, especially in complex multi-process and distributed environments.
Deep Understanding of PyTorch Internals & CUDA Usage: A comprehensive grasp of how PyTorch uses CUDA, including tensor memory, operations, and distributed training functionalities.
GPU Performance Analysis & Optimization Acuity: The ability to analyze profiling data and translate it into concrete, actionable insights, particularly within CUDA and ML Frameworks like PyTorch.
Translating Customer Needs: Skilled at redefining customer requests into actionable use cases and requirements.
Strong understanding of system security principles.
You will also be eligible for equity and .

NVIDIA is looking for outstanding software engineers to help us expand our enterprise GPU management and monitoring tools. In this role you will work closely with the broader NVIDIA team to design and build Linux-based management agents, CLI tools and end-to-end integration solutions that combine GPUs with the rest of the data center software management ecosystem. You will also help maintain our containerized build environment, build process, CI/CD pipelines and infrastructure, and packaging.
We are focused on supporting NVIDIA products across HPC, cloud and enterprise on both bare metal and virtualized platforms as the role of GPUs in all of these environments expands rapidly. Your contributions will span many aspects of GPU system integration, including telemetry and metrics, health checks, diagnostics, configuration, accounting and policy. These tools fill roles of both passive background monitoring and active online management with a core emphasis on operational transparency and seamless integration in customer environments. Your code will support single node developer systems through large clusters with thousands of nodes. To be successful you will need to have a strong Linux C/C++ background, familiarity with distributed software development and a proven work ethic. You will be expected to jump in quickly and provide important contributions from day one. This is a dynamic work environment with many exciting opportunities awaiting. NVIDIA GPUs are central to many hot trends in the enterprise, cloud and datacenter. Come join us as we craft the future of accelerated compute and AI.
What you'll be doing:
Develop robust, scalable C++ user space data center management system software under Linux
Build and maintain user-space libraries, agents, plugins, bindings and CLI tools
Enable GPU management integration with the OSS ecosystem, including Kubernetes and Docker
Maintain build and CI/CD processes to deliver our product on CUDA-supported OSes.
Support internal and external users through bug fixes, documentation and feature improvements
Maintain high quality products through robust test coverage and smart design
What we need to see:
BS or higher in Computer Science or equivalent experience.
5+ years of meaningful industry experience with a strong C++ development background
User space development and debugging expertise under Linux environments
Experience packaging software for Linux package managers (DEB and RPM)
Experience using Kitware utilities to manage builds (CMake, CPack, CTest)
Experience with APIs and interface design
Outstanding written and verbal interpersonal skills. Strong motivation and commitment to learn new skills
Ability to execute all aspects of the software development lifecycle. Ability to manage time in a fast, heavily multitasked environment
Ways to stand out from the crowd:
Development experience with Python, Go, and Rust. Experience developing CI/CD pipelines using GitLab CI, GitHub Actions, or Jenkins
Experience developing containerized environments using Docker (buildx, bake, BuildKit). Exposure to GPU programming with CUDA
Experience developing playbooks, roles, and modules for Ansible configuration. Experience with RESTful web services using CLI tools
You will also be eligible for equity and .
