Expoint – all jobs in one place
מציאת משרת הייטק בחברות הטובות ביותר מעולם לא הייתה קלה יותר
Limitless High-tech career opportunities - Expoint

Nvidia Senior Software Architect Always-On Profiling 
United States, California 
237524215

28.07.2025
US, CA, Santa Clara
time type
Full time
posted on
Posted 2 Days Ago
job requisition id

What you’ll be doing:

  • Architect and Build Scalable Systems: Drive the design and implementation of the AON profiling service's core systems. You'll master inter-process communication (IPC), memory management, and building low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.

  • Elevate Software Engineering Excellence: Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Our commitment to code quality and robust testing ensures a reliable profiling service.

  • Lead, Mentor, and Innovate: Guide and mentor engineers, provides impactful code reviews, and shape technical roadmaps. Proactively identify complex technical issues within the AON project, break them down, and craft innovative solutions. Your problem-solving prowess will be crucial for AON's success with ML workloads.

  • Drive Full-Stack Development: Transform user needs into clear requirements and design documents. Explore diverse approaches to problems, making well-reasoned recommendations. Lead end-to-end feature development—from planning and prototyping to implementation, testing, and customer evaluation. This involves hands-on development across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.

  • Collaborate Across Boundaries

What we need to see:

  • BS or MS degree or equivalent experience in Computer Engineering, Computer Science, or related degree.

  • 8+ years of meaningful software development experience in C, C++, and Python

  • 10+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.

  • Strong interpersonal, verbal, and written communication, demonstrating the ability to build cross-organizational partnerships and lead technical teams through complex challenges.

  • Profiling & Performance Tools Expert: Extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.

  • GPU & CUDA Proficiency: In-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.

  • ML Ecosystem & Performance Analysis:Familiarity with ML frameworks such as PyTorch and JAX, and knowledge of performance analysis for AI training/inference applications.

  • Large-Scale System Development & Debugging: Experience developing and debugging across complex multi-layered software systems, including user mode and kernel drivers, with a proven ability to contribute to and extend substantial codebases (100s of millions of lines).

  • Proficiency in Designing APIs and Interfaces for Profiling Tools: Designs robust, flexible APIs and interfaces enabling seamless integration of profiling tools with various frameworks and custom code.

  • Mastery of Problem Simplification

Ways to stand out from the crowd:

  • Pioneering Low-Overhead Profiling Systems: A track record of designing and implementing profiling systems with minimal performance impact on target workloads, especially in complex multi-process and distributed environments.

  • Deep Understanding of PyTorch Internals & CUDA Usage: A comprehensive grasp of how PyTorch uses CUDA, including tensor memory, operations, and distributed training functionalities.

  • GPU Performance Analysis & Optimization Acuity: The ability to analyze profiling data and translate it into concrete, actionable insights, particularly within CUDA and ML Frameworks like PyTorch.

  • Translating Customer Needs: Skilled at redefining customer requests into actionable use cases and requirements.

  • Strong understanding of system security principles

You will also be eligible for equity and .