Job openings: Master Scheduler NPI Board Systems at Nvidia in the USA

More jobs that may interest you

Nvidia Senior Applied AI Software Engineer Distributed Inference Sy... United States, Texas

Bank Of America Senior Engineer-AI Inference United States, Texas, Addison

Red Hat Senior Software Engineer Test AI Inference United States, Massachusetts, Boston

Yesterday

Nvidia DGX NPI System Product Development Engineer United States, California

time type: Full time

posted on: Posted 2 Days Ago

job requisition id

What you'll be doing:

Drive development and productization of NVIDIA’s DGX datacenter products and L11 systems.
Lead debug efforts for L11 rack-level integration, creating and applying tools/scripts for failure identification and root cause analysis.
Provide clear, actionable guidance to factories to resolve issues quickly and implement corrective actions that improve manufacturing quality and efficiency.
Develop and document robust, stable recipes—including diagnostics, firmware, and software—for mass production ramp.
Review and provide feedback on test plans and factory acceptance criteria, focusing on yield, quality, and efficiency.
Collaborate with development, validation, and manufacturing test engineering teams to understand diagnostic and firmware release plans and their impact on system stability and quality.
Influence test engineering and diagnostic teams to enhance test methodology, telemetry, and debug capabilities for precise FRU identification and isolation of firmware/test issues and to improve test coverage with the goal of preventing downstream escapes.
Partner with product development teams to ensure designs are optimized for manufacturing.
Present NPI status updates and critical issues to executive management.

What we need to see:

Expertise in server platform architecture, CPU/GPU baseboards, and high-speed interfaces with proven system-level debug skills and exceptional diagnostic instincts.
Strong knowledge of BMC, firmware architecture, and manufacturing diagnostics.
Familiarity with L11 integration processes.
Excellent communication skills to articulate problems and deliver clear recommendations.
Strong analytical skills to synthesize complex information and provide actionable guidance.
Leadership skills to manage factory operations and drive issue resolution.
Collaborative mindset to work seamlessly with cross-functional teams and external partners.
Results-driven approach to achieving optimal outcomes across all aspects of NPI operations.
12+ years in system engineering, debug, or equivalent relevant experience
BS or higher in Electrical Engineering, Computer Engineering or equivalent experience

You will also be eligible for equity and benefits.

15.11.2025

Nvidia Senior Technical Data Analyst - Operations E2E Intelligent S... United States, California

time type: Full time

posted on: Posted 2 Days Ago

job requisition id

What you'll be doing:

Design intuitive data models and semantic layers to enable self‑service and AI apps reducing ad‑hoc query friction for business users.
Enrich data products with business glossary and metadata to reduce AI hallucinations, improve user adoption, searchability and governance.
Lead multi‑site integrations across new manufacturing plants and ops applications, standardizing schemas and controls to enable cross‑plant insights.
Engineer scalable pipelines with data integrity functions and audit features. Automate measuring and monitoring data quality for improved decision making.
Explain data designs, system changes, and enhancements to stakeholders, and address their questions or issues effectively.
Partner with stakeholders, solve business problems, train users, help with data and queries.
Optimize Lakehouse systems to deliver high-performing solutions while controlling operational costs.

What we need to see:

BS, MS, or PhD in EE/CS or related field of education (or equivalent experience).
5+ years of programming experience (Python, PySpark, SQL, etc.).
5+ years of experience with big data technologies and cloud platforms (AWS, Databricks, Snowflake).
12+ overall years in Data Warehousing, implementing projects with data Lakehouse solutions.
Experience with enterprise BI databases like SAP BW/HANA, ERP/CRM systems like SAP/Salesforce, planning applications like IBP, APO etc.
Knowledge of operational processes in chips, boards, systems, and networking.
Proficiency in Tableau, PowerBI, and SAP reporting applications.

Ways to stand out from the crowd:

Strong analytical skills with the ability to collect, organize, and disseminate significant amounts of information with attention to detail and accuracy.
Highly independent, able to lead key technical decisions, influence project roadmap and work effectively with team members
Proven experience leading multiple analytics projects in a dynamic, fast-paced environment
Data science, AI/ML experience
Positive interpersonal skills with ability to convey good verbal and written communication

You will also be eligible for equity and benefits.

15.11.2025

Nvidia Senior Systems Engineer – High-Performance AI Networking App... United States, Texas

US, WA, Remote

US, CA, Remote

time type: Full time

posted on: Posted 6 Days Ago

job requisition id

What you will be doing:

Collaborate with networking teams to plan, implement, and evaluate performance benchmarks on NVLink-, NVSwitch-, and InfiniBand-powered infrastructures.
Assess findings and work closely with framework, hardware, and support teams to improve system performance across various deep learning workloads.
Act as a primary resource for fixing networking and hardware integration issues, focusing on scalable multi-node systems.
Maintain high communication standards across multiple engineering, support, and R&D teams, ensuring technical and performance goals are met.
Offer technical mentorship and documentation for internal teams and external partners on standard methodologies in HPC networking deployments.
Share insights on improving networking strategies for substantial AI and deep learning infrastructure.

What we need to see:

BS/MS or PhD in Computer Science, Engineering, or related field, or equivalent experience.
8+ years of proven experience in AI/HPC Infrastructure.
Familiarity with AI/HPC job schedulers and orchestrators like Slurm, K8s, or LSF. Practical exposure to AI/HPC workflows employing MPI and NCCL.
Familiarity with High-Speed Networking pertaining to HPC including InfiniBand, RDMA, RoCE, and Amazon EFA.
Understanding of PyTorch, Megatron-LM, and deep learning inference frameworks such as vLLM and SGLang.
Proven experience with InfiniBand, NVLink, and high-speed networking technologies in HPC or large-scale datacenter environments.
Experience investigating and evaluating performance in multi-node systems, especially in deep learning or scientific computing tasks.
Strong analytical, debugging, and technical communication skills.
Comfortable working in collaborative, multi-faceted teams.

Ways to stand out from the crowd:

Mastery in deep learning frameworks or distributed training systems.
Familiarity with datacenter automation, advanced network protocols, and supporting large HPC or AI clusters in production environments.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Experience with networking and communications libraries like NCCL, NIXL, NVSHMEM, UCX.
Experience developing or maintaining cluster management and monitoring tools (e.g., Ansible for infrastructure as a service, Prometheus and Grafana for monitoring).

You will also be eligible for equity and benefits.

15.11.2025

Nvidia Master Scheduler NPI Board System United States, California

time type: Full time

posted on: Posted 2 Days Ago

job requisition id

What you'll be doing:

Develop and maintain a comprehensive capacity plan for NPI boards and systems.
Identify potential capacity bottlenecks that will affect product bring-up. Collaborate with Contract Manufacturers (CMs) and NVIDIA cross-functional teams to develop strategies that proactively resolve these constraints.
For future NPI bring-ups, actively engage with internal and external teams to better understand the product and prepare sufficient capacity.
Communicate capacity status, risks, and issues to team members in a clear and concise manner.
Identify avenues to improve current processes for greater efficiency and predictability.

What we need to see:

Bachelor’s degree in Operations/Supply Chain, Industrial Engineering, or a related field (or equivalent experience).
8+ years of experience in production planning, master scheduling or supply chain manufacturing, preferably in the technology sector.
Outstanding organizational and time-management skills.
Good reasoning and strong analytical and problem-solving skills.
Excellent verbal and written communication skills.
Ability to compete in a fast-paced, dynamic environment.
Strong communication skills and the ability to collaborate with diverse teams.
Advanced MS Excel skills with a deep understanding of Excel techniques. MS Project, Tableau, or similar data visualization tools are a plus.

Ways to stand out from the crowd:

Experience with NPI Planning processes in a tech company.
Knowledge of NVIDIA’s products and technologies.
Demonstrated ability to lead and mentor junior team members.

You will also be eligible for equity and benefits.

10.11.2025

Nvidia Principal Systems Software Engineer United States, Texas

US, IL, Champaign

time type: Full time

posted on: Posted 12 Days Ago

job requisition id

NVIDIA is seeking a Sr. Systems Software Engineer for the Apache Spark Acceleration group. Over the past five years, GPU-accelerated data processing has moved from proof of concept to production deployments. Many enterprises now recognize the need for accelerated computing to handle their large data processing workloads. Multi-node GPU deployments will reduce cloud computing costs and lower latency for batch ETL workloads.

At NVIDIA, we have invested in accelerating Apache Spark by providing an open source plugin for it. Apache Spark is the most popular data processing engine in data centers. We strive to accelerate Spark applications on GPUs without any code changes. We are passionate about working on hard problems that have an impact. You will need strong programming skills and a deep understanding of C++ software development. You will work with a team that is using open source libraries like RAPIDS to accelerate reading, writing, and batch data operations in Spark.

What you'll be doing:

Develop CUDA/C++ libraries to accelerate DataFrames and I/O operations on common file formats such as Parquet, ORC and JSON
Collaborate with distributed systems teams to craft solutions to distributed processing challenges at large scale
Work with open source communities to enhance libraries like RAPIDS, CCCL and UCX through technical discussion and code contributions
Provide recommendations and feedback to teams regarding decisions surrounding topics such as infrastructure, continuous integration and testing strategy
Build, test and optimize CUDA/C++ libraries across different platforms

What we need to see:

BS, MS, or PhD in Computer Science, Computer Engineering, or closely related field (or equivalent experience)
12+ years of work experience in software development
Outstanding technical skills in designing and implementing high-quality distributed systems
Excellent programming skills in C++, Java, and/or Scala
Ability to work with teams across organizational boundaries and geographies
Highly motivated with strong interpersonal skills
OS kernel dev experience is a strong plus

You will also be eligible for equity and benefits.

10.11.2025

Nvidia Senior Quantum Engineer - HPC Systems United States, Texas

time type: Full time

posted on: Posted 9 Days Ago

job requisition id

What You’ll Be Doing:

Take charge of the technical integration of quantum hardware (neutral atom, trapped ion, superconducting) with HPC systems via APIs, middleware, and orchestration layers like CUDA-Q.
Formulate and refine hybrid workflows to enable seamless task distribution between GPU clusters and quantum devices.
Partner closely with quantum hardware suppliers to set up connectivity, control interfaces, and co-design specifications to improve performance, decrease latency, and enable data exchange.
Partner with internal scientists and engineers to install & optimize applications, deploy hybrid workloads, and evaluate system performance.
Work with control systems engineers to ensure environmental, timing, and data interfaces meet quantum hardware requirements.
Prototype and benchmark hybrid applications in materials science, chemistry, optimization, and machine learning to showcase platform capabilities.
Contribute to roadmap planning for adding new quantum modalities (superconducting, photonic) and integrating emerging SDKs.
Represent NVIDIA at technical conferences, workshops, and industry forums, showcasing our advancements and groundbreaking efforts.
Develop comprehensive user documentation and integration guides for internal use and cross-team collaboration.
Drive continuous improvement across software stacks, orchestration layers, and data pipelines connecting quantum and HPC domains.

What We Need to See:

12+ years of experience in HPC system administration, Linux, Slurm, application support, and data management.
Experience with quantum programming frameworks like CUDA-Q, Qiskit, PennyLane, Cirq, Braket, and more.
Proficiency in Python, C++, or Rust for API integration and workflow automation.
Strong understanding of HPC systems, Slurm orchestration, and GPU-accelerated computing environments.
Understanding of quantum hardware systems encompassing neutral-atom, trapped-ion, superconducting, or photonic technologies.
Bachelor’s or Master’s degree or equivalent experience in Physics, Electrical/Computer Engineering, or Computer Science (PhD preferred).
Outstanding communication and collaborator management skills, with the ability to engage both experimental scientists and systems engineers.

Ways to Stand Out from the Crowd:

Demonstrated track record collaborating with quantum hardware providers.
Deep understanding of quantum-classical orchestration frameworks and low-latency data transfer architectures.
Familiarity with cloud-based quantum services and HPC integration standards.
Contributions to open-source quantum frameworks or involvement in academic collaborations.
Success in bridging experimental physics and HPC engineering teams.
Experience representing an organization in technical standards bodies or research consortia.

You will also be eligible for equity and benefits.

Nvidia Senior Software Engineer AI Inference Systems

United States, California

973356704

Yesterday

Description:

US, CA, Santa Clara

time type: Full time

posted on: Posted 6 Days Ago

job requisition id

What you’ll be doing:

Contribute features to vLLM that empower the newest models with the latest NVIDIA GPU hardware features; profile and optimize the inference framework (vLLM) with methods like speculative decoding, data/tensor/expert/pipeline parallelism, and prefill-decode disaggregation.
Develop, optimize, and benchmark GPU kernels (hand-tuned and compiler-generated) using techniques such as fusion, autotuning, and memory/layout optimization; build and extend high-level DSLs and compiler infrastructure to boost kernel developer productivity while approaching peak hardware utilization.
Define and build inference benchmarking methodologies and tools; contribute both new benchmarks and NVIDIA’s submissions to the industry-leading MLPerf Inference benchmarking suite.
Architect the scheduling and orchestration of containerized large-scale inference deployments on GPU clusters across clouds.
Conduct and publish original research that pushes the Pareto frontier for the field of ML Systems; survey recent publications and find ways to integrate research ideas and prototypes into NVIDIA’s software products.

What we need to see:

Bachelor’s degree (or equivalent experience) in Computer Science (CS), Computer Engineering (CE), or Software Engineering (SE) with 7+ years of experience; alternatively, a Master’s degree in CS/CE/SE with 5+ years of experience; or a PhD with a thesis and top-tier publications in ML Systems, GPU architecture, or high-performance computing.
Strong programming skills in Python and C/C++; experience with Go or Rust is a plus; solid CS fundamentals: algorithms & data structures, operating systems, computer architecture, parallel programming, distributed systems, deep learning theory.
Knowledgeable and passionate about performance engineering in ML frameworks (e.g., PyTorch) and inference engines (e.g., vLLM and SGLang).
Familiarity with GPU programming and performance: CUDA, memory hierarchy, streams, NCCL; proficiency with profiling/debug tools (e.g., Nsight Systems/Compute).
Experience with containers and orchestration (Docker, Kubernetes, Slurm); familiarity with Linux namespaces and cgroups.
Excellent debugging, problem-solving, and communication skills; ability to excel in a fast-paced, multi-functional setting.

Ways to stand out from the crowd:

Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).
Hands-on work with ML compilers and DSLs (e.g., Triton, TorchDynamo/Inductor, MLIR/LLVM, XLA), GPU libraries (e.g., CUTLASS) and features (e.g., CUDA Graphs, Tensor Cores).
Experience contributing to containerization/virtualization technologies such as containerd/CRI-O/CRIU.
Experience with cloud platforms (AWS/GCP/Azure), infrastructure as code, CI/CD, and production observability.
Contributions to open-source projects and/or publications; please include links to GitHub pull requests, published papers and artifacts.

You will also be eligible for equity and benefits.
