

These jobs might be a good fit

This position requires the incumbent to have a sufficient knowledge of English to have professional verbal and written exchanges in this language since the performance of the duties related to this position requires frequent and regular communication with colleagues and partners located worldwide and whose common language is English.

computing for more than 25 years. a unique legacy of innovation fueled by great technology and amazing people. Today,
You will define how AI models are deployed and scaled in production using the NVIDIA Spectrum-X Networking Platform, influencing decisions from inter-node communication and
What You'll Be Doing:
Lead research and development of end-to-end networking solutions for distributed AI training and inference at scale, with a focus on job completion time, failure resiliency, telemetry, scheduling, and placement.
Analyze current deployments, develop prototypes, and recommend architectural improvements.
Stay abreast of the latest research; become the team’s authority in emerging networking techniques and technologies.
Design, simulate, and validate new systems using NSX, a novel, scalable network simulator.
Develop and test prototypes on large-scale GPU clusters (e.g., Israel-1).
Collaborate across hardware, firmware, and software teams to translate ideas into real networking product features.
Publish patents and present research at leading conferences.
What We Need to See:
M.Sc. or PhD (preferred) in Computer Science, Electrical/Computer Engineering, or a related field, or B.Sc. with research experience and publications.
5+ years of relevant experience.
Deep expertise in networking and communication internals (NCCL, RDMA, congestion control, routing).
Strong software engineering skills in C++ and/or Python.
Excellent system-level design and problem-solving abilities.
Outstanding communication and collaboration skills across technical domains.
Ways to Stand Out from the Crowd:
Proven passion for solving sophisticated technical problems and delivering impactful solutions.
Record of publications in top-tier conferences.
Experience in designing and building large-scale AI training clusters.
Post-PhD research experience.
Practical understanding of deep learning systems, GPU acceleration, and AI model execution flows.

What you’ll be doing:
Crafting and developing enterprise-grade systems with a strong focus on scalability, reliability, and performance.
Building and optimizing microservices-based architectures using Kubernetes and cloud-native technologies.
Collaborating closely with backend engineers, product managers, and other partners to deliver impactful solutions.
Writing clean, maintainable, and testable code in Go, contributing to our CI/CD pipelines.
Conducting code and build reviews to uphold high-quality standards and mentor team members.
Leading the development and implementation of advanced identity management systems that secure NVIDIA’s innovative AI and GPU cloud.
Developing scalable multi-tenant solutions that allow our diverse clientele to harness the power of NVIDIA’s platforms securely and efficiently.
Collaborating with multi-functional teams to integrate identity and access management features seamlessly into our products, from cloud services to edge computing devices.
What we need to see:
B.Sc. in Computer Science or a related field (or equivalent experience).
5+ years of experience.
Experience in backend software development, including system design and architecture.
Proficiency in at least one backend programming language (Go preferred).
Strong knowledge in microservices architecture, RESTful APIs, and relational databases.
Proficient knowledge of security guidelines and experience applying them in large-scale systems.
Expertise in implementing OAuth, OIDC, SAML, and other modern authentication protocols (an advantage).
Ways to stand out from the crowd:
Expertise in Kubernetes internals and advanced cloud-native technologies.
Experience working in Linux environments with knowledge of networking, security, and virtualization.
Contributions to open-source projects or active participation in tech communities.
Agile approach and familiarity with standard methodologies.

What you'll be doing:
You will be part of the NVIDIA AIR team, which is building a SaaS/IaaS platform for digital twins of AI data centers.
You will be responsible specifically for the DevOps, infrastructure, and Site Reliability Engineering (SRE) requirements of AIR.
Focus on efficiency by automating repetitive workflows.
Working on a microservices-based architecture.
Deploying and troubleshooting non-disruptive cloud operations with an emphasis on secure production infrastructure.
Continuously evaluating existing systems and driving improvements.
Managing deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and/or other orchestration tools.
Providing day-to-day support for engineering activities with CI/CD tools such as Git and Jenkins.
Multitasking across different tracks to address evolving priorities efficiently.
What we need to see:
B.Sc. in Engineering, relevant certifications, or equivalent experience.
5+ years of experience in complex microservices-based architectures.
Highly skilled in Kubernetes and Docker
Experience in IaaS environments: deploying, configuring, and administering Linux-based bare-metal servers.
Strong networking background (VLANs, routing, VPNs)
Experience with relational databases (MySQL) and SQL.
Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.
Infrastructure as code (IaC) skills in frameworks like Ansible & Terraform
Expert in AWS
Knowledge of the best practices and discipline of managing and monitoring a highly available and secure production infrastructure.
Ways to stand out from the crowd:
Strong expertise in Infrastructure as a Service (IaaS)
Skills in Linux/Unix Administration
Experience with Prometheus/Grafana.
Experience with APM tools like Dynatrace, Datadog, AppDynamics, New Relic, etc.
Experience implementing robust metrics collection and alerting.
