We are designing and building a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers.
Design, develop, test, debug, and optimize creative solutions for datacenter firmware throughout its lifecycle.
Work closely with hardware, software, infrastructure, and business teams to transform new firmware features from idea to reality.
Define server-level reliability, availability, and serviceability requirements in collaboration with customers such as cloud service providers (CSPs), and deliver fault-resilient solutions at scale that meet customer expectations.
Collaborate with hardware, software, and firmware teams to drive failure analysis and large-scale solution deployment.
Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications.
What we need to see:
Currently pursuing a Bachelor's, Master's, or PhD degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field
Coursework or internship experience required in one or more of the following areas: computer architecture; deep learning or machine learning; GPU computing and parallel programming; performance modeling, profiling, optimization, and analysis
Prior experience with or knowledge of the following programming skills and technologies required: C, C++, Python, Perl, GPU computing (CUDA, OpenCL, OpenACC), deep learning frameworks (PyTorch, TensorFlow, Caffe), and HPC (MPI, OpenMP)