Expoint - all jobs in one place

Microsoft Principal AI/HPC Software Engineer 
United States 
891847492

16.07.2024

We are looking for a Principal AI/HPC Software Engineer who cares about quality, wants customers to succeed, and gets things done. You will join a phenomenal team of engineers and researchers with deep experience in high performance computing, machine learning, deep learning, middleware, and software engineering. The following values drive us:

  • Drive for Results: We’re here to build great products. We take on whatever work is right for the product and strive for the best possible results.
  • Modesty and Adaptability: The right answer is more important than being right. We search for solutions as a team, adapt quickly and value transparent and open feedback.

Your mission will be to help ensure the Azure platform delivers consistent performance, scales on demand, and is engineered to withstand the unparalleled computing demand of customer workloads. You will help build a test-driven engineering culture to reduce regressions and bugs in production and will set a higher bar for infrastructure quality.

Required Qualifications:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    • OR equivalent experience
  • 6+ years of experience in software design and development
  • 3+ years of experience in developing and running AI/HPC applications on clusters

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • PhD in Computer Science, Electrical Engineering, or related areas
  • Exposure to operational challenges of running HPC systems (availability, fault tolerance) and mitigation mechanisms
  • Previous experience with running and troubleshooting machine learning workloads on GPU clusters is a plus
  • Exposure to Cloud Computing, Virtualization and Container Technologies
  • Familiarity with HPC software stack

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until July 26, 2024.

Responsibilities:

  • Identifies, tracks, and assesses features in parallel programming layers (such as CUDA or HIP C++) to improve throughput or latency on state-of-the-art GPU hardware, rack-level instruments, or datacenters; compiles and submits data, analyses, and reports.
  • Analyzes the runtime profiles or call graphs of parallel programs running synchronously across hundreds to thousands of devices (GPUs), analogous to known High Performance Computing simulation workloads (e.g., NAMD, LINPACK, SEISMIC).
  • Develops additional instrumentation in application code to log runtime characteristics if not available in standard tools.
  • Communicates with CPU or GPU architects to understand the intellectual merit, performance characteristics, and overhead or readiness of hardware features and supporting software.
  • Reproduces novel ideas and optimization techniques from published literature to accelerate generative AI training and inferencing; develops proofs of concept and measures their impact on the end-to-end runtime of critical applications.
  • Analyzes overheads and performance characteristics of critical software frameworks (e.g., PyTorch, Nvidia CUDA, AMD HIP) in the end-to-end runtime of generative AI training and inferencing.
  • Manages, oversees, provides guidance to, and reviews the work of individual contributors and people managers to accomplish operational plans and results.

Embody our