Expoint - all jobs in one place


Microsoft Principal Artificial Intelligence (AI) Software Engineer
United States, Washington 
800607742

01.05.2024

At this supercomputing scale, we need specialized tools and techniques to maintain the reliability, runtime performance, and health of the system, and to keep running jobs meeting users' Service Level Agreements (SLAs). Your job would be to use state-of-the-art tools and techniques, find operational gaps, and instrument features to achieve smooth operation of cloud-native supercomputers.

Required Qualifications:

  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python
    • OR equivalent experience.
  • 5+ years of experience developing and running Artificial Intelligence (AI) or High Performance Computing (HPC) applications on clusters or related
  • 5+ years of experience with AI software
  • 2+ years of experience in tuning performance of AI or HPC applications

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python
    • OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python
    • OR equivalent experience.
  • Hands-on knowledge of Compute Unified Device Architecture (CUDA), C++ Heterogeneous-Compute Interface for Portability (HIP), or related parallel programming domains
  • Exposure to the operational challenges of running Artificial Intelligence (AI)/High Performance Computing (HPC) workloads (availability, fault tolerance) and to mitigation mechanisms
  • Experience with running and troubleshooting Artificial Intelligence/Machine Learning workloads on clusters
  • Exposure to cloud computing, virtualization, and container technologies

Certain roles may be eligible for benefits and other compensation.

Responsibilities:

  • Design and code solutions that improve the management of remote systems.
  • Leads by example within the team by producing extensible and maintainable code.
  • Optimizes, debugs, refactors, and reuses code to improve performance and maintainability, effectiveness, and return on investment (ROI). Applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
  • Holds accountability as a Designated Responsible Individual (DRI) and mentors other engineers across products/solutions, working on call to monitor system/product/service for degradation, downtime, or interruptions.
  • Alerts stakeholders as to status and initiates actions to restore system/product/service for complex issues.
  • Develops a playbook for the team to resolve issues.
  • Coordinates people and resources to ensure DRI responsibilities are covered across teams.
  • Keep infrastructure services running and deliver code updates on a regular cadence to improve performance and reliability.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers.
  • Acts as a key contact for leadership to ensure alignment with partners' expectations. Considers partner teams across own organization and their end goals for products to drive desirable user experiences and meet the dynamic needs of partners/customers through product development.
  • Is dedicated to the mission of helping ensure the Azure platform performs consistently, can scale on demand, and is engineered to withstand the unparalleled computing demand from customer workloads.
  • Helps build a test-driven engineering culture to reduce regressions and bugs in production and sets a higher bar for infrastructure quality.