Expoint - all jobs in one place

Microsoft Principal High Performance Computing (HPC)
United States 
610905096

11.06.2024

In this role, you would provide best practices that drive architectural changes. You will also influence the roadmap of relevant software and hardware components. Your work will directly affect the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI and HPC in the cloud.


At supercomputing scale, novel tools and techniques are needed to keep the reliability, runtime performance, and health of the system and its running jobs in line with user expectations. This position is responsible for using state-of-the-art methods to design, build, and validate novel tools, find operational gaps, and instrument features to achieve the smooth operation of cloud-native supercomputers.

Required Qualifications:

  • Bachelor's Degree in Computer Engineering, Electrical Engineering, or related technical field AND 6+ years technical engineering experience in software design and development, with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    • OR equivalent experience
  • 3+ years of experience in Power Architecture
  • 3+ years of experience in running and analyzing HPC or AI applications on clusters
  • 3+ years familiarity with HPC environments and systems

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Master's or PhD in Computer Science, Electrical Engineering, or related areas
  • Exposure to operational challenges of running HPC systems (availability, fault tolerance) and mitigation mechanisms
  • Previous experience with running and troubleshooting machine learning workloads on GPU clusters
  • Exposure to Cloud Computing, Virtualization and Container Technologies

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until July 10, 2024.

Responsibilities

You will join a team of engineers and researchers with experience in high performance computing infrastructure who are acutely familiar with the behavior of bulk synchronous loads in large-scale systems, middleware, and software. The following values drive us:

  • Drive for Results: We’re here to build great products. We take on whatever work is right for the product and strive for the best possible results.
  • Modesty and Adaptability: The right answer is more important than being right. We search for solutions as a team, adapt quickly and value transparent and open feedback.

Your mission will be to help ensure the Azure platform is consistent in power and performance, can scale on demand, and is engineered to withstand unparalleled computing demand from customer workloads. You will help build a test-driven engineering culture to reduce regressions and bugs in production and will set a higher bar for infrastructure quality. In addition, the role carries the following responsibilities:

  • Manages, oversees, provides guidance to, and reviews the work of individual contributors and people managers to accomplish operational plans and results.
  • Provides oversight and support to the Cloud Operations and Innovation group in developing and implementing programs; reports on policy issues regarding current and long-range planning, advising the Azure HPC leadership and recommending solutions.
  • Develops, coordinates, communicates, and implements procedures for reviewing all solutions to drive power design discussions within Azure AI+HPC.
  • Presents to councils, boards, leadership forums, and customers, and represents Azure HPC at meetings and events; attends evening meetings and events based on organizational responsibilities and/or requirements.
  • Evaluates policies/ideas for necessary updates, changes, and additions; develops and recommends options and implementation plans of power stabilization features.
  • Develops best practices to operate and monitor supercomputers running complex workloads.
  • Identifies, tracks, and assesses features to manage power draw or power swings in GPU hardware, rack-level instruments, or datacenters; compiles and submits data, analyses, and reports.
  • Coordinates with department and leadership to create and implement the annual work program, including assignments to staff and participation in software development and review process.
  • Ensures resolution of problems and controversial or difficult technical issues by working with other employees, departments, architects, datacenter teams, software developers, and product/program managers.
  • Assigns, manages, or conducts special studies pertaining to planning and zoning.
  • Establishes and maintains effective working relationships with those interacted with during work regardless of race, color, religious creed, national origin, ancestry, sex, sexual orientation, gender identity, age, genetic information, disability, political affiliation, military service, or diverse cultural and linguistic backgrounds.
  • Reviews and evaluates work methods and procedures and meets with management staff to identify and resolve problems.
  • Assesses and monitors workload; identifies opportunities for improvement and implements changes.
  • Selects, trains, motivates, and evaluates employees; provides or coordinates staff training; works with employees to correct deficiencies; implements discipline procedures per established policies, procedures, and executive guidance.
  • Oversees and participates in the development and administration of the departmental budget; approves the forecast of funds needed for staffing, equipment, materials, and supplies; approves expenditures and implements budgetary adjustments as appropriate and necessary.
  • Embodies our culture and values.