Key job responsibilities
- Develop and automate tools and frameworks for running training and inference workloads, as well as collecting power and performance metrics.
- Use programming and scripting languages (C/C++, Python, Bash, etc.) to create efficient workflows for automating the execution of complex workloads and data collection processes.
- Perform in-depth data analysis of collected power and performance metrics, identifying key trends, bottlenecks, and opportunities for optimization.
- Build and maintain interactive dashboards and data visualization tools to communicate insights from performance and power data clearly and effectively to technical and non-technical stakeholders.
- Contribute to the creation of end-to-end automated pipelines for continuous power and performance testing, monitoring, and reporting.
- Design and implement data validation processes to ensure the accuracy and integrity of collected power and performance metrics across various workloads.
- Perform root cause analysis of performance and power inefficiencies, using custom scripts and tools to debug and optimize system-level performance.
- Optimize data collection and analysis processes to improve efficiency, scalability, and accuracy in high-performance environments.
- Stay current on emerging tools, technologies, and best practices in power and performance analysis, contributing to the evolution of our tool set and frameworks.
Diverse Experiences
AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.
About AWS
Work/Life Balance
Mentorship & Career Growth
We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.
- 3+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations.
- Strong experience in programming and scripting (Python, Bash, or similar) to automate data collection, analysis, and reporting tasks.
- Proficiency in developing and maintaining custom tools for power and performance measurement in data center environments.
- Experience with data analysis frameworks (e.g., Pandas, NumPy) and data visualization tools (e.g., Matplotlib, Power BI, Tableau) to analyze and present performance data.
- Strong background in automating system-level performance testing and benchmarking.
- Knowledge of machine learning frameworks (e.g., TensorFlow, PyTorch) and related workloads.
- Familiarity with power management techniques such as dynamic voltage and frequency scaling (DVFS), low-power states, and energy-efficient design.
- Experience with developing automated testing pipelines for continuous integration of power and performance metrics.
- Strong problem-solving skills and the ability to debug complex power and performance issues in large-scale systems.