Your Role and Responsibilities
As a Software Performance Analyst/Tester, you will design and create benchmarks and stress workloads, execute performance measurements with them, analyze the results, and guide Hardware, Operating System, and Software Development teams on performance improvements, collaborating with a global IBM team.
- Design and implement benchmarks and stress workloads for the AIU IO Card and keep them current
- Set up benchmarks and stress workloads and the underlying AIU IO Card configuration for different performance scenarios
- Automate performance measurements and data collection for benchmarks and stress workloads
- Develop and enhance data collection and analysis tools
- Execute performance benchmarks and stress workloads
- Analyze the performance measurements and collected data for performance issues and bottlenecks
- Guide Development teams across the stack (IBM Z Hardware/IBM Research, IBM AIU application stack, Middleware/Applications) on fixing performance issues caused by configuration
- USP: Work on systems built around AIU IO cards, which are developed to provide a platform for training and analysis of IBM Large Language Models for Generative AI
Required Technical and Professional Expertise
As an Individual Developer
- 6-10 years of overall experience in performance measurement, analysis, and system testing
- Bachelor's degree in Computer Science or Information Science
- Basic ML/AI model architecture, training, and inferencing knowledge
- 3 years of experience with PyTorch
- Source code repository systems (e.g. Git), scripting languages, and test automation skills
- Basic Linux administration skills
- Experience working with containers (Docker/Podman)
- Programming languages: Python, C/C++, Bash
- English (fluent) language skills
Keywords: System Testing AND Performance analysis/measurement AND PyTorch AND Docker/Podman AND Python, C/C++, Bash AND Linux Skills
Preferred Technical and Professional Expertise
- Master's degree
- Know-how in AI transformer model design or modification
- Experience in TensorFlow and model inference serving (TF Serving, NVIDIA Triton, vLLM)
- Experience with hardware design and debugging
- Experience in performance profiling (Linux perf) and tracing
- Advanced Linux administration skills
- AI accelerator hardware architecture knowledge (GPU, TPU, AMX)
- Programming languages: CUDA, Java