What you will be doing:
Build tools and frameworks that provide real-time application performance metrics that can be correlated with system metrics.
Develop automation frameworks that empower applications to thoughtfully predict and overcome system/infrastructure failures, ensuring fault tolerance.
Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.
Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.
Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.
What We Need to See:
BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.
Proven experience in at least one of the following areas:
10+ years of experience in analyzing and improving performance of training applications using PyTorch or a similar framework
10+ years of experience with building distributed software applications
10+ years of experience in building storage solutions for Deep Learning applications
10+ years of background in building automated, fault-tolerant distributed applications
5+ years building tools for bottleneck analysis and automation of fault tolerance in distributed environments.
Strong background in parallel programming and distributed systems
Experience analyzing and optimizing large-scale distributed applications.
Excellent verbal and written communication skills
Ways To Stand Out From The Crowd:
Deep understanding of HPC and distributed system architecture with emphasis on RDMA
Hands-on working experience in more than one of the above areas, especially with performance analysis and profiling of Deep Learning workloads.
Comfortable navigating and working with the PyTorch codebase.
Proven understanding of CUDA and GPU architecture
You will also be eligible for equity.