In this role you will be responsible for developing, debugging, and maintaining software to operate a large-scale compute infrastructure. You will:
- Write software to automate operations processes by developing services and tools
- Develop configuration management and fleet orchestration solutions powered by SaltStack, Ansible, Puppet, or others
- Design, implement, and maintain robust, scalable, and highly available services that support infrastructure management
- Monitor on-server system performance, identify bottlenecks, and implement solutions to enhance efficiency
- Conduct root cause analysis for on-server system failures and implement preventive measures
- Write and review code, and generate and review design documentation
- Participate in qualifications and rollouts of software to production clusters