In this role you will be responsible for developing, debugging and maintaining software to operate a large compute fleet. You will:
- Automate operations processes via services and tools
- Develop within configuration management and fleet orchestration via SaltStack, Ansible, Puppet, or others
- Design, implement, and maintain robust, scalable, and highly available services that support infrastructure management
- Monitor on-server system performance, identify bottlenecks, and implement solutions to enhance efficiency
- Conduct root cause analysis for on-server system failures and implement preventive measures
- Write and review code, generate and review design documentation
- Participate in qualifications and rollouts of software to production clusters