In this role you will be responsible for developing, debugging and maintaining software to operate a large compute fleet. You will:
- Automate operational processes via services and tools
- Develop within configuration management and fleet orchestration via SaltStack, Ansible, Puppet, or other orchestration tools
- Design, implement, and maintain robust, scalable, and highly available services that support infrastructure management
- Develop and work with large scale Kubernetes clusters
- Monitor on-server system performance, identify bottlenecks, and implement solutions to improve efficiency
- Conduct root cause analysis for on-server system failures and implement preventive measures
- Write and review code, generate and review design documentation
- Participate in qualifications and rollouts of software to production clusters