Design and develop software components for our datacenter control plane, including: resource management and allocation; provisioning & automation of infrastructure components; monitoring and management of datacenter resources; integration with distributed storage systems
Collaborate with cross-functional teams to ensure seamless integration of our software with our datacenter infrastructure
Develop and maintain code for infrastructure software, focusing on areas such as: scalability & performance optimization; availability, reliability, & fault tolerance; automation & orchestration of datacenter operations
Work closely with the Operations team to ensure smooth deployment and operation of infrastructure software
Participate in the testing and validation of infrastructure software to ensure it meets quality and reliability standards
Collaborate with other engineers to identify and resolve technical issues, and to continuously improve the design and operation of our datacenter infrastructure
What You’ll Bring
Degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
5+ years of experience in software development, with a focus on infrastructure software and datacenter operations
Strong programming skills in languages such as Python, Go, or Bash
Experience with Slurm resource management and job scheduling systems
Experience with distributed storage systems such as Ceph, Gluster, or similar technologies
Strong understanding of system design principles, including scalability, availability, and reliability
Experience with agile development methodologies and version control systems such as Git
Excellent problem-solving skills, with the ability to analyze complex technical issues and develop creative solutions
Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams
Knowledge of containerization technologies, such as Docker or Kubernetes, preferred