Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
Improve our monitoring and self-healing pipelines, as well as our security posture
Optimize our server, storage and network performance
Develop new tools in Python, Golang or Bash/Shell
Use Infrastructure as Code best practices
Participate in a 24x7 on-call rotation
What You’ll Bring
Proficiency in Python, Golang and/or Bash
Proficiency with Linux fundamentals and performance optimization
Experience with configuration management software (Ansible, etc.) and with systems monitoring and alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.)
Experience with containerization and orchestration technologies such as Kubernetes
Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus
Experience with Slurm, LSF and storage management of parallel file systems is a plus
Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in a related field
3+ years of additional equivalent experience or evidence of exceptional ability related to the position