Designing, implementing and maintaining large scale HPC/AI clusters with monitoring, logging and alerting. Managing Linux job/workload schedules and orchestration tools. Developing and maintaining continuous integration and delivery pipelines. Developing tooling...