Our solutions development engineers design and develop AI and GenAI solutions that integrate GPU-accelerated compute, network and storage products in Slurm and Kubernetes clustered environments. These solutions allow data scientists and developers to develop, fine-tune, customize and deploy (for inference) AI models for various use cases and deployment environments (core, edge, cloud and anything in between).
As a Principal Systems Development Engineer, you will be responsible for the engineering work necessary to develop Dell Integrated Solutions based on the Generative AI compute, network, and storage infrastructure environment. You will use deep technical and industry expertise in AI to deploy, configure and test AI workloads running on these infrastructures in highly available and scalable clustered configurations.
You will:
- Create architecture designs and write engineering functional, development and test plans based on the Generative AI solution/HPC specifications and requirements. Set up and deploy the technical environment necessary to execute all solution development and testing required, as specified in the engineering test plans.
- Troubleshoot and resolve technical issues and conduct technical reviews of all program deliverables prior to completing engineering work. Capture engineering results to author technical content in the form of design guides, technical white papers, implementation guides, sizing guides, and technical blogs.
- Initiate review and analysis of internally and externally facing engineering-specific documentation with stakeholders. Share the innovation and technology knowledge gained from each project to help engineering peers and leaders learn, improve internal processes and future project decisions, and drive ideas for streamlining and automating future development work.
- Lead functional responsibilities of Generative AI solution/HPC projects. Consistently communicate the status of AI solution development work to relevant stakeholders, especially any issues that put the project timeline and/or quality at risk.
Essential Requirements:
- 8 to 12 years of experience in infrastructure deployment, installation, configuration, and troubleshooting of servers, storage, and networking on Linux-based operating systems.
- Experience in AI technologies and high-performance computing (HPC), including generative and cognitive AI frameworks and models.
- Experience in machine learning, deep learning, and AI platforms and libraries such as Hugging Face and TensorFlow, open-source ML frameworks such as PyTorch, and high-performance computing software and benchmarks.
- Experience in deploying and configuring networking, including IP addressing, subnets, and VLANs.
- Experience in cluster management, including deploying and managing containers, Kubernetes and Slurm clusters.
Desirable Requirements:
- Experience with open-source tools for benchmarking AI/GenAI/ML/HPC software and solutions, and with the surrounding ecosystem.
- Outstanding communication skills and the ability to handle multiple simultaneous projects.