Our solutions development engineers design and develop AI and GenAI solutions integrating GPU-accelerated compute, network, and storage products in Slurm and Kubernetes clustered environments. These solutions allow data scientists and developers to develop, fine-tune, customize, and deploy (for inference) AI models for various use cases and deployment environments (core, edge, cloud, and anything in between).
As a Principal Systems Development Engineer, you will be responsible for the engineering work necessary to develop Dell Integrated Solutions built on Generative AI compute, network, and storage infrastructure. You will use deep technical and industry expertise in AI to deploy, configure, and test AI workloads running on these infrastructures in highly available and scalable clustered configurations.
You will:
- Create architecture designs and write engineering functional, development, and test plans based on Generative AI/HPC solution specifications and requirements. Set up and deploy the technical environment necessary to execute all solution development and testing required, as specified in the engineering test plans.
- Troubleshoot and resolve technical issues and conduct technical reviews of all program deliverables prior to completing engineering work. Capture engineering results and author technical content in the form of design guides, technical white papers, implementation guides, sizing guides, and technical blogs.
- Initiate review and analysis of internally and externally facing engineering documentation with stakeholders. Share innovation and technology knowledge gained from each project to help engineering peers and leaders grow, improve internal processes and future project decisions, and drive ideas for streamlining and automating future development work.
- Lead functional responsibilities of Generative AI/HPC solution projects. Consistently communicate the status of AI solution development work to relevant stakeholders, especially any issues that put the project timeline and/or quality at risk.
Essential Requirements:
- Bachelor’s or Master’s degree in Computer Science or equivalent with 8 to 12 years of related experience (AI, Computer Vision) architecting, developing, and deploying end-to-end solutions, from infrastructure to platform to application to workload to use cases.
- Strong knowledge of AI technologies and HPC, including but not limited to generative/cognitive AI frameworks and models, machine learning/deep learning, AI platforms and libraries (Hugging Face, TensorFlow, etc.), open-source ML frameworks (PyTorch), and HPC software and benchmarks.
- In-depth knowledge of infrastructure deployment, including installation, configuration, and troubleshooting of servers, storage, and networking; Linux-based operating systems (deployment and configuration); and networking fundamentals (IP addressing, subnets, VLANs, etc.).
- In-depth knowledge of cluster management, including deploying and managing containers and Kubernetes and Slurm clusters.
Desirable Requirements:
- Experience with open-source tools for benchmarking AI/GenAI/ML/HPC software and solutions, and with the surrounding ecosystem.
- Outstanding communication skills and the ability to handle multiple simultaneous projects.