What You Will Do
Cisco IT is building, developing, and expanding our artificial intelligence platform, which will empower the business to fundamentally change the world. You will be a technical leader in the Infrastructure Services organization, building and managing the internal NVIDIA DGX and Cisco UCS-based AI platforms at Cisco. You will provide leadership in the design and implementation of several GPU compute clusters that run demanding deep learning, high-performance computing, and other computationally intensive workloads. You will be responsible for AI hardware analysis, design, procurement, and support. You will be an expert in identifying architectural changes and/or completely innovative approaches for our artificial intelligence platform.
- Serve as a technical leader who can lead and motivate teams, present, and communicate complex topics.
- Take a hands-on technical role in building and supporting NVIDIA and Cisco UCS-based artificial intelligence platforms.
- Plan, build, and install/upgrade new systems that support NVIDIA DGX and Cisco UCS hardware and software.
- Automate configuration management, software updates, and maintenance and monitoring of GPU system availability using modern DevOps tools (Ansible, GitLab, etc.).
- Lead the advancement of artificial intelligence platforms and practices.
- Evaluate system performance based on industry-relevant benchmarks.
- Identify and optimize performance bottlenecks to drive system and workflow efficiency.
- Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
- Collaborate closely with internal Cisco Business Units, application teams, and cross-functional technical domains.
- Create written technical designs, documents, and presentations.
- Stay up to date with AI industry advancements and cutting-edge technologies.
- Accelerate the delivery of AI capabilities across our portfolio.
- Design new monitoring and alerting tools that help discover failures or issues before our customers do.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Our Minimum Requirements include:
- 7+ years of experience deploying and administering HPC clusters.
- Familiarity with GPU resource schedulers such as Slurm (preferred), Kubernetes, and/or RunAI.
- Proficient in Hybrid Cloud, Virtualization, and Container technologies.
- Experience with provisioning tools like Base Command Manager, Warewulf, Satellite, and/or Ironic.
- Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira), version control systems (e.g., Git), and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
- Experience with automation tools like Ansible, SaltStack, Puppet, and/or Chef.
- Proficient in general-purpose programming languages (Python, GoLang, Bash, and/or C/C++) and development platforms and technologies.
Preferred Qualifications
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Established record of leading technical initiatives and delivering results, with a commitment to fostering a supportive work environment.
- Hard-working and dedicated to providing quality support for your customers.