Job Responsibilities:
- Provide technical guidance and direction to support business objectives, collaborating with technical teams, contractors, and vendors.
- Develop secure, high-quality production code, and review and debug code written by others.
- Influence product design, application functionality, and technical operations through informed decision-making.
- Advocate for firmwide frameworks, tools, and practices within the Software Development Life Cycle.
- Promote a culture of diversity, equity, inclusion, and respect within the team.
- Architect and deploy secure, scalable cloud infrastructure platforms optimized for AI and machine learning workloads.
- Collaborate with AI teams to translate computational needs into infrastructure requirements.
- Monitor, manage, and optimize cloud resources for performance and cost efficiency.
- Design and implement continuous integration and delivery pipelines for machine learning workloads.
- Develop automation scripts and infrastructure as code to streamline deployment and management tasks.
Required Qualifications:
- Formal training or certification in software engineering concepts with 5+ years of applied experience.
- Hands-on experience in system design, application development, testing, and operational stability.
- Proficiency in programming languages such as Python and/or Golang.
- Ability to independently tackle design and functionality problems with minimal oversight.
- Background in Computer Science, Computer Engineering, Mathematics, or a related technical field.
- Strong knowledge of cloud computing service models (IaaS, PaaS, SaaS) and deployment models (public, private, and hybrid cloud).
- Proficiency in Linux environments, including scripting and administration.
- Foundational understanding of machine learning concepts, including transformer architecture, ML training, and inference.
- Experience with solutions design and engineering, containerization (Docker, Kubernetes), and major cloud service providers (AWS, Azure, GCP).
- Experience with Infrastructure as Code (Terraform, CloudFormation) and automation tools (Ansible, Chef, Puppet).
- Deep understanding of cloud component architecture: microservices, containers, IaaS, storage, security, and routing/switching technologies.
Preferred Qualifications:
- Foundational understanding of NVIDIA GPU infrastructure software (e.g., NVIDIA DCGM, BCM, Triton Inference Server).
- Hands-on experience with ML frameworks and tooling such as PyTorch and TensorBoard.
- Experience with observability tools such as Prometheus and Grafana.
- Experience in MLOps and associated tooling such as MLflow.
- Experience with high-performance computing and machine learning tooling such as vLLM, Ray, and Slurm.
- Strong background in network architecture, database programming (SQL/NoSQL), and data modeling.
- Familiarity with cloud data services and big data processing tools.