Job Responsibilities:
- Execute creative software solutions, including design, development, and technical troubleshooting, with the ability to think beyond conventional approaches to build solutions or resolve technical problems.
- Develop secure, high-quality production code, and review and debug code written by others.
- Identify opportunities to eliminate or automate the remediation of recurring issues to enhance the overall operational stability of software applications and systems.
- Lead evaluation sessions with external vendors, startups, and internal teams to drive outcomes-oriented assessments of architectural designs, technical credentials, and their applicability within existing systems and information architecture.
- Lead communities of practice across Software Engineering to promote awareness and adoption of new and leading-edge technologies.
- Contribute to a team culture of diversity, equity, inclusion, and respect.
- Develop and deploy cloud infrastructure platforms that are secure, scalable, and optimized for AI and machine learning workloads.
- Collaborate with AI teams to understand computational needs and translate these into infrastructure requirements.
- Monitor, manage, and optimize cloud resources to maximize performance and minimize costs.
- Design and implement continuous integration and delivery pipelines for machine learning workloads.
- Develop automation scripts and infrastructure as code to streamline deployment and management tasks.
Required Qualifications, Capabilities, and Skills:
- Formal training or certification in software engineering concepts with 5+ years of applied experience.
- Hands-on practical experience in system design, application development, testing, and operational stability.
- Advanced proficiency in one or more programming languages, such as Python or Golang.
- Proficiency in automation and continuous delivery methods.
- Proficiency in all aspects of the Software Development Life Cycle.
- Demonstrated proficiency in software applications and technical processes within a technical discipline (e.g., cloud, artificial intelligence, machine learning, or mobile).
- Proficiency in Linux environments, including scripting and administration.
- Foundational understanding of machine learning concepts, including transformer architecture, ML training, and inference.
- Experience in solutions design and engineering, containerization (Docker, Kubernetes), and cloud service providers (AWS, Azure, GCP).
- Experience with Infrastructure as Code (Terraform, CloudFormation) and automation tools (Ansible, Chef, Puppet).
- Deep understanding of cloud component architecture: Microservices, Containers, IaaS, Storage, Security, and routing/switching technologies.
Preferred Qualifications, Capabilities, and Skills:
- Foundational understanding of NVIDIA GPU infrastructure software (e.g., NVIDIA DCGM, BCM, Triton Inference Server).
- Hands-on experience with ML frameworks and tooling such as PyTorch and TensorBoard.
- Experience with observability tools such as Prometheus and Grafana.
- Experience in MLOps and associated tooling such as MLflow.
- Experience with high-performance computing and machine learning frameworks and schedulers such as vLLM, Ray, and Slurm.
- Strong background in network architecture, database programming (SQL/NoSQL), and data modeling.
- Familiarity with cloud data services and big data processing tools.