Infrastructure Operations: Use OpenStack-based IaaS resources and optimize their provisioning to keep infrastructure operations efficient.
Cross-Node Resource Management: Manage Kubernetes clusters across different regions and availability zones, ensuring optimal performance for use-cases and shared services while minimizing resource consumption.
Logging, Auditing, and Metrics: Implement distributed logging solutions using Loki and OpenSearch. Configure auditing for each use-case and collect Prometheus-based metrics from both platform services and use-cases.
Dashboarding and Monitoring: Develop dashboards tailored to specific needs and monitor the platform using the dashboard tools you create.
Support Platform Use-Cases: Assist use-case development teams in maximizing the platform's capabilities for their projects.
TCO Management: Automate the calculation of the total cost of ownership for platform infrastructure and licenses, and allocate these costs to each use-case.
Collaboration, Documentation, and Training: Collaborate with peers across regions to support various projects, document new changes, and provide training to platform users.
What You Bring:
----------------
Bachelor's degree in Computer Science, Engineering, or a related field; advanced degrees are a plus.
Basic understanding of GPU-based computing concepts, and familiarity with AI/ML frameworks and tools such as CUDA, Kubeflow, Spark, or PyTorch.
Solid knowledge of Kubernetes and container orchestration concepts.
Proficiency in programming and scripting languages (e.g., Python, Go, Shell) for automation and infrastructure management.
Proven experience in infrastructure and operations management for cloud service solutions.
Strong problem-solving skills and the ability to diagnose and resolve complex technical issues.
Excellent communication and collaboration skills to work effectively with cross-functional teams.
Strong attention to detail and the ability to manage multiple priorities in a fast-paced environment.