As a Capacity Engineer, you will be responsible for analyzing, modeling, and forecasting the infrastructure needs of our organization. Your expertise in capacity engineering and forecasting will play a crucial role in optimizing resource allocation and ensuring efficient utilization of our technology assets. Experience with public and private Kubernetes-based clouds, application performance principles, and workload optimization will be highly valuable in this role. The successful candidate will have strong analytical skills, superb communication abilities, and a deep understanding of technology infrastructure.
Responsibilities:
Collaborate with Core AI teams across multiple geographies to analyze, plan, and execute AI initiatives. Analyze AI capacity intake requirements for prioritization and scheduling. Seek out and lead performance optimizations of our AI-related assets to ensure efficient use. Scope includes Nvidia SuperPod, on-prem Kubernetes, and Azure- and GCP-based clouds.
Understand key performance metrics and scaling characteristics of LLM and non-LLM AI models
Understand key concepts and sizing metrics related to RAG, vector search, and grounding
Influence customer choice of AI models to improve ROI and cost efficiency
Design and build dashboards to support management of AI workloads and infrastructure. Be familiar with GPU-relevant metrics and how they are used. Strong experience with Grafana, Prometheus, Thanos, and ELK stacks.
Analyze historical data, trends, and growth patterns to develop accurate capacity models and forecasts covering compute, network, storage, and platform optimization requirements
Collaborate with multi-functional teams to gather relevant information on business objectives, technology requirements, and upcoming projects
Evaluate and refine existing capacity engineering processes and methodologies to improve accuracy and efficiency
Monitor system performance metrics and utilization levels to identify potential bottlenecks or areas of underutilization. Take ownership of, and drive the realization of, gains from improved utilization
Collaborate with technology partners to understand future technology trends and initiatives, anticipate resource demands, and develop proactive capacity plans
Run "what-if" analyses to assess the impact of different business scenarios and help guide decision-making
Manage capacity for large federated Kubernetes environments, primarily on private cloud with some public cloud
Communicate capacity insights, recommendations, and performance metrics clearly to key partners. Use data to advocate for initiatives that provide clear business value
Qualifications:
Bachelor's degree in computer science, information systems, statistics or a related field and 2+ years of experience
2 years of experience in capacity engineering, resource allocation, and forecasting in a technology-intensive environment
Specialized experience in AI
Strong proficiency with Grafana, Prometheus, Thanos, Elasticsearch, and Kibana (ELK)
Proficiency in Kubernetes. Should understand how applications operate in a k8s environment and have experience running apps in k8s.
Strong analytical skills with the ability to analyze complex data sets and identify relevant patterns and trends
Familiarity with technology infrastructure components, including Ubuntu Linux, servers, databases, networks, storage systems, and cloud platforms
Knowledge of recommendation engine concepts and their application in infrastructure management
The base pay range for this position is expected to be within the range below:
$95,200 - $168,700