Job Description
Assume a critical role in defining the future of a globally recognized firm and make a direct and significant impact in a space designed for top achievers in site reliability.
Job responsibilities
- Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team
- Leads initiatives to improve the reliability and stability of your team’s applications and platforms using data-driven analytics to improve service levels
- Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
- Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise
- Acts as the main point of contact during major incidents for your application and demonstrates the skills to identify and solve issues quickly to avoid financial losses
- Documents and shares knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and 5+ years of experience in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
- Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET, etc.)
- Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
- Proficiency and experience in observability such as white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform, etc.)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker, etc.)
- Experience with troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Drive to self-educate and evaluate new technology
- Ability to teach new programming languages to team members
- Ability to expand and collaborate across different levels and stakeholder groups
- Experience leading developers in a high-impact, fast-paced environment
- Proficiency in programming languages such as Python for model development, experimentation, and integration with the Azure OpenAI API
- Ability to identify and address AI/ML/LLM/GenAI challenges, implement optimizations, and fine-tune models for optimal performance in NLP applications.
- Strong collaboration and communication skills to work effectively with geographically distributed cross-functional teams, communicate complex concepts, and contribute to interdisciplinary projects
- Strong problem-solving and analytical skills with an emphasis on attention to detail
- Experience with cloud platforms for deploying and scaling AI/ML models
- Must have AWS Cloud knowledge
- Experience working with LLMs and executing experiments to push their capability limits and enhance their dependability