As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Digital Private Markets department, you will be at the center of a rapidly growing field in technology. Your skills will be crucial in promoting innovation and modernizing the world's most complex and mission-critical systems. You will play a significant role in managing our core model hosting, deployment, and monitoring infrastructure in AWS. Your expertise will help us solve complex and broad business problems with practical and straightforward solutions. You will also serve as a leader and mentor to junior engineers, enabling the downstream Data Science and ML Engineering teams as they execute on our product roadmap.
Job responsibilities
- Develops and maintains infrastructure as code to support Data Science and Machine Learning initiatives
- Designs and implements automated continuous integration and continuous delivery pipelines for the Data Science teams to develop and train AI/ML models
- Mentors junior MLOps engineers and Data Scientists, setting standards for model deployment and maintenance
- Leads technical discussions with developers, key stakeholders, and team members to resolve complex technical problems
- Builds technical roadmaps in collaboration with senior leadership and identifies risks or design optimizations
- Proactively resolves issues before they impact internal and external stakeholders of deployed models
- Champions the adoption of MLOps best practices within your team
- Optimizes workloads for production and manages performance and observability for these workloads
Required qualifications, capabilities, and skills
- Formal training or certification on MLOps concepts and advanced experience managing the deployment of models in production environments
- Excellent communication skills and the ability to explain technical concepts to non-technical audiences
- Practical knowledge of MLOps culture and principles; familiarity with how to scale these ideas to support multiple data science teams
- Ability to articulate the importance of monitoring and observability in the AI/ML space and to enforce their implementation and use across an organization
- Domain knowledge of machine learning applications and technical processes within the AWS ecosystem
- Extensive expertise with Terraform, containers and container orchestration, especially Kubernetes
- Knowledge of continuous integration and continuous delivery tools such as Jenkins, GitLab, or GitHub Actions, and of associated best practices
- Expert-level proficiency in Python and Bash
- Deep working knowledge of DevOps best practices, Linux, and networking internals
- Understanding of the different roles served by data engineers, data scientists, machine learning engineers, and system architects, and how MLOps contributes to each of these workstreams
Preferred qualifications, capabilities, and skills
- Comfortable with team management, fostering collaboration, promoting design patterns, and presenting technical concepts to non-technical audiences
- Understands how to break down large concepts and goals into smaller requirements and train junior engineers on how to execute against these requirements
- Experience with ML model training and deployment pipelines and with managing scoring endpoints in the financial industry
- Familiarity with observability concepts and telemetry collection using tools such as Datadog, Grafana, Prometheus, Splunk, and others
- Experience working with ML engineering platforms such as Databricks and SageMaker
- Experience working with Data Engineering technologies such as Snowflake and Airflow
- Comfortable troubleshooting common containerization technologies and issues