- Analyze the requirements, constraints and challenges of machine learning in local and global environments, and design or redesign the platform architecture to improve its scalability and agility and to enable new, high-impact use cases
- Develop and implement that design, delivering it as an internal product with the observability needed for efficient system management
- Design and enhance operations automation for infrastructure and platforms, including the tools and processes for monitoring, logging and alerting, to improve scalability in both system construction and runtime operations
- Support development and engineering efforts by providing operational solutions, co-design ML application architecture, and drive coordination among local and global, internal and cross-functional groups to deliver successful outcomes
- Create performance profiles for platforms and services, define service level objectives (SLOs), and drive the measurement, monitoring and evaluation of those objectives
- Lead continuous evaluation of system performance and reliability, discover potential faults, and drive root cause analysis (RCA) and fixes