Job responsibilities
- Develop, test, and debug automated tasks for applications, systems, and infrastructure.
- Troubleshoot priority incidents and facilitate blameless post-mortems.
- Work with development teams throughout the software life cycle to ensure sustainable software releases.
- Perform analytics on previous incidents and usage patterns to better predict issues and take proactive actions.
- Build automations to reduce manual interventions for production operations.
- Build real-time monitoring and observability tools and processes.
- Build self-healing and resiliency patterns and drive their adoption.
- Lead and participate in performance tests; identify bottlenecks, opportunities for optimization, and capacity demands.
- Participate in 24x7 support coverage as needed.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 2+ years of applied experience.
- Strong development skills in Java, Python, or Scala.
- Knowledge of data preprocessing, ETL processes, and data pipeline creation.
- Experience with data storage solutions, including SQL and NoSQL databases (Cassandra), data lakes, and S3.
- Proficiency in using AWS cloud services such as EMR, EKS, EC2, and S3 for deploying and managing ML models.
- Familiarity with logging and monitoring tools such as Kibana, Splunk, Elasticsearch, Dynatrace, AppDynamics, Grafana, CloudWatch, and Datadog.
- Experience with continuous integration and continuous deployment (CI/CD) processes using tools like Jenkins and Spinnaker.
- Ability to deploy, scale, and manage ML models in production environments, optimizing for performance and cost-efficiency.
- Strong analytical and troubleshooting skills, with the ability to diagnose and resolve issues in ML pipelines and production systems.
- Excellent communication skills, with the ability to collaborate effectively with data scientists, engineers, and other stakeholders. Willingness to stay current with the latest trends in ML and MLOps.
Preferred qualifications, capabilities, and skills
- Relevant cloud and DevOps certifications (e.g., AWS, Certified Kubernetes Administrator).