Your Role and Responsibilities
- Implement and automate infrastructure solutions that support IBM Cloud products and infrastructure
- Develop and administer CI/CD systems and tools for development and test teams
- Keep your assigned site or service up and running, and restore it quickly when a failure occurs
- Work closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
- Automate work, including infrastructure needs, testing, failover solutions, and failure mitigation
- Persistently test application and infrastructure resiliency under a variety of error conditions
- Support the compliance and security integrity of the environment
- Develop, communicate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks
- Stand up and maintain pre-production and developer environments to support the entire development organization and improve overall team velocity
- Use metrics and analytics to determine reliability issues and remove them through automation and tooling
- Be an advocate for our customers, providing them with self-diagnostic tools to resolve common issues that arise in the field
- Participate in code reviews of your peers’ development work, triage and solve live customer issues, and take part in all scrum activities
- Monitor, measure, and improve code and data performance for the applications you help develop
- Be available for on-call shifts during daytime hours and weekends
- All of this will take place in a strong team environment, which necessitates strong communication
Required Technical and Professional Expertise
- 4-8 years of experience delivering code for active Cloud Services/Projects
- Experience debugging complex problems
- Experience designing, building, and operating large-scale production systems
- Expertise in Ansible, Bash, core Python development, and deployments in production environments is a must
- Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef, and the ability to explain the Infrastructure as Code paradigm
- A strong understanding of diverse infrastructure platforms and infrastructure concepts is required
- Systems management experience in Linux/UNIX systems (RHEL preferred)
- Experience in Docker and containerization technologies
- Experience with cloud computing technologies
- Experience with Kubernetes (k8s) CRDs and k8s controller programming using the watch/informer model
- Must have good experience in infrastructure operations automation and IT Service Management, with hands-on exposure to data center administration, configuration, incident management, and support
- 5+ years of working knowledge of one or more operating systems: Ubuntu (preferred), RHEL, CentOS Linux, and Windows Server
- Strong experience with one or more Virtualization technologies: KVM, Xen, Citrix Hypervisor, VMware vSphere, etc.
- Working knowledge of one or more programming tools: Bash, PowerShell, Python, Ruby, and Go
- Strong communication skills
Preferred Technical and Professional Expertise
- Working knowledge with one or more key infrastructure tools/products: Ansible, Chef, etc.
- Working knowledge with Container technologies: Kubernetes, Docker, etc.
- Working knowledge with Monitoring technologies: Zabbix, Splunk, etc.
- Working knowledge with ServiceNow, JIRA, Confluent, and GitHub
- Experience with technologies enabling reliable data processing pipelines, such as Kafka, Elasticsearch, and Splunk, and with database and data visualization technologies for operations, such as SQL databases, InfluxDB, Grafana, and Kibana
- Experience with event monitoring and management ecosystems such as Zabbix, Nagios, Sysdig, LogDNA, and ServiceNow