Your Role and Responsibilities
- Implement and administer infrastructure and solutions that support the IBM Cloud VPC
- Support the compliance and security integrity of the environment through your work
- Partner with other teams, functional managers and program managers to deliver mission-critical services to the market
- Support development of new and enhanced existing capabilities for our compute, storage and network services
- Adopt and build on automation solutions governed by SRE principles, including CI/CD pipelines, configuration management, immutable infrastructure deployment, and auto-healing systems
- Provide technical escalation support for other Infrastructure Operations teams
- Conceptualize, design, implement, and manage reliable, highly performant, scalable automation solutions that build consistency across our infrastructure
- Work with and adopt open source technologies as well as participate in new IBM innovations across IaaS
- Bring a self-driven attitude to proposing, testing, and implementing solutions and improvements for review and consideration by your peers
Required Technical and Professional Expertise
- 5+ years of experience in data center infrastructure or relevant work experience
- 5+ years of experience in large-scale infrastructure design, engineering, and support
- 5+ years of experience in IT change, incident, problem, and asset management
- 5+ years of infrastructure engineering with a proven record of delivering high-quality, large-scale solutions, including experience designing architectures for scale and performance
- 5+ years of practical experience with one or more operating systems: Ubuntu (preferred), CentOS, RHEL, or Debian Linux, and Windows Server
- 5+ years of experience debugging issues across a Linux environment with network, storage, compute and orchestration components. Does not need to be code debugging.
- Development experience with one or more programming languages: PowerShell, Python (preferred), or Ruby
- 2+ years of practical experience with orchestration that uses desired-state and/or finite-state-machine models: Kubernetes (preferred), OpenShift, etc.
- 5+ years of practical experience with containerization and container orchestration: Docker (preferred), Kubernetes (preferred), OpenShift, Rancher, Docker Swarm, Docker Compose
- 5+ years of experience with monitoring technologies: Sysdig (preferred), Grafana, Nagios, Zenoss, ELK, Splunk, Zabbix, etc.
- Familiarity with OpenTelemetry concepts, tracing, metrics, events, and other observability principles
- 2+ years of experience with one or more virtualization technologies: Citrix Xen Hypervisor (preferred), KVM (also preferred), libvirt, QEMU, VMware vSphere, etc.
- 5+ years of experience with one or more automation and configuration management tools/solutions: Ansible and Terraform (preferred), Chef, Python, Bash, Puppet, Rundeck, etc.
- 2+ years of experience with version control systems: GitHub (preferred), GitLab, Subversion, etc.
- Basic experience with databases, both RDBMSs such as MySQL or PostgreSQL and other data stores and engines such as etcd, TimescaleDB, InnoDB, etc. Not a DBA role.
- Working knowledge of network and storage technologies
- Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
- ITIL Foundation V4 certification is a plus
Preferred Technical and Professional Expertise
- Excellent verbal and written communication skills
- Highly responsible, motivated, able to work with little direction
- Experience with design and development of complex systems
- Ability to troubleshoot complex problems and customer issues
- Working knowledge of Linux clustering, HA, and fault-tolerant system implementations: active/active, active/passive, Pacemaker, Keepalived, HAProxy, Corosync, LVM
- 2+ years of experience with complex systems and layered architecture models: OSI, Kubernetes, virtualization, TCP/IP, etc.
- Working knowledge of what TCP/IP, BGP, sockets, routing protocols, routes, and keepalived are and how they figure in debugging and in highly available systems at scale
- Ability to debug an issue across the entire OSI stack of a typical Linux environment across storage, network, compute, OS, system tuning, orchestration.
- Ability to follow stack traces to particular libraries in code and identify root causes
- Working knowledge of message buses and message queues: Kafka (preferred), Spark, RabbitMQ, Redis, etc.
- Extensive experience with databases and debugging their usage with application stacks
- Experience with and understanding of the interactions and dependencies in a typical three-tier model of application stacks, as well as cloud