Job Purpose
This is a 24x7 environment and the position requires shift rotation and/or weekend work.
Responsibilities
- Monitoring and Incident Management
- Monitor systems and applications within the production environment
- Diagnose and fix incidents raised through monitoring tools, conference bridges and chats
- Work with, and escalate to, internal and external teams to implement incident fixes, workarounds and data recovery
- Open and update production incident tickets according to company standards
- Problem Management
- Investigate and update incident tickets with root cause and incident description, ensuring appropriate corrective action follow-up tickets are assigned
- Manage incident tickets to closure, ensuring incident details are complete and accurate, and all corrective actions have been completed
- System and Application Production Readiness
- Work with internal and external teams to expand and maintain operational runbooks and other documentation
- Check application and infrastructure availability and tasks at scheduled times
- Configure monitoring tools and alarms
- Deployment Management
- Production deployments
- Approve and execute production deployment tasks
- Participate in disaster recovery, business continuity and workplace recovery events
- Participate in continuous improvement programs, such as trend analysis of recurring issues
- Provide and report on performance metrics of the environment
- Follow the documented handover process to bring the next shift up to speed and highlight priority items or issues
Knowledge and Experience
- Bachelor’s degree (IT-based) or experience in IT systems support and/or operational support of applications and databases in a Linux/Unix OS environment
- Proficiency in Bash and working knowledge of a broad range of Linux core utilities and scripting
- Working knowledge of networking, specifically TCP and UDP
- Strong communication skills
- Strong general IT skills, including email and MS Office applications
- Able to think logically and critically
- Analytical problem-solving skills with an ability to identify root cause(s)
- Able to work as a team player across the organization
- Able to build and maintain effective relationships with individuals and the team as a whole
- Able to stay organized and decisive under pressure
- Excellent time management skills
- Able to manage priorities and multi-task
- Self-confident and assertive