Design, implement, and maintain complex Linux-based systems.
Troubleshoot and resolve critical system issues with a focus on rapid recovery.
Optimize system performance and ensure high availability.
Implement and maintain security best practices, including OS hardening and compliance.
Automate routine tasks through scripting and configuration management tools.
Cloud Administration:
Manage and maintain cloud-based Linux infrastructure (AWS, Azure, GCP).
Deploy and configure cloud resources such as virtual machines, storage, and networking components.
Optimize cloud costs and implement resource scaling strategies.
Ensure the security and compliance of cloud-based systems.
Root Cause Analysis:
Conduct thorough Root Cause Analysis (RCA) to identify the underlying causes of system failures or performance issues.
Collaborate with cross-functional teams to gather data, replicate issues, and implement permanent solutions.
Create comprehensive RCA reports, system documentation, and knowledge base articles to prevent future occurrences.
Mentorship and Collaboration:
Provide guidance and support to junior team members.
Actively participate in knowledge sharing and contribute to team documentation.
Stay current with industry trends, technologies, and best practices.
What you'll bring
Experience: 5 to 10 years of professional experience in Linux system administration, with a demonstrated ability to perform Root Cause Analysis (RCA).
Technical Skills:
Expert-level Linux knowledge: Deep understanding of Linux internals, kernel architecture, process and memory management, filesystems, and system calls.
Troubleshooting and Diagnostics: Mastery of tools such as top, htop, vmstat, iostat, sar, ps, netstat, ss, journalctl, rsyslog, dmesg, strace, lsof, tcpdump, Wireshark, perf, and systemd-analyze.
Networking: Advanced understanding of TCP/IP, network interfaces, routing, DNS, DHCP, firewalls, and diagnostic tools.
Security: Strong knowledge of security principles, OS hardening, compliance, and tools for vulnerability scanning and intrusion detection.
Scripting and Automation: Proficiency in shell scripting, Python, and/or other scripting languages. Experience with Infrastructure-as-Code and configuration management tools such as Ansible, Puppet, Chef, or Terraform.
Cloud Infrastructure: Experience with AWS, Azure, or GCP, including core compute, storage, and networking services.
Virtualization and Containers: Familiarity with containerization and orchestration tools such as Docker and Kubernetes, as well as other virtualization technologies.
Soft Skills:
Analytical and Problem-Solving: Exceptional ability to analyze complex issues, identify root causes, and implement effective solutions.
Communication: Excellent written and verbal communication skills, with the ability to explain technical findings clearly to both technical and non-technical audiences.
Documentation: Ability to create clear, concise, and comprehensive technical documentation.
Incident Management: Experience with ITIL or similar incident management frameworks.
Continuous Learning: A strong commitment to ongoing learning and professional development.