Facebook SiteOps Data Center Production Operations Engineer 
United States, Oregon, Hillsboro 
674981420

27.03.2025
SiteOps Data Center Production Operations Engineer Responsibilities
  • Support platform health by resolving and closing complex tickets while addressing the root cause of the underlying issue, including, but not limited to, remote troubleshooting and physical inspection of services in data halls.
  • Motivate and support team members through identified growth opportunities, champion a positive attitude and work to instill positive team behaviors.
  • Perform root cause analysis of complex technical issues within the data center, ranging from automated tooling to hardware failures and network issues.
  • Support the geographical area and local point of contact on the introduction of new platforms and hardware to the site and area, accelerating the time it takes to bring these products to sustained mass production.
  • Be the Production Operations subject-matter expert with cross-functional teams and external vendors on large scale data center projects and initiatives.
  • Lead collaboration with cross-functional teams on projects and initiatives related to topics such as process, hardware and automation to improve global datacenter operations.
  • Use tools and data analysis effectively to identify issues that are larger in scope and which impact one or multiple Data Centers. Take actions to communicate with all stakeholders appropriately and manage or escalate as needed.
  • Drive corrective actions by working with internal hardware teams and vendors to help drive complex technical issues to resolution, provide an ownership stake in ensuring high quality levels of hardware, and influence future design to ensure ease of serviceability.
  • Utilize expert technical and mentorship skills to enable others in solving complex and systemic hardware and/or software issues at scale.
  • Continuously evaluate and identify areas for improvement in processes, tools, and systems to optimize efficiency throughout the data center.
  • Use data analytics to drive maximum server fleet up-time and utilization rates by understanding hardware failure rates and SLAs to customers. Identify trends and systemic issues in the fleet to drive resolution.
  • Maintain and update documentation such as procedures, runbooks and guides. Apply technical expertise and an understanding of the organization's needs to lead efforts to develop, facilitate and improve org-level technical training.
  • Serve as an escalation point for the local Site On Call Engineer, with participation levels in the on-call rotation varying by site.
  • Travel up to 15% of the time.
Minimum Qualifications
  • BS, BA or BEng in technical field or commensurate experience.
  • 10+ years of technical IT experience within a Data Center environment, in a role such as Lead Engineer, Systems Administrator, DevOps Engineer, or Site Reliability Engineer.
  • Experience leading technical projects related to areas such as process improvement, technology, and/or automation, bringing peers, partners, and other resources into the project both where additional expertise is needed and to provide growth and learning opportunities for others.
  • Expert in Linux in a complex IT environment with the capacity to triage, debug, and troubleshoot complex, systemic issues.
  • Extensive hands-on experience and knowledge of server hardware and components, including storage.
  • Expert knowledge of the interdependencies of data center functions and technologies including electrical, cooling, structured cabling, security, and network.
  • Experience managing multiple technical issues concurrently and driving each to its root cause.
  • Capacity to communicate effectively, in a clear and concise manner, appropriately tailoring messages to the audience. Clearly explains technical problems with data and analysis, and provides detailed feedback and solutions.
  • Experience debugging, modifying and developing code in at least one major scripting or programming language and orchestration system, such as: Bash, PHP, Python, SQL, Rust, Go, Puppet, Chef, or Ansible.
  • Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console.
Preferred Qualifications
  • Experience with large-scale AI implementations.
  • Six Sigma knowledge/certification.
  • PMP or equivalent project portfolio experience.
  • Previous direct people leadership experience.
About Meta

$62.98/hour to $186,000/year + bonus + equity + benefits
Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.