As a Sr. Site Reliability Engineer, you will be responsible for providing the platform that lets mission-critical cloud systems maintain constant uptime, scale seamlessly, and allow new applications and services to flourish.

AS AN SRE IN THIS TEAM, YOU WILL:
- Design and deploy GPU-accelerated VM and container infrastructure using platforms such as KVM, QEMU, AWS, or Google Cloud.
- Implement GPU-based Kubernetes clusters to support containerized applications and services.
- Work with data scientists, developers, and other stakeholders to understand requirements and provide solutions for GPU-accelerated tasks.
- Implement best practices for secure, scalable, and high-availability environments.
- Monitor and optimize resource utilization to ensure performance and cost-efficiency.
- Actively participate in capacity planning, scale testing, and disaster recovery exercises.
- Troubleshoot issues across the entire infrastructure stack.
- Cultivate and maintain relationships with internal and external third-party vendors.