Being the cybersecurity partner of choice, protecting our digital way of life.
Your Impact
As a Principal Engineer in the Global SRE Automation group, you will shape the future of infrastructure reliability, scale, and developer productivity. You will lead the design and development of cloud-native automation tools, streamline operational workflows, and embed resilience into every layer of the platform.
You will:
- Architect and build automation systems that support self-healing, observability, and service-level assurance
- Contribute to the developer experience and internal tooling ecosystem, driving reliability through code
- Influence the SRE strategy by introducing innovations in cloud-native backend services, Kubernetes automation, and platform engineering
- Partner with global teams to deliver reliable infrastructure, integrating AI models, event-driven systems, and data pipelines to unlock operational insights
- Set standards for code quality, system design, and operational excellence across the organization
Your Experience
- 10+ years of experience in Cloud Engineering, DevOps, or Infrastructure Software Development, with a strong focus on automation, reliability, and platform scalability
- Deep expertise in AWS and Google Cloud Platform (GCP), with a strong understanding of networking, compute, serverless, and cost-optimization services
- Proficient in Python or Go, with a solid grasp of modern backend development frameworks (e.g., Flask, FastAPI, Gin) and cloud-native application design
- Hands-on experience building RESTful APIs, microservices, and cloud-native platforms supporting high availability and self-service
- Experience designing and integrating Generative AI and LLM-based pipelines, including Retrieval-Augmented Generation (RAG), into internal tooling and operational systems to enhance developer productivity and incident response
- Experience applying predictive analytics, anomaly detection, and MLOps to use cases such as cost forecasting, capacity planning, and proactive incident management
- Experience building and optimizing Cloud FinOps tooling to monitor usage patterns, reduce waste, and provide actionable insights into cloud spend
- Experience developing AI-driven automation agents (bots) for cloud operations, alert triage, knowledge retrieval, and ticket deflection
- Strong experience with:
  - Infrastructure-as-Code: Terraform, CDK
  - Kubernetes: Cluster lifecycle management, Helm/Kustomize, GitOps (ArgoCD)
  - CI/CD pipelines, observability frameworks (Prometheus, Grafana, ELK), and SRE tooling for incident automation
- Proficient in SQL and NoSQL databases, such as PostgreSQL and Elasticsearch
- Exposure to Kafka and event-driven architectures for real-time data streaming and integration
- Excellent problem-solving, debugging, and systems design skills
- Demonstrated leadership in cross-functional engineering teams, including mentoring, architectural guidance, and influencing long-term platform direction
All your information will be kept confidential according to EEO guidelines.