Job Description:
We are seeking Site Reliability Engineers (SREs) to design, build, and maintain our next-gen platforms. The role provides the opportunity to work with a wide range of technologies and to build the unique perspective that comes from integrating disparate services (both on-prem and off-prem) that must interact seamlessly with each other. You will work with colleagues who are fun, smart, hardworking, and driven. You will be part of a growing global team, giving you room to innovate and be creative.
Position Summary:
- Collaborates with a diverse set of engineers, architects, and teams to design, develop, test, and implement secure, robust, highly available, and scalable solutions for BofA’s External Cloud Platform.
- Collaborates with other software engineers and teams to design and implement deployment approaches using highly scalable, automated continuous integration and continuous delivery pipelines.
- Takes responsibility for all aspects of reliability; collaborates with technical experts, key stakeholders, and team members to resolve complex problems, owning each issue until you are sure it will not recur.
- Applies a deep understanding of SRE practices, service level indicators, and service level objectives to proactively resolve issues before they impact customers.
- Gather, analyze, synthesize, and develop visualizations and reporting from large, diverse data sets in service of continuous improvement of the platform.
- Implement infrastructure, configuration, and network as code for the applications and platforms in your remit.
- Identify opportunities to eliminate toil and automate the triage of issues to improve overall operational stability.
- Collaborate with a global team to identify, analyze, and resolve platform vulnerabilities.
- Proactively promote the adoption of site reliability engineering best practices within the team and organization.
- Participate in 24x7 on-call coverage under a follow-the-sun model and perform blameless postmortems (RCAs) as needed.
Required Skills:
- 7 years of combined experience in SRE, software development, or infrastructure engineering (4 years with an advanced degree in Computer Science or a related technical field).
- 3+ years of hands-on experience building and maintaining cloud platforms on a major cloud service provider.
- Strong experience in implementing, monitoring, and maintaining a highly scalable and resilient Data Services platform on a major CSP such as AWS, Azure, or GCP.
- Strong experience with monitoring tools such as Grafana, Prometheus, Splunk, or Dynatrace, as well as cloud-native tools such as CloudWatch, CloudTrail, Azure Monitor, and Log Analytics.
- Proficiency in implementing, monitoring, and maintaining a Databricks, RDS, or OpenAI platform.
- Proficiency in at least one programming language such as Python, Java/Spring Boot, or .NET; 5+ years of applied experience in Python or Java.
- Proficiency in implementing CI/CD pipelines with tools such as Git and Jenkins; familiarity with the GitOps model.
- Advanced knowledge of networking (firewalls, DNS, load balancing, proxies, etc.).
- Advanced understanding of Linux and Windows operating systems, including shell scripting.
- Excellent interpersonal, organizational and communication (written, verbal, and presentation) skills are a must.
- Proven ability to work independently with minimal supervision and as part of a global team with direct responsibilities and an ability to juggle competing priorities and adapt to changes in project scope.
Desired Skills:
- Strong experience working with a complex IAM infrastructure, including Active Directory, Azure AD Connect, Azure AD, and SSO solutions such as PingIdentity or Okta.
- Proficiency in creating automation using Python, Terraform, or Ansible.
- Proficiency in implementing, monitoring, and maintaining a Databricks, CosmosDB, or OpenAI platform.
- Experience in implementing, monitoring, and maintaining a highly scalable and resilient enterprise platform on Microsoft Azure using native services related to compute, storage, networking, security, and observability.
- Experience with compute and containerization technologies such as EC2, EKS, Fargate, OpenShift, or Kubernetes.
- Understanding of cost management, inventory management, and the FinOps model.