Expoint - all jobs in one place

SAP Senior AI Observability Engineer SRE 
Hungary, Pécs 
839831731

24.06.2024

Senior AI Observability Engineer (SRE) focusing on both soft and physical layers of our global operations.

About the Role:
You will join a global, multidisciplinary SRE team of DevOps engineers, contributing to the development of AI solutions that power a stack of diverse observability services using Machine Learning and Large Language Models. This role involves reshaping how we manage alerts, metrics, and logs by introducing deep learning and NLP to enhance reliability services. You will also support troubleshooting during major incidents related to our global cloud infrastructure, ensuring excellence in triage and resolution. Using these advanced methods, you will help the team improve critical KPIs such as MTTD/MTTR, signal-to-noise ratio, and other relevant metrics.



Expectations and Tasks:

  • Collaborate with engineering and product management following Agile methodologies such as Scrum.
  • Prioritize and deliver high-quality developments under time constraints.
  • Ensure smooth operations and maximize uptime of the services we are responsible for.
  • Participate in on-call rotational coverage, including weekends and holidays, with compensation as per local policies. The team uses a global follow-the-sun model with local daytime coverage.
  • Share knowledge across the team.
  • Work on data analysis & generation.
  • Support AI research & development projects.
  • Train and fine-tune AI Models.

Required Skills:

  • Fast adoption of cutting-edge technologies.
  • Advanced analytical and problem-solving mindset.
  • Strong team player with excellent communication skills.
  • Self-starter who acts with a sense of urgency to move issues forward efficiently and effectively.
  • Fluent in spoken & written English.

Required Experience:

  • Development:
    • 4+ years of experience in professional or enterprise development.
    • Strong knowledge of the Python & JavaScript programming languages.
    • Proven experience in REST API implementation using Flask or FastAPI.
    • Experience in microservice-based development.
  • DevOps:
    • Understanding of CI/CD pipelines using Azure, Jenkins, Travis, or similar.
    • Hands-on experience with Docker containers & Kubernetes.
    • Experience working with public cloud environments such as GCP/AWS/Azure.
    • Solid understanding of JSON, YAML, & GitHub.
    • Solid understanding of Enterprise/Service Provider Data Center Architecture.
    • Strong familiarity with Enterprise-class Fault Monitoring and Performance Management tools.
  • Artificial Intelligence:
    • Experience with ML frameworks like PyTorch, TensorFlow, or similar.
    • Knowledge in Prompt Engineering, Large Language Models, RAG, and Embeddings.
    • Good understanding of supervised and unsupervised Machine Learning models.
    • Good understanding of algorithms, data structures & data patterns.

Preferred:

  • Knowledge Graphs, Graph Databases, and Graph Theory.
  • Experience with Elasticsearch, Splunk, or similar.
  • Experience in web development frameworks.
  • Familiarity with Terraform, Helm charts, Ansible, or similar tools.
  • Knowledge of Kubeflow, MLflow, Dataflow, or similar technologies.

Education:

  • Bachelor's or equivalent education in Software Engineering, Computer Science, or a related field.
  • Industry technical certifications (CKA, Elastic Certified Engineer, RHCE, CCNA, AZ-900, etc.) and ITIL-related courseware are a plus.

Service Reliability Engineering (SRE) GCID organization. It reliability and developing and enhancing observability that help to either prevent or isolate an incident.