Expoint - all jobs in one place

Red Hat Senior Cloud & AI Platforms Engineer - Services
Spain 
664836368

25.09.2024

What will you do

  • Provide an exceptional customer experience by communicating professionally and applying product knowledge and deep troubleshooting to perform direct actions in cluster environments to resolve various issues.

  • Contribute to global initiatives and projects to constantly reduce customer effort, improve tooling, and design and write automation software to improve efficiency.

  • Act as the direct contact, adviser, and mentor for customers with inquiries and issues concerning their Cloud & AI Platforms Services, through our Customer Portal, conference calls, and remote access.

  • Proactively analyze cluster status and identify single points of failure and other high-risk architecture issues; propose and implement more resilient solutions.

  • Record customer interactions, including the investigation, troubleshooting, and resolution of issues, so that diagnostic steps are documented and solutions can be reused in future incidents.

  • Partner with internal teams and external parties to deliver seamless infrastructure support for Red Hat’s Cloud Services & AI Platforms.

  • Demonstrate a strong work ethic, work as part of a team, and focus on customers and resolving their issues.

  • Be available to perform weekend shift duties on a rotational schedule.

What will you bring

  • Proven experience in infrastructure implementation, deployment, administration, and production support of container technologies and orchestration platforms (CRI-O, Kubernetes, xKS, Docker, OpenShift Container Platform).

  • Exceptional technical, analytical, and troubleshooting skills, using tools like curl, strace, oc (kubectl), and Wireshark to investigate issues and form precise remediation plans for components such as networking, system performance, Kubernetes, OpenShift Container Platform, Service Mesh, and RESTful API calls.

  • Experience working in a Technical Support role that interfaces with Site Reliability Engineers (SRE), Development Engineering teams, and partner vendors to resolve customer issues.

  • Strong background in DevOps and/or MLOps, including agile concepts, application development, and deployment tools.

  • Experience with application development, ideally with Python or other languages like Go, Java, and C/C++.

  • Knowledge of training, tuning, and serving ML models using tools like PyTorch, TensorFlow, Ray, Kubeflow Pipelines, Jupyter, or similar.

  • Solid customer-centric focus: the ability to balance technical expertise and customer interaction while effectively managing competing priorities, learning and teaching modern technologies, and excelling in technical communication and collaboration.

The following is considered a plus:

  • Experience developing and deploying large-scale AI applications and generative AI applications.

  • Experience in training, tuning, and serving ML models using tools like PyTorch, TensorFlow, Ray, KServe, ModelMesh, Kubeflow Pipelines, or similar.

  • Knowledge of machine learning algorithms and concepts (e.g., supervised learning, unsupervised learning, deep learning) as applied to generative AI.

  • Experience supporting, tuning, and troubleshooting Jupyter environments on Kubernetes systems in production.

  • Experience as a customer-facing Site Reliability Engineer (SRE), or knowledge of SRE procedures, including incident management, monitoring and alerting, capacity planning, and automation of operational tasks.