NVIDIA is seeking a Senior Data and Observability Architect. We serve and collaborate directly with NVIDIA’s rapidly growing AI, HW, and SW engineering and research teams across the company. We are looking for a technical leader to define a vision and roadmap for distributed observability systems for large-scale AI and HPC clusters and workloads, and to guide implementation towards this vision. You will architect systems for data collection, aggregation, enrichment, storage, retrieval, and visualization to spectacularly improve the efficiency, performance, and productivity of AI and HPC workloads. You will lead technical teams to develop and deploy observability solutions for multiple clusters around the world.
What You'll Be Doing:
Collaborate with AI, HW, and SW engineering and research teams to define a vision and roadmap for AI/HPC cluster observability.
Develop, test, and deploy data collectors, pipelines, and visualization and retrieval services.
Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.
Empower our engineers and researchers with data to improve performance, productivity, and efficiency.
Continuously improve quality, workloads, and processes through better observability.
What We Need
Experience designing and building large-scale, distributed observability systems.
Ability to identify high-value data for collection and analysis.
Experience with turning raw data into actionable reports
Experience with observability platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools
Technical lead-level Python programming experience and use of API calls
Passion for improving the productivity of others
Excellent planning and interpersonal skills
Flexibility/adaptability working in a dynamic environment with changing requirements
MS (preferred) or BS in Computer Science, Electrical Engineering, or related field or equivalent experience
12+ years of relevant experience.
Ways To Stand Out From The Crowd:
Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.
Prior experience in infrastructure software, production application software development, and software development, release, and support methodology
Experience in the management of datacenters and large-scale distributed computing
Experience in working with AI researchers and/or EDA developers
Track record of driving process improvements and measuring efficiency, and a passion for sharing knowledge and experience driving complex projects end-to-end.
You will also be eligible for equity.