What you will do:
Design and build software that collects, transforms, and publishes health data about our global GPU fleet.
Develop microservices and data pipelines in Go or Python that ingest and normalize data from diverse sources, routing millions of records per day (Kafka, Airflow, Kinesis).
Instrument production infrastructure and workloads running on Kubernetes and bare-metal clusters; add tracing and metrics hooks for deeper insights.
Automate deployments and testing with CI/CD (GitLab, Argo) and IaC (Terraform), ensuring repeatable, low-touch releases.
Participate in the full lifecycle of cloud services, from design docs and code reviews through deployment, monitoring, and continuous improvement.
Collaborate with other engineers to debug live issues and turn post-incident insights into durable code fixes.
Contribute to internal tooling and dashboards that help engineers visualize fleet health, utilization, and capacity trends.
What we need to see:
Actively pursuing a BS or MS in Computer Science, Computer Engineering, or a closely related quantitative field (e.g., Physics or Mathematics).
Solid understanding of distributed-systems fundamentals, modern software-engineering practices, and data-modeling principles.
Proficiency in at least one programming language, preferably Python or Go.
Working knowledge of Linux, basic networking concepts, and Kubernetes container orchestration.
Ways to stand out from the crowd:
A systematic, analytical problem-solving approach paired with clear written and verbal communication skills and a strong sense of ownership.
Demonstrated ability to debug, optimize, and automate code or workflows with minimal guidance.
Hands-on experience building, deploying, and operating services in a public-cloud or large on-prem environment.
You will also be eligible for Intern