

What you will do:
Own the resilience testing roadmap for vLLM and llm-d: define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
Design GPU/accelerator-aware fault experiments that target vLLM and the stack beneath it (drivers, GPU Operator/DevicePlugin, NCCL/collectives, storage/network paths, NUMA/topology)
Build an automated harness (preferably extending krkn-chaos, https://github.com/krkn-chaos/krkn) to run controlled experiments with scoped blast radius and evidence capture (logs, traces, metrics)
Integrate fault signals into pipelines (GitHub Actions or otherwise) as resilience gates alongside performance gates
Develop detection and diagnostics: dashboards and alerts for pre-fault signals (e.g., vLLM queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
Triage and root-cause resilience regressions from field/customer issues; upstream bugs and fixes to vLLM and llm-d
Explore and experiment with emerging AI technologies relevant to software development and testing, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Publish learnings (internal/external): failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present at internal/external forums
What you will bring:
3+ years in reliability and/or performance engineering on large-scale distributed systems
Expertise in systems-level software design
Expertise with Kubernetes and modern LLM inference server stack (e.g., vLLM, TensorRT-LLM, TGI)
Observability and forensics skills, with experience in Prometheus/Grafana, OpenTelemetry tracing, eBPF/bpftrace/perf, Nsight Systems, and PyTorch Profiler; adept at converting raw signals into actionable narratives.
Fluency in Python (data & ML), strong Bash/Linux skills
Exceptional communication skills - able to translate raw data into customer value and executive narratives
Commitment to open-source values and upstream collaboration
The following is considered a plus:
Master's or PhD in Computer Science, AI, or a related field
History of upstream contributions and community leadership, including public talks or blog posts on resilience or chaos engineering
Competitive benchmarking and failure characterization at scale.
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave

Primary job responsibilities:
Provide technical information on building systems with Red Hat's cloud products and solutions, especially Red Hat Enterprise Linux (RHEL).
Draw on product knowledge and deep technical expertise to deliver proofs of concept (POCs), presentations, and demos to customers.
Present complex solutions to prospective customers, design value-driven architectures, and explain the applications and cost-effectiveness of those technical solutions, emphasizing the advantages of Linux-based solutions in particular.
Work with the sales team to deliver solutions that win new contracts.
Develop a deep understanding of the customer's business and IT environment and, together with the sales team, assess how Red Hat products (especially RHEL) can be adopted.
Communicate value, progress, and status to all stakeholders.
Required skills:
Experience in presales, sales engineering, solution architecture, or a similar role.
Experience proposing, designing, building, and operating Red Hat or comparable products, especially Red Hat Enterprise Linux (RHEL), Kubernetes, and cloud-native technologies.
A balanced combination of enthusiasm for open source (especially Linux), knowledge of cloud and software, and a deep understanding of customers' business and IT problems.
Excellent communication, presentation, and negotiation skills for moving deals forward.
Ability to build relationships with customers at the engineering, business, and executive levels.
Willingness to learn continuously and acquire new capabilities.
Basic English proficiency (especially reading and listening).
Application development experience with Java is a plus.
Experience with IoT, machine learning (AI), FinTech, integration, microservices, or public cloud is a plus.

Job responsibilities:
As a technical advisor, guide customers from presales through post-sales implementation and ensure that adoption succeeds.
Lead technical validation through demos, workshops, and pilot projects, connecting customer needs to Ansible capabilities.
Develop reusable solution frameworks and content to support the sales team and deliver a consistent standard of results to customers.
Work with product teams to improve the customer experience and advocate for customer needs within Red Hat.
Support the preparation of RFP responses as a member of the team to help customers succeed.
Technical skills:
Expertise and strong hands-on skills with Ansible Automation Platform (certification preferred) and tools such as Puppet, Chef, SaltStack, and Terraform.
6+ years of experience in automation and 5-10 years in architecture, development, or consulting.
Proficiency with Linux (RHEL/Satellite), Cisco network automation, and DevOps practices.
Business skills:
Ability to propose cross-platform solutions that address enterprise IT challenges to executive-level stakeholders.
Experience building relationships across large IT organizations and managing end-to-end proof-of-concept processes.
Preferred qualifications:
Red Hat certifications (RHCE, Ansible Specialist, Architect) and a degree in computer science or engineering.
Established standing as an industry thought leader through contributions such as white papers and conference talks, with current knowledge of the latest trends in automation.

Job responsibilities:
Own the business growth strategy for customer accounts based on Red Hat Ansible Automation Platform solutions and use cases
Work with the account team during account planning, analyze the customer's business drivers, and craft a narrative that positions Red Hat automation solutions as a key element of technology-led innovation and digital transformation
Manage complex sales cycles from prospecting to close in coordination with account management teams, solution architects, and professional services teams
Meet quantitative and qualitative performance expectations
Use leadership skills and extensive professional experience to engage executives (C-level decision makers), earn their trust, and create transformative projects
Demonstrate the business impact of Red Hat technology solutions to establish a compelling reason for customers to pursue a project
Tailor Red Hat solutions to customers' business requirements
Ensure that customer decision makers understand and agree on the differentiated business value of Red Hat solutions and Red Hat's competitive advantages
Help Red Hat sales teams and partners communicate the business value of Red Hat solutions effectively to enterprises
Use Red Hat's journey-based services engagement programs and commercial purchasing programs to build long-term, strategic customer relationships
Qualifications:
10+ years of experience selling automation and management software products, cloud services, or related technology products
Experience with value-based solution selling and the ability to connect customers' business and transformation goals to the value that technology solutions deliver
Creative thinking, communication, and presentation skills
Passion for open source technology and an understanding of Red Hat's software subscription business model
Track record of collaborating seamlessly with global, cross-functional teams to achieve customer success
Expertise in the following areas:
IT automation and management
Business process automation
Robotic process automation (RPA)
IT security and compliance
Artificial intelligence (AI) and operations
DevOps, continuous integration (CI) and continuous delivery (CD), testing, the software development life cycle (SDLC), and agile methods
Hybrid, public, and private cloud
Containers and Kubernetes
Ability to articulate the business value of technology solutions
Consumption-based pricing models, software subscriptions, and licensing
Understanding of Red Hat's software portfolio and competing products

The Red Hat Solution Architect team is looking for an experienced presales engineer who can propose solutions tailored to customers' challenges. In this position, you will work with CxOs and IT leaders at customer companies to chart how their IT should contribute to the business and its operations, drawing up a concrete roadmap for getting there and supporting them along the way. At the same time, alongside a proactive IT strategy for achieving digital transformation, more efficient operation of existing IT assets and cost savings are becoming increasingly important. The mission is to plan a well-balanced offensive and defensive strategy and to support the customer's continued growth. To act as a technical advisor to customers, you will need technical knowledge, insight into open source, strong communication and interpersonal skills, and the ability to formulate IT strategy based on an understanding of the knowledge, challenges, and future direction of the industry you cover.
What you will do:
Develop a deep understanding of your assigned industry and customers, and propose IT strategies and future-state plans for customers
Develop a deep understanding of the customer's business and IT environment, and plan sales strategy together with the sales team
Gain an accurate understanding of customer challenges through customer calls and workshops, and assess how Red Hat products can be adopted
Work with the sales team to advance deals toward a signed contract
Provide technical information on building systems with Red Hat products and solutions
Draw on product knowledge and deep technical expertise to deliver proofs of concept (POCs), presentations, and demos to customers
Communicate progress and status to all stakeholders
What you will bring:
Experience in presales, sales engineering, solution architecture, or a similar role
Experience participating in system builds (requirements definition, design, implementation, and testing)
Excellent communication, presentation, and negotiation skills for moving deals forward, plus a track record of building relationships at the engineering, business, and executive levels
Ability to understand customers' business and technical problems and to explain effectively how a solution addresses their needs and requirements
Knowledge of enterprise solutions and architectures (e.g., cloud, big data, virtualization, storage, RDBMS, ERP systems such as Oracle and SAP, middleware, clustering, high availability)
Experience in UNIX or Linux system administration, integration, or development
Preferred experience and skills:
Experience running workshops that solve technical problems and discovery sessions for surfacing business challenges and ideas
Experience communicating with and presenting proposals to customers across multiple departments, including business units as well as IT
Experience proposing, designing, and building Red Hat OpenShift or Red Hat Ansible Automation Platform
Experience proposing and building business systems using xKS, k8s, or AI

Primary Job Responsibilities:
Carry out the account strategy to increase performance and customer success in key Telco accounts, retaining and growing bookings through strategic account planning.
Collaborate with team members to maximize Red Hat business in Japan Telco accounts, particularly by applying experience and excellence in operating tools such as RHSC.
Required Skills
Solid understanding of Telco customer business, industry trends, competitive landscape, and Red Hat's differentiators and value proposition.
Proven experience selling complex IT solutions to large organizations within the region and to multiple decision makers.