What you will be doing:
Design and develop firmware solutions for manageability and observability of data center servers.
Actively participate in hardware bring-up activities, OOB firmware development, protocol stacks (Redfish, PLDM, MCTP, NSM) and hardware-software co-design for Cloud Service Provider deployments.
Debug and troubleshoot NVIDIA GPU firmware issues, power management, performance, and thermal control problems for data center deployments, providing active support to CSPs.
Partner directly with CSPs to deliver technical solutions, co-develop & co-debug features and optimizations, and provide support during new product introductions.
Perform advanced system debugging, root cause analysis, and performance optimization for large-scale data center environments.
Collaborate with AE, FAE, and Solution Architect teams to deliver integrated customer solutions and technical documentation.
What we need to see:
Deep expertise in data center server architectures, HPC systems, and hardware-software co-design.
Deep expertise in embedded firmware, server management controllers, and hardware bring-up, with a proven track record of shipping production BMC solutions.
Strong knowledge of DMTF and related management protocols (Redfish, IPMI, PLDM, MCTP, SPDM), telemetry frameworks, and out-of-band management architectures.
Expert-level skills in C/C++ in resource-constrained embedded environments, RTOS, device drivers, and low-level protocols (I2C, SPI, UART, PCIe, MCTP).
Experience with RAS, including error handling, error injection, fault isolation, and system health monitoring.
BS or MS in Computer Engineering, Computer Science, or related field (or equivalent experience).
8-12 years of system software development experience.
Ways to stand out from the crowd:
Knowledge of cloud and cluster-level deployment and management systems.
Experience with GPU computing (CUDA) and deep learning workloads.
Knowledge of memory fabric and CXL architectures.
You will also be eligible for equity and benefits.