GPU Cluster Infrastructure Engineer - Remote

Engineering: Virtual, United States; San Francisco, California


Description

GPU Cluster Infrastructure Engineer
Location: Remote or Bay Area, CA
Compensation: $125K - $150K
 
Our client is developing a high-performance, multi-technology, vendor-independent, xPU-based accelerated cloud computing platform. Building massive clusters purpose-built for high-performance parallel computing, the group also aims to launch a global accelerated cloud solution. Beyond that, our client will focus on broader Artificial General Intelligence (AGI) products, supercomputing services, and end-to-end AI engineering services.
 
They're seeking a talented, self-driven, and passionate GPU Cluster Infrastructure Engineer (AGI Infrastructure Engineer) to join their team and help architect a next-gen future for their global customers. Ideally, this person would be based in the Bay Area, but they are open to remote candidates who are willing to travel to the Bay Area as needed.
 
As the GPU Cluster Infrastructure Engineer, you will:
 
  • Design and implement innovative hardware solutions for highly scalable and efficient xPU PODs
  • Collaborate with architects, software engineers, and system engineers to ensure optimal integration of hardware and software components within the PODs 
  • Deeply understand leading xPU architectures from NVIDIA, AMD, and/or Intel and leverage their capabilities for performance optimization within PODs 
  • Participate in the development and execution of hardware verification and validation plans 
  • Stay up to date on the latest advancements in xPU technology and related hardware trends
  • Contribute to technical documentation and maintain clear communication within the team
 
 
As the GPU Cluster Infrastructure Engineer, your background should include:
 
  • Master’s degree in Electrical Engineering, Computer Engineering, or a related field (or equivalent experience)  
  • 5+ years of experience designing and developing hardware solutions, preferably for data center or high-performance computing environments
  • Proven experience with virtualization platforms, preferably including:  
    • VMware (vSphere, ESXi, etc.) OR Nutanix (AHV, AOS, etc.) 
    • Strong understanding of hypervisor technologies and their functionalities 
    • Ability to integrate and manage both internal and external virtualization platforms 
 
  • Must have experience developing and running applications on the ROCm platform, with a strong understanding of ROCm components such as HIP and OpenCL and of AMD GPU architecture
  • In-depth knowledge of xPU architectures, particularly from NVIDIA, AMD, or Intel
  • Completed certifications in NVIDIA AI in Datacenter and InfiniBand OR C-DAC certification.
  • Strong understanding of computer architecture, memory systems, and interfacing techniques.  
  • Solid understanding of OpenStack concepts and experience managing cloud infrastructure  
  • Prior experience in building and operating Cloud POD infrastructure
  • Proficiency in hardware description languages (HDL) like Verilog or VHDL  
  • Experience with hardware simulation and verification tools
  • Excellent communication, collaboration, and problem-solving skills
  • A passion for innovation and a drive to contribute to cutting-edge technological development
 
If interested, please send your resume to [email protected].