GPU Cluster Infrastructure Engineer - Remote

Engineering Virtual, United States San Francisco, California

Description

GPU Cluster Infrastructure Engineer

Location: Remote or Bay Area, CA

Compensation: $125K - $150K

Our client is focused on developing a high-performance, multi-technology, vendor-independent, xPU-based Accelerated Cloud Computing platform. Stacking massive clusters purpose-built for high-performance parallel computing, the group also aims to launch a global accelerated cloud solution. Our client will also focus on broader Artificial General Intelligence (AGI) products, supercomputing services, and end-to-end AI engineering services.

They're seeking a talented, self-driven, and passionate GPU Cluster Infrastructure Engineer (AGI Infrastructure Engineer) to join their team and help architect a next-gen future for their global customers. Ideally, this person would sit in the Bay Area, but they are open to remote candidates who are willing to travel to the Bay Area as needed.

As the GPU Cluster Infrastructure Engineer, you will:

Design and implement innovative hardware solutions for highly scalable and efficient xPU PODs.
Collaborate with architects, software engineers, and system engineers to ensure optimal integration of hardware and software components within the PODs
Deeply understand leading xPU architectures from NVIDIA, AMD, and/or Intel and leverage their capabilities for performance optimization within PODs
Participate in the development and execution of hardware verification and validation plans
Stay up to date on the latest advancements in xPU technology and related hardware trends
Contribute to technical documentation and maintain clear communication within the team.

As the GPU Cluster Infrastructure Engineer, your background should include:

Master’s degree in Electrical Engineering, Computer Engineering, or a related field (or equivalent experience)
At least 5 years of experience in designing and developing hardware solutions, preferably for data center or high-performance computing environments
Proven experience with virtualization platforms, preferably including:

VMware (vSphere, ESXi, etc.) OR Nutanix (AHV, AOS, etc.)
Strong understanding of hypervisor technologies and their functionalities
Ability to integrate and manage both internal and external virtualization platforms

Must have experience developing and running applications on the ROCm platform, with a strong understanding of ROCm components such as HIP and OpenCL and of AMD GPU architecture
In-depth knowledge of xPU architectures, particularly from NVIDIA, AMD, or Intel.
Completed certifications in NVIDIA AI in Datacenter and InfiniBand OR C-DAC certification.
Strong understanding of computer architecture, memory systems, and interfacing techniques.
Solid understanding of OpenStack concepts and experience managing cloud infrastructure
Prior experience in building and operating Cloud POD infrastructure
Proficiency in hardware description languages (HDL) like Verilog or VHDL
Experience with hardware simulation and verification tools
Excellent communication, collaboration, and problem-solving skills
A passion for innovation and a drive to contribute to cutting-edge technological development.

If interested, please send your resume to [email protected].


PeopleConnect Staffing Careers