Linux Server GPU Engineer - TS/SCI

System Engineering Bethesda, Maryland


Description

Linux Server/NVIDIA Admin/GPU Engineer - TS/SCI

Xcelerate Solutions is seeking a Linux Server GPU Engineer to support the National Media Exploitation Center (NMEC). This role requires an individual who has technical experience administering NVIDIA DGX-1 and A100 servers in physical and virtual environments. The individual should be detail-oriented in order to capture customer inquiries appropriately. This role is responsible for interacting with administrators to handle service inquiries and problems. Duties include examining customer problems and implementing appropriate corrective action to initiate a repair or return to service. This role analyzes recurring problems and initiates solutions to prevent recurrence, and analyzes existing infrastructure for tuning and performance enhancements. The individual will provide systems and software operations and maintenance support in a large, multi-enclave enterprise environment. The individual will work in a team environment to ensure that mission needs are met and that customer capabilities remain functional. Individuals in this role may be required to perform technical software configuration, reboots, and other remedial actions on customer servers. The Customer utilizes an Agile Framework to plan and successfully complete all initiatives. The work location is in Bethesda at the Intelligence Community Campus.

Security Clearance:
TS/SCI 

Location:
Bethesda, MD 
 

Responsibilities: 

  • GPU Architecture and Design: Collaborate with a multidisciplinary team to define, develop, and optimize GPU architectures, ensuring they meet stringent performance, power efficiency, and feature requirements. Leverage industry insights to drive design decisions. Ensure that GPU designs and integrations are not only optimized for Linux but are also adaptable to other operating systems.
  • Operating System Integration: Work closely with operating system developers to ensure smooth GPU integration with Linux-based systems. Optimize GPU drivers for compatibility, performance, and reliability in a Linux environment. Provide regular maintenance and updates to ensure continued compatibility.
  • Hardware Expertise: Contribute to the design and development of GPU hardware, providing insights into hardware architecture to ensure efficient interaction with software components. Maintain and update hardware designs as needed.
  • CUDA (Compute Unified Device Architecture)/OpenCL (Open Computing Language) Programming: Develop and optimize applications using CUDA or OpenCL, harnessing the full potential of GPU hardware for parallel processing, high-performance computing, and machine learning on Linux platforms. Maintain and update software for optimal performance.
  • Performance Analysis: Analyze GPU performance, identify bottlenecks, and develop strategies to enhance performance across various applications in Linux, addressing both hardware and software considerations. Regularly monitor and improve performance.
  • GPU Tooling: Create and maintain debugging tools, profiling utilities, and performance analysis software tailored for Linux systems to facilitate efficient GPU development and troubleshooting. Keep tools up to date and functional.
  • Power Efficiency: Work on power management techniques to optimize GPU power consumption, ensuring efficient operation on both mobile and desktop Linux platforms. Continuously assess and enhance power efficiency strategies.
  • Testing and Validation: Design and execute tests to validate GPU performance and functionality on Linux, including stress testing, benchmarking, and debugging to ensure robust operation. Maintain and expand the testing suite.
  • Documentation: Maintain comprehensive technical documentation, including architectural specifications, code documentation, and Linux-specific best practices for GPU development. Keep documentation up to date with changes and improvements.
  • Industry Insight: Stay updated on the latest trends, innovations, and competitive landscapes within the GPU industry, contributing to research efforts and proposing Linux-specific approaches to GPU design and optimization. Share regular updates and insights with the team.

Minimum Requirements:

  • Bachelor's or higher degree in Computer Science, Electrical Engineering, or a related field. Additional years of experience may be considered in lieu of a degree.
  • 10+ years of relevant systems engineering experience.
  • Proven experience in GPU architecture design and GPU performance optimization.
  • Expertise in operating system integration for Linux.
  • Strong understanding of computer hardware architecture, particularly as it relates to Linux systems.
  • Knowledge of parallel computing, graphics algorithms, and real-time rendering in Linux environments.
  • Familiarity with GPU debugging tools and profiling software for Linux.
  • Excellent problem-solving skills and the ability to collaborate within a team.
  • Strong communication skills for conveying technical information in a Linux context.
  • Proficiency with scripting languages such as Python or Bash.
  • Proficiency with automation tools such as Ansible, Puppet, Salt, or Terraform.
  • Candidate must, at a minimum, meet DoD 8570.11 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification). An IAT Level III certification would also be acceptable (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, or CCSP).

Preferred Qualifications:

  • Published research or contributions in the GPU industry, especially related to Linux.
  • Experience with machine learning and neural network frameworks on GPUs in Linux.
  • Knowledge of GPU virtualization, cloud computing, and emerging Linux-based technologies in the field.
  • Proficiency in GPU-specific programming languages.
  • Experience with container technologies (Docker, Kubernetes).
  • Experience with Prometheus/Grafana for monitoring.
  • Knowledge of distributed resource scheduling systems [Slurm (preferred), LSF, etc.].
  • Familiarity with CUDA and managing GPU-accelerated computing systems.
  • Basic knowledge of deep learning frameworks and algorithms.

About Xcelerate Solutions:  

Founded in 2009 and headquartered in McLean, VA, Xcelerate Solutions (www.xceleratesolutions.com) is one of America's fastest-growing companies. Xcelerate’s culture is defined by our diversified workforce of dynamic and versatile professionals, supported with growth and development opportunities that contribute to individual and company growth. This strong commitment to our employees has been recognized by our inclusion on the Washington Business Journal’s “50 Best Places to Work” list, as well as by our status as a “Great Place to Work” certified company with a 4.6-star Glassdoor rating and a 99% CEO approval rating. Come find out why Xcelerate Solutions is one of the DC Metro area's top employers!

 Xcelerate Solutions is an Equal Employment Opportunity/Affirmative Action Employer. We evaluate qualified applicants without regard to race, color, national origin, religion, age, equal pay, disability, veteran status, sex, sexual orientation, gender identity, genetic information, or expression of another protected characteristic. As part of this commitment to the full inclusion of all qualified individuals, Xcelerate provides reasonable accommodations if needed because of an applicant's or an employee's disability. 

 Pay Transparency Notice: Xcelerate Solutions will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant.