Linux Server/NVIDIA Admin/GPU Engineer - TS/SCI

System Engineering Bethesda, Maryland


Description


Xcelerate Solutions is seeking a Linux Server/NVIDIA GPU Engineer to support the National Media Exploitation Center (NMEC). This role requires an individual who has technical experience administering NVIDIA DGX-1 and A100 servers within a physical and virtual environment. This individual should be detail-oriented in order to capture customer inquiries accurately. The role is responsible for interacting with administrators to handle service inquiries and problems. Duties include examining customer problems and implementing appropriate corrective action to initiate a repair or return to service. The role also analyzes recurring problems and initiates solutions to prevent recurrence, and analyzes existing infrastructure for tuning and performance enhancements. The individual will provide systems and software operations and maintenance support in a large, multi-enclave enterprise environment, working in a team environment to ensure that mission needs are met and that customer capabilities remain functional. Individuals in this role may be required to perform technical software configuration, rebooting, and other remedial actions on customer servers. The customer uses an Agile framework to plan and complete all initiatives. The work location is the Intelligence Community Campus in Bethesda.

Security Clearance:
TS/SCI 

Location:
Bethesda, MD 

Responsibilities:

  • Review C&A documentation and provide feedback on the completeness and compliance of its content
  • Perform system installation, configuration maintenance, account maintenance, signature maintenance, patch management, and troubleshooting of operational IA and CND systems
  • Operate with appreciable latitude in developing methodology and presenting solutions to problems.
  • Contribute to deliverables and performance metrics where applicable.
  • Implement, operate, and maintain physical and virtual server hardware and systems software.
  • Provide technical support, administration, and monitoring of Linux systems within a physical and virtual environment.
  • Decommission server(s) upon request through an approved engineering change solution.
  • Install and/or remove service application and scripting components
  • Maintain server hardware and software assets in compliance with supporting third-party vendor and lease requirements.
  • Implement capacity planning and tuning actions for server assets
  • Provide support for the implementation, troubleshooting and maintenance of IT systems. Rapidly distinguish isolated user problems from enterprise-wide application/system problems.
  • Support operations, maintenance and troubleshooting for end user workstations, to include desktop applications
  • Coordinate with customers and stakeholders to collect data, conduct analysis, develop, and implement solutions associated with incident tickets and requirements.
  • Seek opportunities for continuous improvement to support effective and efficient operations
  • Develop solutions to complex technical issues.
  • Provide follow-up reports (technical findings, feedback, resolution steps taken) for Root Cause analysis, engineering technical assessment and process improvement initiatives.
  • Support customer requirements in a 24/7/365 environment and be able to provide on-call support during outages occurring after hours; may involve shift work.
  • Update operations and monitoring documentation for 24/7/365 Operations Watch personnel
  • Perform other duties as assigned that are associated with, or in support of, your primary role or program mission

Minimum Requirements:

  • Requires a bachelor’s degree and 10+ years of relevant experience; additional years of experience may be considered in lieu of a degree
  • Experience supervising others
  • 2 years of Unix administration experience, including Red Hat/CentOS (or derivative) and Ubuntu administration
  • System security engineering expertise in one or more of the following: system security design process; engineering life cycle; information domain; cross domain solutions; commercial off-the-shelf and government off-the-shelf cryptography; identification, authentication, and authorization; system integration; risk management; intrusion detection; contingency planning; incident handling; configuration control; change management; auditing; certification and accreditation process; principles of IA (confidentiality, integrity, non-repudiation, availability, and access control); and security testing
  • Possesses and applies expertise on multiple complex work assignments. Assignments may be broad in nature requiring originality and innovation in determining how to accomplish tasks. 
  • Hands-on experience identifying server hardware failures, including hard drives and memory
  • Experience with cluster configuration management tools such as Ansible or Salt
  • Strong knowledge of DNS, NFS, LDAP, and DHCP services
  • Experience with shell scripting and/or Python to automate repetitive administration tasks
  • Background in Linux server setup, deployment and maintenance 
  • Experience with hardening Linux environments 
  • Experience with system administration of server operating systems such as Linux (CentOS, RHEL, or Ubuntu) 
  • Experience troubleshooting issues in a growing environment 
  • Experience with log reviews, incident analysis, and identification of issue trends 
  • Experience with server patch management methodologies 
  • Time management skills with the ability to work within an IT Service Management/ticketing system independently  
  • Ability to triage and properly classify incidents and prioritize work efforts accordingly 
  • Strong oral and written communications skills 
  • Experience establishing goals and plans that meet project objectives 
  • Track record of working effectively within a team and supporting peers toward improved processes and results
  • Candidate must, at a minimum, meet DoD 8570.01-M IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GSEC, or SSCP, along with an appropriate computing environment (CE) certification)

Preferred Qualifications:

  • Experience with container technologies (Docker, Kubernetes)
  • Experience with Prometheus/Grafana for monitoring
  • Knowledge of distributed resource scheduling systems [Slurm (preferred), LSF, etc.]
  • Familiarity with CUDA and managing GPU-accelerated computing systems
  • Basic knowledge of deep learning frameworks and algorithms

About Xcelerate Solutions:

Founded in 2009 and headquartered in McLean, VA, Xcelerate Solutions (www.xceleratesolutions.com) is one of America's fastest-growing companies. Xcelerate’s culture is defined by our diversified workforce of dynamic and versatile professionals, supported with growth and development opportunities that contribute to individual and company growth. This strong commitment to our employees has been recognized by our inclusion on the Washington Business Journal’s “50 Best Places to Work” list and by our status as a “Great Place to Work” certified company with a 4.6-star rating and a 99% CEO approval rating on Glassdoor. Come find out why Xcelerate Solutions is one of the DC Metro area's top employers!

 Xcelerate Solutions is an Equal Employment Opportunity/Affirmative Action Employer. We evaluate qualified applicants without regard to race, color, national origin, religion, age, equal pay, disability, veteran status, sex, sexual orientation, gender identity, genetic information, or expression of another protected characteristic. As part of this commitment to the full inclusion of all qualified individuals, Xcelerate provides reasonable accommodations if needed because of an applicant's or an employee's disability.

 Pay Transparency Notice: Xcelerate Solutions will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant.