Senior HPC Engineer/Administrator (IMC - 001)
Description
Serving Maryland and the Greater Washington D.C. area, SageCor Solutions (SageCor) is a growing company bringing complete engineering services and true full-lifecycle systems engineering services to areas requiring (or desiring) nationally recognized expertise in high-performance computing, large-scale data analytics, and cutting-edge information technologies.
Active TS/SCI w/ Polygraph required.
The Senior HPC Engineer/Administrator will be responsible for providing system administration and technical support for traditional and High-Performance Computing (HPC) systems in a research-driven environment.
Requirements:
• Configure and manage Linux and Windows (or other applicable) operating systems; install operating system software; troubleshoot, configure, and maintain the integrity of network components; and implement operating system enhancements to improve security, reliability, and performance
• Administer, monitor, and maintain HPC systems, including compute nodes, storage, networking, and software stacks
• Provide support for IT systems, including day-to-day operations, monitoring, and problem resolution for all client, server, storage, network, and mobile devices
• Implement and maintain automation tools for system provisioning, configuration management, and monitoring
• Provide support for implementation, troubleshooting and maintenance of IT systems
• Manage the daily activities of configuration and operation of IT systems
• Provide assistance to users in accessing and using IT systems
• Optimize system operations and resource utilization, and perform system capacity analysis and planning
• Apply in-depth experience in troubleshooting IT systems
• Analyze and resolve complex problems associated with server hardware, applications and software integration
• Contribute to performance benchmarking, system tuning, and capacity planning
• Support researchers by providing technical expertise and resolving IT-related roadblocks or issues
• Document system administration procedures and contribute to knowledge-sharing initiatives
Technical skills:
• Experience administering Linux-based servers and HPC clusters, including job schedulers (e.g., Slurm, LSF, PBS)
• Experience configuring and managing Virtual Private Network (VPN) clients and servers
• Scripting/programming skills (C and Python)
• Knowledge of:
o System automation tools (e.g., Ansible)
o System provisioning tools (e.g., Warewolf)
o Distributed storage systems (e.g., Lustre, BeeGFS)
o Containerization (e.g., Docker, Apptainer)
o Installing, maintaining and using infrastructure and performance monitoring and optimization tools (e.g., Grafana, Prometheus)
o Setting up and executing benchmarks in an HPC environment and analyzing their results systematically
Qualifications:
• Active Top Secret/SCI clearance with polygraph
• Preferably meets DoD 8140.01 or DoD 8570.01-M training and certification requirements
Consistent with federal and state law where SageCor conducts business, SageCor Solutions provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or veteran status, or any other protected class.