Senior Systems Administrator - Machine Learning

Information Technology San Jose, California


Description

Tired of being a faceless engineer in a FAANG company? Stand out and be the lead engineer as Xperi expands into the exciting area of Machine Learning. We are currently seeking a motivated human to join our team in San Jose, CA as a Senior Systems Administrator for our Linux and Machine Learning infrastructure.

Xperi is a technology company at its core, and our globally distributed engineering teams' use of Machine Learning tools is driving some of the company's most exciting innovations to date. This newly created role is a greenfield opportunity to work with our Infrastructure Team supporting DevOps and Engineering, with responsibilities including deployment and build process automation, performance tuning, and tools development.

You will work closely with engineers across the company to design, deploy, and maintain Linux systems, with a key focus on our machine learning environment as well as the general-purpose Linux systems used for build management, databases, storage, and applications. You must understand how hardware design, the software stack, and the operating system interact.

We're looking for a Systems Administrator who:

  • Has primarily supported machine learning or cluster computing environments using services such as Docker, Ansible and Kubernetes (K8s)
  • Has worked with x86 and NVIDIA hardware and can build and pull apart systems for testing, repairs, or upgrades (for example, knows the maximum torque to apply to a given CPU cooler)
  • Is attentive to detail in implementation of solutions to complex problems
  • Is comfortable configuring Linux servers, including security and networking
  • Is an excellent communicator who can work with other team members and internal customers to resolve issues and develop new solutions
  • Is passionate about sharing knowledge and mentoring other members of their team
  • Is fastidious with documentation and process

 
Responsibilities:

  • Apply proficiency in system-level software, in particular hardware-software interactions, resource utilization, and system profiling
  • Operate, maintain, and troubleshoot Linux systems within an enterprise-server environment
  • Perform Linux systems performance tuning and root cause analysis
  • Deploy Linux systems with automated configuration tools such as Chef, Puppet, and Ansible, using containers and orchestration services such as Docker and Kubernetes (K8s)
  • Integrate services in a heterogeneous computing environment that includes on-prem and cloud services from a variety of vendors and technologies
  • Install, maintain and ensure the reliability and security of computer operating system software, third party products, and server hardware infrastructure components
  • Maintain and monitor production systems; understand the importance of proper monitoring and alerting and be able to configure both
  • Provide off-hours coverage and support for maintenance during outage windows
  • Provide mentorship and training to other system administrator staff
  • Be passionate about documentation! Document processes and procedures to automate common tasks

 
What we value:

  • Hands-on, bare-metal building experience using custom hardware
  • Machine learning experience: Caffe, TensorFlow, PyTorch, CUDA
  • Experience working on Linux servers in an enterprise environment
  • Automated systems deployment and management expertise
  • Scripting and programming experience: Bash, Python, Perl, Ruby
  • Software packaging proficiency: RPM, deb, Python module, Docker
  • Experience designing and implementing secure distributed systems
  • Network stack and firewall knowledge
  • Database experience
  • Experience integrating with various cloud infrastructure technologies: AWS, Azure, GCS, VMware
  • Network design skills
  • HPC and cluster computing experience

Specific Job Requirements:

  • Occasional travel required