Senior Site Reliability Engineer

Engineering Platform | Remote, United States


About the Team

O’Reilly Media’s Site Reliability Engineering team is made up of a diverse set of engineers tasked with monitoring, maintaining, and upgrading the Cloud Infrastructure and developer tooling that supports our online learning platform. The SRE team at O’Reilly is small and growing, so there is plenty of room for new members to shape the team’s vision and direction.

We’re a highly collaborative and supportive team that focuses on ensuring each member of the team is exposed to as much of our stack as possible. We believe in “raising the water level” so that each member of the team is given an opportunity to grow and to help others grow. 


About the Job

Site Reliability Engineers at O’Reilly work closely with both Product Engineers and Platform Engineers to ensure that our Online Learning Platform is always stable. Engineers in this position are expected to write and maintain infrastructure automation code, help developers troubleshoot issues with their microservices both in and out of production, contribute code to our microservice management platform, and more. As a senior member of the team, you will be expected not only to function as an individual contributor but also to help build the team’s roadmap by proposing new initiatives and seeing them through to completion.

Your day-to-day workload will vary but here are some things the team has done or is doing:

  • Spinning up Argo Workflows to build a pipeline for running Terraform in Kubernetes
  • Updating the Nginx container image used by all our services to support Datadog APM
  • Helping with our migration from one CDN to another
  • Migrating our microservices from Nginx to Envoy to better leverage Istio
  • Decreasing spend by transferring our stateless workloads to preemptible GCE nodes

Job Details

  • Build tooling to enhance visibility into microservices performance and reliability
  • Write, maintain, and deploy the Terraform modules that define our cloud infrastructure
  • Influence the cloud architecture of our learning platform
  • Contribute code to our microservice management platform, the Chassis
  • Actively recommend improvements to company infrastructure and policy
  • Be part of a 24/7 on-call rotation and a 9-5 triage rotation
  • Document system and application engine configurations and procedures
  • Monitor systems, applications, services, and network performance/availability
  • Work with the team to help maintain the overall security of the Platform
  • Keep apprised of new developments in cloud solutions and educate other team members on related skills

About You

  • Proficient in operating microservices and (managed) databases in production environments
  • Comfortable designing and writing frontend and/or backend APIs
  • Deeply familiar with cloud service providers (GCP, AWS, or Azure)
  • Proven track record of operating and building applications for Kubernetes clusters
  • Deep understanding of how to implement and utilize modern SaaS monitoring tools
  • Excellent oral communication skills and good writing skills
  • Proficiency in at least two programming languages such as Python, JavaScript, or Bash
  • Knowledge of configuration management technologies such as Ansible or Chef

For Senior Site Reliability Engineers, we are interested in individuals with a deep understanding of cloud infrastructure and software development and a sincere interest in education. We desire conscientious candidates who work comfortably in an autonomous fashion and in a self-driven agile environment. You should be willing and able to work with a small focused team to bring individual features to fruition, but also to work with the broader team of engineers to collaborate on initiatives that span the whole learning platform. We value colleagues who are helpful, respectful, communicate openly, and are always willing to do what’s best for our users. Senior team members at O’Reilly are expected to be willing and capable mentors who can bring others up to their skill level.

We invite developers who value automated testing and welcome code reviews as an essential element of continuous learning. The people on our platform team have taken many traditional and non-traditional paths to the developer profession, and we welcome diverse teams that are bound together by a mutual love for learning.


Minimum Qualifications

  • 4-year college degree in Computer Science or related field, or combination of relevant education and experience
  • 5+ years of experience in Linux System Administration
  • 5+ years of proven experience with Cloud Infrastructure
  • 1+ year of experience operating Kubernetes clusters in production environments

About O’Reilly Media

O’Reilly’s mission is to change the world by sharing the knowledge of innovators. For over 40 years, we’ve inspired companies and individuals to do new things, and do things better, by providing them with the skills and understanding necessary for success.

At the heart of our business is a unique network of experts and innovators who share their knowledge through us. O’Reilly Learning offers exclusive live training, interactive learning, a certification experience, books, videos, and more, making it easier for our customers to develop the expertise they need to get ahead. And our books have been heralded for decades as the definitive place to learn about the technologies that are shaping the future. Everything we do is to help professionals from a variety of fields learn best practices and discover emerging trends that will shape the future of the tech industry.

Our customers are hungry to build the innovations that propel the world forward. And we help them do just that.


At O’Reilly, we believe that true innovation depends on hearing from, and listening to, people with a variety of perspectives. We want our whole organization to recognize, include, and encourage people of all races, ethnicities, genders, ages, abilities, religions, sexual orientations, and professional roles.
