Site Reliability Engineer

Engineering Platform Remote, United States


About Your Team

O’Reilly Media’s Site Reliability Engineering team is made up of a diverse set of engineers tasked with monitoring, maintaining, and upgrading the Cloud Infrastructure and developer tooling that supports our online learning platform. The SRE team at O’Reilly is small and growing, so there is plenty of room for new members to shape the team’s vision and direction.

We’re a highly collaborative and supportive team that focuses on ensuring each member of the team is exposed to as much of our stack as possible. We believe in “raising the water level” so that each member of the team is given an opportunity to grow and to help others grow. 


About the Job

Site Reliability Engineers at O’Reilly work closely with both Product Engineers and Platform Engineers to ensure that our Online Learning Platform is always stable. Your work will focus on developing infrastructure automation code, contributing to our microservice management platform, and carrying out routine maintenance and upgrades to our internal services. The job also includes participating in an on-call rotation and supporting other developers who are debugging issues in production.

Here are some initiatives the team has undertaken recently:

  • Spinning up a workflow engine to run Terraform from within Kubernetes
  • Instrumenting our sidecar proxies so they report APM data 
  • Helping with our migration from one CDN to another 
  • Decreasing spend by transferring our stateless workloads to preemptible GCE nodes
  • Replacing our current sidecar proxy with one that provides better telemetry 

Job Details

  • Build tooling to enhance visibility into microservices performance and reliability
  • Write, maintain, and deploy the Terraform modules that define our cloud infrastructure
  • Contribute code to our microservice management platform - the Chassis
  • Be part of a 24/7 on-call rotation and a 9-5 triage rotation
  • Document system and application engine configurations and procedures
  • Monitor systems, applications, services, and network performance/availability
  • Work with the team to help maintain the overall security of the Platform
  • Keep apprised of new developments in cloud solutions and educate other team members on related skills

About You

  • Experience operating microservices and (managed) databases in production environments
  • Able to design and write frontend and/or backend APIs
  • Deeply familiar with cloud service providers (GCP, AWS, or Azure)
  • Experience with operating and building applications for Kubernetes clusters
  • Understanding of how to implement and utilize modern SaaS monitoring tools 
  • Excellent oral communication skills and good writing skills
  • Proficiency in at least two programming languages such as Python, JavaScript, or Bash
  • Knowledge of configuration management technologies such as Ansible or Chef

For Site Reliability Engineers, we are interested in individuals with a deep understanding of cloud infrastructure and software development and a sincere interest in education. We desire conscientious candidates who work comfortably in an autonomous fashion and in a self-driven agile environment.  You should be willing and able to work with a small focused team to bring individual features to fruition, but also to work with the broader team of engineers to collaborate on initiatives that span the whole learning platform.  We value colleagues who are helpful, respectful, communicate openly, and are always willing to do what’s best for our users.

We invite developers who value automated testing and welcome code reviews as an essential element of continuous learning. The people on our platform team have taken many traditional and non-traditional paths to the developer profession, and we welcome diverse teams that are bound together by a mutual love for learning.


Minimum Qualifications

  • 4-year college degree in Computer Science or related field, or combination of relevant education and experience
  • 3+ years of experience in Linux System Administration
  • 3+ years of proven experience with Cloud Infrastructure
  • 1+ year of experience operating Kubernetes clusters in production environments

About O’Reilly Media

O’Reilly’s mission is to change the world by sharing the knowledge of innovators. For over 40 years, we’ve inspired companies and individuals to do new things, and do things better, by providing them with the skills and understanding that are necessary for success.

At the heart of our business is a unique network of experts and innovators who share their knowledge through us. O’Reilly Learning offers exclusive live training, interactive learning, a certification experience, books, videos, and more, making it easier for our customers to develop the expertise they need to get ahead. And our books have been heralded for decades as the definitive place to learn about the technologies that are shaping the future. Everything we do is to help professionals from a variety of fields learn best practices and discover emerging trends that will shape the future of the tech industry.

Our customers are hungry to build the innovations that propel the world forward, and we help them do just that.


At O’Reilly, we believe that true innovation depends on hearing from, and listening to, people with a variety of perspectives. We want our whole organization to recognize, include, and encourage people of all races, ethnicities, genders, ages, abilities, religions, sexual orientations, and professional roles.
