Principal Site Reliability Engineer, Cloud Platform

Engineering San Jose, California

Join us as we pursue our disruptive vision to make machine data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we’re committed to our work, our customers, having fun, and, most importantly, each other’s success. Learn more about Splunk careers and how you can become a part of our journey!


Splunk's Cloud group is looking for an expert Principal SRE to help lead, design, and build the next generation of our large-scale Cloud offering. You will work on the core platform in the cloud.

You will:

  • Work across the organization to deliver quality products that delight Splunk's passionate users.
  • Lead teams of tight-knit, super smart engineers who are building a state-of-the-art, cloud-based environment for massive-scale data processing.
  • Mentor new engineers and help them achieve more than they thought possible.

Requirements:

  • You are passionate about building and running distributed systems at scale in production. You understand the challenges and trade-offs involved in building and deploying systems to production.
  • You are an expert in container deployment and orchestration technologies at scale, with strong knowledge of the fundamentals, including service discovery, deployments, monitoring, scheduling, and load balancing.
  • You have a deep understanding of systems architecture (network stack, file system, OS services, storage subsystems) and have implemented features around these.
  • You have experience building reliability features into application code. This may include emitting metrics or other health indicators, survivability, or multi-site features.
  • You've demonstrated the ability to collaborate effectively across functions.
  • You are enthusiastic about making the many users of your product happier every day.
  • You make decisions based on measurable data. You evangelize this cycle of measurement, experimentation, and improvement to other teams.
  • You understand how services scale, fail, and recover. You recommend architectural changes and work directly with applications to implement your designs.
  • You are passionate about reliability as a feature and about solving reliability challenges across the organization.

Preferred skills:

  • Experience running multi-cluster environments and a strong understanding of multi-tenancy and its security implications.
  • Knowledge of Kubernetes, Go, and Docker.
  • Experience with development and deployment in a hosted cloud environment, preferably AWS.

We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other legally protected characteristic applicable in the location in which the candidate is applying.

For job positions in San Francisco, CA, and other locations where required, we will consider for employment qualified applicants with arrest and conviction records. 

Thank you for your interest in Splunk!