Cloud Operations Engineer - Observability
Description
Join us as we pursue our ground-breaking vision to make machine data accessible, usable, and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we are committed to our work, customers, having fun, and most significantly to each other’s success.
The Splunk Observability Cloud provides full-fidelity monitoring and fixing across infrastructure, applications, and user interfaces, in real-time and at any scale, to help our customers keep their services reliable, innovate faster, and deliver great customer experiences.
Role
You will help us run one of the largest and most sophisticated cloud-scale, big-data, and microservices platforms in the world. You will be responsible for monitoring and resolving issues that affect the availability and performance of critical components of Splunk Observability Cloud. You will use your Kubernetes, cloud, and infrastructure-as-code knowledge to enhance Splunk Observability Cloud infrastructure while reducing its operational costs.
As such, you will provide on-call support and incident management for our customers. To ensure coverage, you will work a 40-hour Monday-to-Friday week and be available for production support on a rotating basis on a Saturday, a Sunday, or both. The flexible rotating roster is intended to balance employee well-being with business requirements so that customer expectations are met.
Responsibilities:
- Respond to monitoring alerts according to defined playbooks and procedures.
- Enhance playbooks and procedures to reduce on-call toil.
- Participate in Post Incident Reviews and discussions.
- Ensure stability and performance of production environments.
- Deploy software to production environments.
- Build effective working relationships with cross-functional team members.
- Suggest process improvements that enhance operational efficiency.
- Implement various process improvements and operational efficiencies.
Qualifications:
- You have 5+ years of related experience in Cloud Operations.
- You have experience with Cloud Computing Platforms, such as AWS and GCP.
- You have experience with Kubernetes and Docker.
- You have experience with one or more scripting languages, such as Python or Bash.
- You have 2+ years of experience in incident response and major incident management.
- You enjoy problem-solving and analyzing global-scale distributed systems.
- You are collaborative with strong interpersonal and communication skills, both verbal and written.
- You remain calm and collected in stressful situations, such as a major service outage.
- You demonstrate attention to detail, follow-through, and the ability to prioritize quickly.
- You demonstrate good judgment on when to solve problems individually and when to involve others.
- You have experience with infrastructure as code (Terraform, Helm, YAML).
Nice to have:
- Experience handling SaaS applications for a large customer base.
- Experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, GitLab, and Artifactory.
- Familiarity with microservices fundamentals including Service Mesh using Istio, service discovery, deployment strategies, monitoring, scheduling, and load balancing.
We are an equal-opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.
Thank you for your interest in Splunk!