Manager, Splunk Incident Response Team

Engineering Sydney, Australia


Join us as we pursue our disruptive new vision to make machine data accessible, usable and valuable to everyone. We are a company filled with people who are passionate about our products and seek to deliver the best experience for our customers. At Splunk, we’re committed to our work, customers, having fun and most importantly to each other’s success. Learn more about Splunk careers and how you can become a part of our journey!

Splunk Cloud is looking for a Manager to provide day-to-day leadership to our Incident Response Team (SIRT). This position is responsible for the design of the incident management process and for continuous service improvements, both of which are vital to achieving the objectives of the business. As manager of the Incident Response Team, you'll lead a team responsible for the 24/7 response to incidents on our rapidly growing Cloud Platform. You'll use analytics to plan, implement and continually improve processes that reduce overall MTTR. We're looking for someone to bring a fresh approach to problems of all shapes and sizes and help us build a best-in-class Incident Response Team.

Responsibilities:

  • Solve issues and participate in on-call support, ensuring stability and performance of the Splunk Cloud environment.
  • Partner with our SRE teams to deliver agile, highly automated capabilities to monitor applications and our cloud infrastructure.
  • Drive runbook automation and software-defined approaches to reliability, availability and change management.
  • Work closely with various groups within Cloud Operations to drive efficiencies, including authoring runbooks, defining key alert metrics, and maintaining the overall health and stability of monitoring.
  • Represent the Incident Response Team in meetings and process changes, and make recommendations on new procedures and processes.
  • Work with your peers across the organization to handle related or dependent release activities.
  • Act as a liaison between SRE, monitoring teams, support and leadership for new processes, tools, and knowledge transfers.
  • Oversee all Incident Commanders and leads and ensure all duties and tasks are being performed expertly and effectively during each shift.
  • Mentor and coach new team members.
  • Take on Incident Commander responsibilities, contribute to post-incident reviews, and follow through on action plans.

Who you are:

  • 2-4 years in a hands-on management position.
  • Deep understanding of Cloud (AWS, Azure, GCP).
  • Experience in Systems Administration or Technical Operations.
  • Hands-on experience maintaining and troubleshooting Linux/UNIX servers in a production environment.
  • Strong knowledge of and experience with configuration management.
  • Collaborative with outstanding social and interpersonal skills.
  • Calm and collected in stressful situations, such as a major service outage.
  • Take-charge personality and the ability to drive a plan to completion.
  • Comfortable working in a dynamic environment with a highly technical team.
  • Demonstrated attention to detail, follow-through, and ability to prioritize quickly.

We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

Thank you for your interest in Splunk!