Site Reliability Engineering Manager | 23233
Job Title: Site Reliability Engineering Manager
Location: San Diego, CA
ServiceNow, The Enterprise IT Cloud Company, is the industry-leading cloud platform provider for building enterprise applications. We are redefining markets and changing the perception of enterprise software. Our cloud platform allows enterprise IT to bring together business strategy, application design and operations in a powerfully simple solution.
To sustain our explosive growth, we are looking for drivers—people who thrive on responsibility and live for the next big challenge. We seek to employ the brightest and most forward-thinking talent on the planet, and we want to hire people who have their best work ahead of them, not behind them. Accelerate your career and succeed in an environment where you can make an impact daily. We invite you to join in to stand out.
ServiceNow is seeking an experienced individual to lead Infrastructure Site Reliability Engineering (SRE) in the US, a team of software, system, and network engineers with a broad set of expertise who operate and maintain the ServiceNow cloud infrastructure and its components. The SRE team is responsible for service availability for the customers who operate on our cloud. This team partners with functional Engineering / Infrastructure departments and Support teams to ensure our service is operable and maintainable in the most efficient and effective manner.
The SRE team operates 24/7/365, providing global coverage with a presence in the UK, AMEA and the US, and monitors the health of our cloud, providing value-added troubleshooting and outage management across all infrastructure and application issues. The team serves as the front line of infrastructure support through event monitoring, incident response and mitigation, providing world-class service to our cloud customer base. The team is responsible for the continued high availability and reliability of our SaaS platform and all infrastructure elements that support the environment. This team covers the day shift in the US from 8 am to 5 pm Pacific Time.
The candidate will be responsible for developing requirements, workflows, and communications pertaining to all incidents and escalations in the live environment that span the SRE. Candidates must maintain professional communications with external teams to ensure timely restoration of services and the continuous improvement of the services delivered by the SRE and by ServiceNow.
Reporting to the Director of Site Reliability Engineering, the successful candidate ensures that terabytes of data and all of our rapidly growing cloud services are highly available 24x7, reliable and scalable. The SRE Manager provides the SRE staff with direct management and incident leadership, including prioritization of all site reliability engineering and sustained operations efforts related to projects, tasks and goals. In addition, the SRE Manager will lead continuous improvement activity and drive the continuing development of the team and individuals within the team.
- Team Management
o A successful candidate will directly manage the US Site Reliability Engineering team. Duties include performance reviews, objective setting, work prioritization and overseeing all staff activity including tasks, projects, and goals. In addition, this role establishes career paths and implements training programs with partner teams and the Training Partner to promote career development within Site Reliability Engineering.
o This position is responsible for all incidents and escalations as they pertain to the SRE team, and for the associated processes and workflows, with particular focus on maintaining the performance and availability of the supported environments. The candidate must participate in the continued development and execution of SRE management processes including Incident, Problem, Configuration and Change management.
o The role is accountable for effectively on-boarding and preparing all of Site Reliability Engineering for new technology, systems and automation that will be supported or used by the Site Reliability Engineering team, while ensuring that all documentation is up to date.
- Process and Procedures
o A successful candidate will ensure appropriate rigor, discipline, consistency and predictability are applied across the entire organization with respect to how changes are scheduled, executed and measured.
o The SRE Manager will analyze current procedures and processes and drive continuous improvement efforts to ensure the SRE team provides a quality service across all functional areas. This will include setting up and continuously monitoring KPIs and metrics pertaining to individual, team, platform and service performance.
o The SRE Manager will also provide documentation and training to internal departments to facilitate day-to-day operations throughout the company, and will define, share and deliver insightful analysis across all metrics for the Operations teams.
- Inter-Departmental / Customer Interaction
o A successful candidate will act as the incident and crisis manager in situations that require orchestration of effort between multiple teams to resolve time-critical situations. This position will operate as one of a few select Incident Commanders for Crisis situations. The Incident Commander is tasked with leading and directing relief and resolution during a crisis situation while sharing this information with other departments and, in certain cases, customers.
o The role is responsible for conducting formal and informal training sessions with new hires, and educating individuals outside of Operations regarding team mission, charter and strategies. The SRE Manager will provide supplemental SRE support for US-Federal Teams, as needed and appropriate, and deliver internal communications regarding processes and procedures as they relate to the Site Reliability Engineering function.
- Extensive hands-on operations experience in a technical setting, including responsibility for personnel management, and a thorough understanding of Events, Monitoring, Alerts, Incidents, Outages, Crises, Change, Configuration and Problem processes.
- Strong understanding of, or experience in, Cloud Operations as it pertains to Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS).
- Strong working knowledge of operating in a follow-the-sun operational model, including geographic knowledge, talent acquisition, cultural dynamics, and cross-shift handovers and communications.
- Comprehensive knowledge of principles, methods, and techniques used across ITIL processes, preferably ITIL v3.
- Outstanding communication skills, both written and verbal, and very strong interpersonal skills.
- A working understanding of the technology associated with operating a service or platform in the cloud, including datacenters, networking, applications and relational databases.
- Participation in an on-call rotation for incident and crisis management responsibility.
- Familiarity with monitoring tools similar to Nagios, Cacti, Prometheus, Xymon, SMARTS, etc.
- Ability to multi-task in a fast-paced environment.
- Attention to detail and the ability to work independently and lead a team.
- Bachelor's degree in Computer Science or Information Systems or equivalent technical discipline, or similar work experience.
- RHCE or other Unix / Linux Certification
- CCNA or relevant certification
- Previous exposure to configuration management
- ITIL Certification
WORKING FOR SERVICENOW:
We are a dynamic and rapidly growing software company with a strong sense of dedication to our customers. We work hard but try not to take ourselves too seriously. This is a very collaborative and inclusive work environment where individuals strong on aptitude and attitude will have an opportunity to grow their professional careers through working with some of the most advanced technology and talented developers in the business. We provide competitive compensation, generous benefits, and a professional, yet relaxed atmosphere.
ServiceNow is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, national origin, age, disability, gender identity, or veteran status. If you are an individual with a disability and require a reasonable accommodation to complete any part of the application process, or are limited in the ability or unable to access or use this online application process and need an alternative method for applying, you may contact us at (408) 501-8550, or firstname.lastname@example.org for assistance.