Sr. Manager SRE (REMOTE)
POSITION TITLE: Sr Manager, SRE
REPORTS TO: Director of Tech Ops
DEPARTMENT: Tech Ops
LOCATION: Park City / Remote
As the manager of our SRE team, you will lead the team in producing mission-critical platforms, tools, and processes that ensure the highest levels of availability and reliability for all our applications. We need creative and innovative problem solvers who can partner with our product teams to make their services more usable. Our SRE team has a standout opportunity to build tools, frameworks, and cloud platforms that will support our company's growth over the next decade.
This leader should have an excellent engineering foundation and be well-rounded in architecting, building, and maintaining world-class applications and services, with an emphasis on stability, scalability, and availability. To be successful, you should possess the technical acumen and managerial experience to lead a group of high-potential business service managers and engineers.
ESSENTIAL DUTIES & RESPONSIBILITIES:
- Manage a team of Service Reliability Engineers and provide them with an environment of coaching, mentoring, feedback, and staff development
- Advocate for and prioritize platform automation, development code fixes, and other platform improvements
- Understand your domain and work across product teams to ensure high availability, scalability, fault tolerance, etc.
- Reduce toil for Product teams and SRE in any way you can
- Build and maintain a highly stable, reliable, and scalable technology infrastructure at web scale, ensuring all of our products have best-in-class availability and performance metrics. You need to have an overall understanding of our technology architecture and a vision for long-term global infrastructure architecture.
- Measure availability, performance, capacity, and utilization of the site. You need to obsess about instrumentation of our infrastructure and aggressively drive visibility and transparency through integrated dashboards.
- Manage your team's metrics and report regularly to leadership on team and platform health
- Aggressively detect, monitor, report, manage, escalate, and ultimately prevent site failures and degradations in service; in other words, own incident management and problem management.
- Effectively collaborate with DevOps, architecture, development, and other teams
- Take ownership of security across the board, especially data protection, and lead the company's regulatory compliance from a technology perspective.
- Manage vendors, both during outages and for integrations
QUALIFICATIONS, SKILLS & ABILITIES:
- Track record of building great operations teams through both hiring and coaching existing team members.
- 5+ years of demonstrated leadership, key decision-making, and process re-engineering experience
- Experience with these or similar tools and technologies: GCP, RDBMS, Bash, Python, Java, Git, Jenkins, Kubernetes, ArgoCD, Terraform, Ansible, OpenTelemetry, Nagios, SumoLogic, etc.
- Expertise in analyzing complex application, database, network, and OS issues across a large-scale, distributed, customer-facing system
- Strong communication skills and ability to work effectively across multiple business and technical teams
- Track record of managing 24x7x365 global systems at a 99.9% SLA or higher, with extremely high transparency through dashboards, KPIs, SLAs, and other relevant success metrics.
- Track record of owning release management, with a heavy focus on tools and an intense passion for getting software out fast, at or close to a continuous delivery state.
- Track record of managing and negotiating with vendors.