SRE Manager

IT Administration | San Francisco, California | Remote, United States


Description

ON24 is on a mission to transform the way businesses drive revenue and customer engagement through data-rich digital experiences. Powered by the ON24 Platform, marketers create and deliver live, always-on and personalized webinar, content and virtual event experiences to engage audiences in real-time, to generate powerful buying signals and to accelerate pipeline. With billions of engagement minutes created, ON24 is the network where enterprises engage prospects and customers at global scale. Headquartered in San Francisco, ON24 has a wide international footprint serving the regions of North America, EMEA and JPAC. For more information, visit https://www.on24.com.

The SRE Manager will lead a small team dedicated to proactively building reliability into the ON24 Digital Experience platform, which comprises many complex, integrated systems that use multiple programming languages, third-party services, networks and SaaS integrations. The SRE Manager will be tasked with improving the overall observability and support of the applications and infrastructure. In addition, the manager will need a breadth of knowledge and the ability to pull different technology disciplines and soft skills together toward a common goal: proactively building resilience into ON24’s IT infrastructure and applications. The manager will build a team with the troubleshooting and scripting skills to improve the monitoring, availability, performance and security of ON24’s customer-facing services and to ensure they are designed for 24x7 availability. Site reliability engineering focuses on automation, system design and improvements to system resilience.

Core responsibilities:

1.) People management

Along with the normal administrative work required of a people manager, SRE managers need to know how different disciplines can come together on an SRE team. SRE teams often act somewhat independently of other engineering teams and need to work with some level of autonomy. However, it’s important that site reliability engineering managers are also well connected with the broader IT, engineering and business teams, staying up to date on feature development and how it could affect the system’s overall reliability.

2.) Setting SLOs, SLAs, and SLIs

Service-level objectives (SLOs), service-level agreements (SLAs) and service-level indicators (SLIs) are essential to SRE teams. The site reliability engineering manager will define what it means for the system to be ‘available’ and set the availability SLO, the internal target for the system. The SRE manager then provides an SLA to business teams and the rest of engineering that states how much availability they can promise to customers. Finally, the team tracks SLIs to evaluate whether the system is meeting the required percentage of availability.
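As a rough illustration of how an availability SLI is checked against an SLO, the sketch below computes the measured availability and the share of the error budget still unspent. The function name, the request counts and the 99.9% target are hypothetical and chosen only for the example; they are not ON24 figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityReport:
    sli: float               # measured fraction of successful requests
    slo: float               # internal availability target, e.g. 0.999
    budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    meets_slo: bool


def evaluate_availability(good_requests: int, total_requests: int,
                          slo: float) -> AvailabilityReport:
    """Compare a request-based availability SLI against an SLO.

    Assumption: availability is defined as good requests / total requests
    over a reporting window. Other definitions (for example time-based
    uptime) are equally valid.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = good_requests / total_requests
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1 - actual_failures / allowed_failures
    return AvailabilityReport(sli, slo, budget_remaining, sli >= slo)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed.
    print(evaluate_availability(999_500, 1_000_000, slo=0.999))
```

An SLA promised to customers would typically be set looser than the internal SLO, so that the team notices budget burn before a contractual commitment is at risk.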

3.) Project planning and prioritization

Site reliability engineering managers are also in charge of project planning and task prioritization. It’s important that SRE managers sit in on quarterly planning and sprint planning with the greater engineering and IT teams. Then, the site reliability engineering manager can assess the key objectives for the next few sprints and pass those priorities to the rest of the SRE team. This way, the SRE team can begin building features and functions that proactively monitor the health of new features, communicate observations to the rest of the team and add reliability to the overall architecture.

4.) Improving the on-call incident response process

The SRE manager will also play a role in scaling and optimizing the overall on-call process. Because incident response is such a major part of maintaining uptime and delivering reliable services, the SRE manager will be an important part of the cross-functional effort to investigate and remediate issues that impact or could impact service delivery. The SRE team will take ownership of setting on-call rotations, alert rules, communication methods and incident response plans.

5.) Opening up communication channels

The Site Reliability Engineering manager will have visibility into how teams across engineering and IT are working and can establish communication best practices throughout the entire software development lifecycle.  The SRE team will track the effectiveness of these practices and iterate when necessary.

6.) Improving service observability

Google’s Site Reliability Engineering book lays out the four golden signals of monitoring: latency, traffic, errors and saturation. These signals are only the first step toward a highly observable system, but implementing them is a solid starting point for any SRE manager. Observability is crucial for enabling site reliability engineers to identify areas for improvement, prioritize future work and learn from the way their system behaves.
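To make the four signals concrete, here is a minimal sketch that derives them from a batch of request records collected over a fixed window. The `Request` type, the `golden_signals` function and the capacity figure are hypothetical. Treating saturation as traffic divided by an assumed capacity is a simplification; in practice saturation is usually measured on the most constrained resource (CPU, memory, I/O).

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors and saturation for one window."""
    if not requests:
        raise ValueError("requests must not be empty")
    if window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("window_seconds and capacity_rps must be positive")

    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * count).
    rank = max(1, (95 * count + 99) // 100)

    traffic_rps = count / window_seconds
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": traffic_rps,
        "error_rate": sum(1 for r in requests if not r.ok) / count,
        # Assumption: saturation approximated as load relative to capacity.
        "saturation": traffic_rps / capacity_rps,
    }


if __name__ == "__main__":
    # Hypothetical sample: 100 requests in 10 seconds, the 2 slowest failed.
    sample = [Request(latency_ms=float(i), ok=i <= 98) for i in range(1, 101)]
    print(golden_signals(sample, window_seconds=10, capacity_rps=20))
```

A production setup would normally collect these values through a monitoring system rather than compute them by hand; the sketch only shows what each signal measures.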

7.) Proactively testing the flexibility and resilience of the system

Site reliability engineering managers should proactively run tests against their applications and infrastructure. This will help uncover faults and limitations that can then be addressed to further enhance system reliability.

 

Other responsibilities:

  • Use your coding and application automation skills to implement automated tests, automated deployments, and operational tools for a hybrid on-prem/cloud platform.
  • Collaborate with Operations, Engineering, and Support teams to troubleshoot issues and implement a program of continuous improvement to maximize service uptime.
  • Set strategic and operational goals and work with the team to deliver on them.
  • Implement proactive monitoring, alerting, trend analysis and self-healing systems.
  • Participate in on-call rotations, driving restoration and repair of service-impacting issues
  • Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
  • Contribute to product development / engineering as needed to ensure Quality of Service of Highly Available services
  • Take a command-and-control role as Incident Manager during critical incidents, focusing on minimizing MTTR and MTTD.
  • Identify, evaluate and execute preventive measures to minimize or avoid impact to the customer experience, working proactively rather than in an incident-driven way.
  • Participate in After Action Reviews and facilitate Root Cause Analysis to drive Problem Records through to closure and prevent recurrence, including, but not limited to, resolution of product/service defects or design changes, infrastructure changes, or operational changes.
  • Partner with SREs and lead by example – as this will be a new team, the manager should also be a hands-on contributor initially.

 

Qualifications:

  • 5+ years of Systems/Applications automation in 24x7 Production Services environments
  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
  • Fluency with at least one current-generation scripting language used by DevOps professionals (Python, Perl, PHP, Ruby), or with Java and/or .NET development
  • Excellent troubleshooter who applies a systematic problem-solving approach spanning code, systems, and network theory and protocols (TCP/IP, UDP, ICMP), including the ability to read a packet capture (tcpdump, etc.)
  • Experience in designing, analyzing, and diagnosing large-scale distributed systems
  • Knowledge of Windows Server and/or Linux systems internals (system libraries, file systems, client-server protocols)
  • Experience with elastic scalability, fault tolerance and other cloud architecture patterns
  • Experience operating on AWS (both PaaS and IaaS offerings)
  • Experience in both Windows (2k8R2+) and Linux (CentOS)
  • Experience with security triage and forensic analysis
  • Experience with Continuous Integration and Continuous Delivery concepts, including infrastructure as code using tools like Terraform, CloudFormation and Chef/SaltStack
  • Familiarity with containerization concepts like Docker, and with PaaS services on AWS
  • NoSQL, Docker, microservices and forensic analysis experience is a big plus
  • Proven strength in SaaS services, experience in massive scale web operations

Additional skills:

  • ServiceNow, APM, Log Aggregation, Network Analysis, ITIL/ITSM, CI/CD, Ansible Automation

 

ON24 is proud to be an equal employment opportunities (EEO) workplace to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. In addition to federal law requirements, ON24 complies with applicable state and local laws governing nondiscrimination in employment in every location in which the company has facilities. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.

Pursuant to the San Francisco Fair Chance Ordinance, ON24 will consider for employment qualified applicants with arrest and conviction records.

 