Manager, Site Reliability Engineering

Operations | Requisition ID 4543 | Foster City, California

Description

Qualys’ site reliability engineering (SRE) team supports all Qualys products across all our production environments, including our 8 global multi-tenant platforms and over 100 on-premises setups. Effective incident management is a big part of our SRE efforts to minimize the disruption of an incident and restore normal business operations as quickly as possible. We are seeking a highly motivated and talented Manager, Site Reliability Engineering to lead our SRE team, which works on a 24/7 rotation. In this role, you will be responsible for building and leading a group that responds proactively to alerts and is accountable for the efficiency and effectiveness of service delivery over the life cycle of an incident. The group also deploys applications to production, automates those deployments, and keeps the production environments stable.

 

We are looking for an individual who believes in SRE principles, has a software engineering mindset, and wants to be part of an organization that is transforming itself to be more agile and nimble operationally.

 

Responsibilities

  • Ensure effective performance and 24x7 availability of all production systems.
  • Monitor alerts coming out of all Qualys platforms, and coordinate with Operations/SRE/DBRE/Engineering teams as necessary to take preventive or corrective action to resolve any incidents, with the goal of minimizing mean time to recovery (MTTR).
  • Put in place and manage an effective on-call rotation within the team.
  • Work with engineering teams to set up proper monitoring and alerting thresholds across all Qualys services and applications so that the SRE team can focus on the key areas needed to stabilize the platforms.
  • Own accountability for platform uptime SLAs.

 

Desired Skills

  • Strong prior production operations experience leading a first responder incident management team for a high-traffic platform.
  • Solid exposure to monitoring tools such as Prometheus, ELK, Kibana, AppDynamics, Splunk, Grafana, etc.
  • Strong hands-on experience with Kubernetes, Jenkins, and Terraform templates.
  • Strong experience with application capacity sizing.
  • Good experience configuring and managing on-call and alerting platforms such as PagerDuty.
  • Comfortable working in a dynamic environment, with the ability to coordinate multiple tasks simultaneously.
  • Strong verbal and written communication skills are essential, as is the ability to work in a disciplined manner and to remain composed under pressure.
  • Obtain and exhibit expert knowledge of Qualys’ infrastructure, monitoring, products, and services.
  • Coordinate with the incident management team to produce weekly reports and dashboards for various products that clearly show, backed by data, the areas that need improvement.
  • Must have a strong passion for continuous improvement.

EEO Employer/Vet/Disabled