Manager, Site Reliability Engineering
Qualys’ site reliability engineering (SRE) team supports all Qualys products across all our production environments, including our 8 global multi-tenant platforms and over 100 on-premises setups. Effective incident management is a big part of our SRE efforts to minimize the disruption caused by an incident and restore normal business operations as quickly as possible. We are seeking a highly motivated and talented Director, Site Reliability Engineering to lead our SRE team, which works on a 24/7 rotation. In this role, you will build and lead a group that responds proactively to alerts and is accountable for the efficiency and effectiveness of service delivery over the life cycle of an incident. The group is also responsible for deploying applications to production, automating those deployments, and keeping the production environments stable.
We are looking for an individual who believes in SRE principles, has a software engineering mindset, and wants to be part of an organization that is transforming itself to be more agile and nimble operationally.
- Ensure effective performance and 24x7 availability of all production systems.
- Monitor alerts from all Qualys platforms, and coordinate with Operations/SRE/DBRE/Engineering teams as necessary to take preventive or corrective action to resolve any incidents, with the goal of minimizing MTTR.
- Put in place and manage an effective on-call rotation within the team.
- Work with engineering teams to set up proper monitoring and alerting thresholds across all Qualys services and applications so that the SRE team can focus on the key areas that stabilize the platforms.
- Be accountable for platform uptime SLAs.
- Strong prior production operations experience leading a first responder incident management team for a high-traffic platform.
- Solid exposure to monitoring tools such as Prometheus, ELK, Kibana, AppDynamics, Splunk, Grafana, etc.
- Strong hands-on experience with Kubernetes, Jenkins, and Terraform templates.
- Strong experience with capacity sizing for applications.
- Good experience in configuring and managing on-call and alerting platforms such as PagerDuty.
- Comfortable working in a dynamic environment, with the ability to coordinate multiple tasks simultaneously.
- Strong verbal and written communication skills are essential, as is the ability to work in a disciplined manner and remain composed under pressure.
- Acquire and demonstrate expert knowledge of Qualys’ infrastructure, monitoring, products, and services.
- Coordinate with the incident management team to produce weekly reports and dashboards for the various products that show, backed by data, which areas need improvement.
- Must have a strong passion for continuous improvement.