Senior Director of Platform and Site Reliability Engineering
Palo Alto Networks® is the fastest-growing security company in history. We offer the chance to be part of an important mission: ending breaches and protecting our way of digital life. If you are a motivated, intelligent, creative, and hardworking individual, then this job is for you!
The Sr. Director, SRE reports directly to the Vice President of Infrastructure and leads the following functions in IT:
Platform Engineering: Build and manage a Platform as a Service for all IT developers.
Site Reliability Engineering: End-to-end ownership of application availability, performance, and scalability.
Site Reliability and Platform Engineering are key functions in IT. This highly visible senior leadership role is responsible for the infrastructure platform and application support strategy, the roadmap, and the technical implementation of the IT Transformation programs.
- Manage Compute Platform as a Service with end-to-end responsibility for delivering and supporting the on-prem and cloud compute platforms (GCP, AWS), VMware, Kubernetes, Terraform, Ansible, CI/CD, Artifactory, etc., for continuously deploying applications.
- Own automation for the delivery of Platform services using Infrastructure as Code. Build standard platform playbooks that can be consumed by multiple teams across the organization.
- Lead delivery of cloud infrastructure strategies aligned with business objectives, with a focus on large-scale application migrations to the cloud involving design, implementation, and infrastructure automation.
- Build a high-performing team of cloud platform SMEs and platform leads while mentoring traditional platform SMEs on cloud computing best practices, technology, and adoption.
- Build and manage an SRE function that owns application availability and performance, managing both through automation and proactive/predictive alerts, and using a strong data analytics tool set to identify areas for improvement.
- Implement comprehensive service monitoring to ensure uptime and performance, including synthetic, real-user, system, and application performance monitoring, dashboards, etc.
- Define, measure, and meet key Service Level Objectives covering availability, performance, incidents, and chronic problems.
- Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence; eventually automate the response to all non-exceptional service conditions.
- Partner with application and business stakeholders to ensure that a high-quality product is developed and released into production. Establish and periodically update the Release Policy, which governs the release process and details release categories, release activities, roles and responsibilities, exceptions, etc.
- Work closely with Enterprise Architecture and Information Security to specify and document solutions and practices.
- Keep abreast of evolving threats, risks, and industry trends, and work to implement best practices in the organization.
- BA/BS degree in Computer Science or related technical field, or equivalent practical experience.
- 10+ years of hands-on technical experience combined with strong management and communication skills.
- Solid understanding of Windows, Linux, networking, TCP/IP, routing, switching, firewalls, load balancers, and other infrastructure components.
- Solid understanding of modern cloud technologies and the developer family of products (GKE, Istio, Serverless, Cloud Build, Monitoring and Logging), as well as microservices, DevSecOps, etc.
- Experience running revenue-generating applications in a public cloud and IaaS, including real-world experience with at least one public cloud provider: AWS, Google Cloud, or Microsoft Azure.
- Experience building, scaling, and running production operations for heterogeneous applications.
- Strong troubleshooting experience and skillset to resolve incidents across multiple domains.
- Ability to nurture and support a strong operations culture: customer/service focus; excellent technology; high-quality implementations; self-motivated innovation and problem-solving.
- Demonstrated ability to establish and maintain metrics-based process improvement.
- Demonstrated ability to develop strong alliances with those outside of your immediate organization
- Experience in building and managing strong technical teams
- Excellent communication, organization, and time management skills.