Lead Cloud Site Reliability Engineer
Description
Overview
The Lead Cloud Site Reliability Engineer is a senior technical leader within the SRE team, responsible for driving strategic initiatives, leading team-level planning, and shaping the technical vision for reliability engineering across Tyler Technologies. This role requires strong collaboration, deep technical expertise, and the ability to influence engineering practices across teams. The ideal candidate brings experience, passion, and thoughtful opinions on how to build resilient systems, balanced with the flexibility to adapt to evolving business needs.
Responsibilities
- Lead technical planning and execution of SRE initiatives, ensuring alignment with business priorities and cross-functional teams.
- Drive the implementation and adoption of observability and incident management tooling to monitor AWS EKS-based systems, with a strong focus on performance, reliability, scalability, and rapid response.
- Ensure architecture and deployment models support SLA commitments and are designed for future scale.
- Apply software engineering best practices to resolve complex problems and reduce operational toil.
- Collaborate with product support and development teams to enhance customer experience through automation and self-service tools.
- Conduct root cause analysis and implement preventative measures to improve system resilience.
- Lead and participate in incident retrospectives to strengthen future response efforts.
- Participate in on-call rotations, providing critical support and guidance.
- Mentor engineers across the SRE and broader engineering organization.
- Advocate for SRE principles and foster a culture of operational excellence.
- Own and foster strategic relationships with key vendors and third-party service providers to ensure alignment with reliability goals, tooling needs, and evolving business requirements.
Qualifications
- 6+ years of experience in software engineering, systems engineering, or technical operations, with a focus on large-scale cloud applications.
- Proven leadership in driving technical initiatives and team-level planning.
- Expertise in Site Reliability Engineering concepts and practices.
- Experience implementing and leveraging observability tooling to drive system reliability, performance insights, and incident response improvements.
- Experience deploying and supporting containerized applications on cloud platforms, preferably EKS on AWS.
- Proficiency in infrastructure as code technologies, such as Terraform.
- Strong software engineering skills in languages such as Python, JavaScript, or Go.
- Familiarity with DevOps and CI/CD methodologies.
- Bachelor’s degree in Computer Science or a related field.