Site Reliability Engineer (Delivery Automation)
Diligent is the world’s largest GRC SaaS provider, serving nearly 1 million users from 25,000 organizations around the world. Our software enables holistic and informed conversations about governance, risk and compliance and ensures CEOs, CFOs and the board have an integrated view of audit, risk, information security, ethics and compliance from across the organization. Our world-changing idea is to bring technology, insights and confidence to leaders so they can build more effective, equitable, and successful organizations – and create lasting, positive impact on the world. We seek to empower organizations to be better for their stakeholders and communities, for their customers and employees, for their bottom line. Headquartered in New York, Diligent also has offices in Washington D.C., London, Galway, Budapest, Vancouver, Bengaluru, Munich, and Sydney.
Diligent is looking for a Site Reliability Engineer, Delivery Automation, in Bangalore, India. You will be responsible for system availability, infrastructure, and process automation. Your focus will be to reduce toil by measuring repetitive, manual tasks and automating them where possible. Strong communication skills are required, as you will work with multiple development teams to implement more efficient and reliable procedures. The ideal candidate is someone who is always looking to eliminate manual Customer Service requests.
- A team player with excellent communication and collaboration skills.
- Always ready to learn and experiment with leading, innovative technologies.
- Not afraid of change and passionate about process improvement.
- Someone who cares about reliability, availability, security, and wants to develop autonomous solutions.
- Design, code, test, and deliver software and procedures to eliminate toil.
- Define automated playbooks that integrate with ServiceNow and our incident management systems.
- Produce documents that outline the incident resolution process and the runbook requirements for the incident resolution workflow.
- Lead incident response efforts.
- Deploy new infrastructure with automation.
- Define and measure SLIs, SLOs, and SLAs.
- Improve application performance and availability via log and metric analysis and anomaly detection.
- Software Engineering, SRE, or DevOps experience.
- Knowledge of configuration management, release automation, and orchestration technologies such as Ansible, Terraform, Octopus, Jenkins, and Puppet.
- Good working experience with a wide range of technologies, frameworks, and platforms such as AWS/Azure Cloud, Windows, Linux, and Kubernetes.
- HashiCorp Vault and database management are a plus.
- Hands-on experience with Python, PowerShell, .NET, or Go.
- Working knowledge (or willingness to learn) of automating infrastructure components (e.g., routers, load balancers, cloud products, container systems, compute, storage, and networks).
- Participate in architectural sessions and provide solutions to complex problems.
- Strong problem solving, analytic, and time management skills.