Site Reliability Engineer
To support our continued business growth in the US and in international markets, we are setting up a world-class product development and operations support center in Bangalore, India. We are looking for people with a passion for innovation and creative use of new technologies in building consumer applications in the retail domain.
We're looking for talented and driven Site Reliability Engineers. In this role, you will work to improve the reliability and performance of our services. You will help expand, maintain and troubleshoot a geographically diverse network of services in a 24x7 operations environment across different verticals.
- Responsible for the uptime and reliability of Quotient's infrastructure.
- Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
- 24x7x365 monitoring and response to events, incident management, problem management, change management activities, KPI reporting, and CMDB management.
- Responsible for activities/projects involving datacenter migration to the cloud (GCP, Azure, etc.).
- Responsible for managing activities/projects for the Network team, DB team, BI, etc.
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Troubleshoot issues spanning all layers of the OSI model. Work with Engineering teams to achieve maximum network and application uptime and swift resolution of all issues.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation and evolve systems to improve reliability and velocity.
- Use industry tools such as Netcool, ServiceNow, Moogsoft, Cacti, SolarWinds, Nagios, Splunk, Cloudera, PagerDuty, etc.
- Provide input into process and procedure for increasing reliability, reducing procedural errors and managing change within the datacenters.
- Extensive process-level and node-level monitoring and auto-healing.
- Managing, provisioning and servicing datacenter and cloud servers.
- Contribute to back-end services and their infrastructure system design.
- Responsible for identifying Problem incidents and driving
- Responsible for driving RCA for high-priority incidents and working with the respective development teams on preventive measures.
- Experience in a Systems Engineering/SRE role in a large-scale environment.
- Experience troubleshooting incidents/problems and working with a team to resolve large-scale production issues.
- Knowledge of at least one programming language: Python, Perl, Java, C++, PowerShell, etc.
- Strong knowledge of Linux systems.
- Good understanding of standard networking protocols and components such as HTTP, DNS, TCP/IP, the OSI Model, networking and load balancing.
- Technical skills in Apache, Tomcat, Jetty, Memcached, Java, CDN technologies, network analysis tools, or equivalent.
- Familiarity with logging systems like Splunk.
- Experience with Puppet/Ansible.
- Experience with monitoring tools such as AppDynamics, ExtraHop, SolarWinds, etc.
- Experience with Git, continuous integration and testing methods.
- Effective written and verbal communication skills are required.
- Strong problem-solving and troubleshooting skills are required.
- Willing to work in a 24x7x365 setup, including night shifts.