Senior Site Reliability Engineer

Production Operations Bangalore, India


Description

To support our continued business growth in the US and in international markets, we are setting up a world-class product development and operations support center in Bangalore, India. We are looking for people with a passion for innovation and for the creative use of new technologies in building consumer applications in the retail domain.

We're looking for talented and driven Site Reliability Engineers. In this role, you will work to improve the reliability and performance of our services. You will help expand, maintain and troubleshoot a geographically diverse network of services in a 24x7 operations environment across different verticals.

 

Responsibilities

 

  • Responsible for the uptime and reliability of Quotient's infrastructure.
  • Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
  • 24x7x365 monitoring of and response to events, incident management, problem management, change management activities, KPI reporting, and CMDB management.
  • Responsible for activities/projects involving datacenter migration to the cloud (GCP, Azure, etc.).
  • Responsible for managing activities/projects for the Network team, DB team, BI, etc.
  • Interest in designing, analyzing and troubleshooting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Troubleshoot issues spanning all layers of the OSI model. Work with Engineering teams to achieve maximum network and application uptime and swift resolution of all issues.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation and evolve systems to improve reliability and velocity.
  • Use industry tools such as Netcool, ServiceNow, Moogsoft, Cacti, SolarWinds, Nagios, Splunk, Cloudera, PagerDuty, etc.
  • Provide input into process and procedure for increasing reliability, reducing procedural errors and managing change within the datacenters.
  • Extensive process-level and node-level monitoring and auto-healing of the entire cluster.
  • Managing, provisioning and servicing datacenter and cloud servers.
  • Contribute to back-end services and to their infrastructure system design.
  • Responsible for identifying problem incidents and driving them to resolution.
  • Responsible for driving RCA for high-priority incidents and working with the respective development teams on preventive measures.

 

Qualifications

 

  • Experience in a Systems Engineering/SRE role in a large-scale environment.
  • Experience troubleshooting incidents/problems and working with a team to resolve large-scale production issues.
  • Knowledge of at least one programming language, such as Python, Perl, Java, C++ or PowerShell.
  • Strong knowledge of Linux systems.
  • Good understanding of standard networking protocols and concepts such as HTTP, DNS, TCP/IP, the OSI model and load balancing.
  • Technical skills in Apache, Tomcat, Jetty, Memcached, Java, CDN technologies, network analysis tools, or equivalent.
  • Familiarity with logging systems like Splunk.
  • Experience with Puppet/Ansible.
  • Experience with monitoring tools such as AppDynamics, ExtraHop, SolarWinds, etc.
  • Experience with Git, continuous integration and testing methods.
  • Effective written and verbal communication skills are required.
  • Strong problem-solving and troubleshooting skills are required.
  • Willingness to work in a 24x7x365 setup, including night shifts.