Senior Site Reliability Engineer

Engineering Bangalore, India


Description

Quotient, Inc. (NYSE:QUOT) is a leader in digital coupons. We operate a promotion platform that connects great brands and retailers with consumers through web, mobile and social channels. We deliver digital coupons to consumers, including printable coupons, save-to-card coupons and coupon codes for e-commerce. Our company is transforming the multi-billion-dollar promotions industry into the digital world.

To support our continued business growth in the US and in international markets, we are setting up a world-class product development and operations support center in Bangalore, India. We are looking for people with a passion for innovation and for the creative use of new technologies in building consumer applications in the retail domain.

 

We're looking for a talented and driven Lead Site Reliability Engineer who will work to improve the reliability and performance of our services. The ideal candidate is a strong technical leader with a deep understanding of SRE culture.

 

You will help expand, maintain and troubleshoot a geographically diverse network of services in a 24x7 operations environment across different verticals.

Responsibilities

  • Responsible for the uptime and reliability of Quotient's infrastructure.
  • Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
  • 24x7x365 monitoring and response to events, incident management, problem management, change management activities, KPI reporting and CMDB management.
  • Develop automation to support SRE activities and achieve specific reliability and supportability goals.
  • Responsible for activities and projects involving datacenter migration to the cloud (GCP, Azure, etc.).
  • Responsible for managing activities and projects for the Network team, DB team, BI, etc.
  • Interest in designing, analyzing and troubleshooting large-scale distributed systems.
  • Identify process gaps and implement process improvements to increase operational efficiency.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Troubleshoot issues spanning all layers of the OSI model. Work with Engineering teams to achieve maximum network and application uptime and swift resolution of all issues.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation and evolve systems to improve reliability and velocity.
  • Use industry tools such as Netcool, ServiceNow, Moogsoft, Cacti, SolarWinds, Nagios, Splunk, Cloudera, PagerDuty, etc.
  • Provide input into process and procedure for increasing reliability, reducing procedural errors and managing change within the datacenters.
  • Extensive process-level and node-level monitoring and auto-healing of the entire cluster.
  • Managing, provisioning and servicing datacenter and cloud servers.
  • Contribute to back-end services and to their infrastructure system design.
  • Responsible for identifying problem incidents and driving them to resolution.
  • Responsible for driving RCA for high-priority incidents and working with the respective development teams on preventive measures.

Qualifications

  • B.Tech/M.Tech/MCA with 5 - 8 years of relevant experience.
  • Experience in a Systems Engineering/SRE role in a large-scale environment.
  • Experience troubleshooting incidents/problems and working with a team to resolve large-scale production issues.
  • Knowledge of at least one programming language: Python, Perl, Java, C++, PowerShell, etc.
  • Strong knowledge of Linux systems.
  • Good understanding of standard networking protocols and concepts such as HTTP, DNS, TCP/IP, the OSI model and load balancing.
  • Technical skills in Apache, Tomcat, Jetty, Memcached, Java, CDN technologies, network analysis tools, or equivalent.
  • Familiarity with logging systems like Splunk.
  • Experience with Puppet/Ansible.
  • Experience with monitoring tools such as AppDynamics, ExtraHop, SolarWinds, etc.
  • Experience with Git, continuous integration and testing methods.
  • Effective written and verbal communication skills are required.
  • Strong problem-solving and troubleshooting skills are required.
  • Willingness to work in a 24x7x365 setup, including night shifts.