Site Reliability Engineer
Title: Senior Site Reliability Engineer
Experience Level: 6 – 9 Years
Shifts: Yes, 24/7 on a rotation basis
Quotient Technology Inc. (NYSE: QUOT) is the leading digital promotions, media and analytics company using proprietary data to deliver personalized digital coupons and ads to millions of shoppers daily. Our core platform, Quotient Retailer iQ™, connects to a retailer’s point-of-sale system and provides targeting and analytics for consumer-packaged goods (CPG) brands and retailers. Our distribution network also includes our Coupons.com app and website, thousands of publishing partners and, in Europe, the Shopmium mobile app. We serve hundreds of CPGs, such as Clorox, Procter & Gamble (P&G), General Mills and Kellogg’s, and retailers like Albertsons Companies, CVS, Dollar General, Kroger and Walgreens. We operate Crisp Mobile, which creates mobile ads aimed at shoppers, and Ahalogy, a leading influencer marketing firm. Founded in 1998, Quotient is based in Mountain View, California, with offices across the USA and in Bangalore, India; Paris; and London. Learn more at Quotient.com, and follow us on Twitter @Quotient.
The Team: As a Site Reliability team, we enjoy working on challenges that no one has solved yet. As the first line of response for any issue at the company, we partner with other Engineering and Product teams to build the right toolset for delivering the best customer experience on Quotient's and its partners' sites. Together, SREs manage a large-scale system made up of thousands of servers in on-prem data centers and the cloud, with request rates in the tens of thousands per minute, sub-millisecond SLAs, and data measured in terabytes. We respond to production issues on a 24/7 basis and support a technology stack that comprises cloud (GCP, Azure, and AWS), VMware, Ubuntu, Kubernetes, Java, MySQL, Cassandra, distributed event streaming, memory stores, storage, and network devices (switches, routers, firewalls, load balancers).
You are the right fit: If you are passionate about operational excellence for the large-scale platforms and distributed systems that underpin the company's Promotion, Media, and Analytics offerings, you will be right for this role. You bring a software engineering perspective to delivering quality operations at scale and drive automation in every aspect of the job. You would be responsible for escalated events and incidents. You need to bring energy and a relentless focus on continuous improvement to a fast-paced environment, with a work-until-resolved attitude.
You would do (Responsibilities):
- Responsible for the uptime and reliability of Quotient's infrastructure.
- Responsible for activities and projects involving data center migration to the cloud (especially GCP).
- Management of events related to IT infrastructure elements (e.g. data centers, networks, servers, storage, operating systems, Internet security, and business applications).
- Support the SRE team 24x7x365 with escalated events and provide guidance to the team to support operations.
- Plan and support critical incident management, problem management, change management activities, KPI reporting, and CMDB management.
- Responsible for managing activities and projects for the Network team, DB team (Big Data, Cassandra), BI, etc.
- Identify, design, analyse, and enhance (automate) large-scale distributed systems.
- Coordinate work with Engineering teams to automate activities pertaining to network and application uptime and the swift resolution of all issues.
- Measure and monitor availability, latency, and overall system health, and automate the processes and activities that can improve system availability and response time.
- Scale systems sustainably through mechanisms like automation and evolve systems to improve reliability and velocity.
- Build and manage production Big Data (Cassandra) clusters.
- Use industry tools such as Netcool, ServiceNow, Moogsoft, Cacti, SolarWinds, Nagios, Splunk, ELK, Cloudera, PagerDuty, etc.
- Maintain and enhance infrastructure for Kafka and Pub/Sub (GCP).
- Apply a systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive.
- Provide input into processes and procedures for increasing reliability, reducing procedural errors, and managing change within the data centers.
- Implement extensive process-level and node-level monitoring and auto-healing of the entire cluster.
- Manage, provision, and service data center and cloud servers.
- Contribute to back-end services and their infrastructure system design.
- Responsible for identifying problem incidents and driving them to resolution.
- Responsible for driving RCA for high-priority incidents and working with the respective development teams on preventive measures.
Do you carry (Qualification):
- Experience in a Systems Engineering/SRE role in a large-scale environment.
- Experience troubleshooting incidents and problems and working with a team to resolve large-scale production issues.
- Advanced knowledge of at least one programming language (Python, Perl, Java, or PowerShell) is a must.
- Strong knowledge of Linux systems and Operations.
- Good understanding of standard networking protocols and components such as HTTP, DNS, TCP/IP, the OSI model, and load balancing.
- Good technical skills in Apache, Tomcat, Jetty, Memcached, Java, CDN technologies, network analysis tools, or equivalent.
- Experience managing and maintaining logging systems like Splunk.
- Extensive experience with Puppet/Ansible.
- Hands-on experience with monitoring tools such as ExtraHop, SolarWinds, Nagios, etc.
- Experience with Git, continuous integration and testing methods.
- Effective written and verbal communication skills are required.
- Strong problem-solving and troubleshooting skills are required.
- Willing to work in a 24x7x365 setup, involving night shifts.
- Certification in any cloud (AWS, Azure, GCP) is a must.
- Strong runbook automation experience is needed.
- Experience with runbook automation tools like Rundeck, Pliant, or Ayehu is an added advantage.
If you feel this is the right fit for your next move and it excites you, then we would love to have a conversation with you.