Senior Site Reliability Engineer

Engineering Bengaluru, Karnataka Full-Time



  • The primary responsibility of this position is to perform operations engineering tasks such as:

    • Provide 24x7 on-call support.

    • Manage and respond to alert notifications via PagerDuty, Slack and


    • Provide application / infrastructure support, troubleshooting, and

      remediation for all severity issues.

    • Manage incident lifecycle, communication and post-mortem analysis.

    • Define and measure SLOs, SLIs, SLAs, and error budgets.

    • Reduce toil through automation and process improvement.

    • Develop and maintain SLO and APM performance-driven dashboards.

    • Identify and implement resiliency patterns in software applications.

    • Identify and monitor performance metrics based on multiple sources of

      data to increase overall application reliability.

  • The role focuses on complex issues: identifying and diagnosing them, and

    recommending engineering solutions.

  • Collaborate closely with Product, Development, Quality and Ops teams to ensure

    that designed solutions meet non-functional requirements such as availability, performance, cost, security and maintainability, and deliver speed to market and quality for Engineering departments.

  • Investigate issues, recommend and test fixes, and coordinate issue resolution within Technology and with external vendors.

  • Evangelize Site Reliability Engineering best practices to improve system reliability across the organization.

    Required / Desired Skill Set
    Application Performance Monitoring, alerting and dashboards


  • Experience with DataDog

    • Creation, modification and management of performance dashboards

    • Alert management, response and configuration

    • Transaction tracing

    • Synthetic transaction creation and maintenance and


    • Event analytics and correlation

    • Metric reporting, aggregation and analytics.

  • Experience with Elasticsearch, Logstash, and Kibana (ELK).

    • Execute queries to view event data for troubleshooting,

      performance analytics and knowledge gathering

  • Cloud computing, cloud architectures, networking, data storage,

    observability, etc.

  • Understand CI/CD pipelines for software delivery and how to identify

    problems from source code.

  • Linux and Unix operating systems, CLIs, Docker, Kubernetes

  • Security engineering and response

  • DevOps principles

  • Incident Management

  • Postmortem analysis and documentation


  • DataDog - Custom metrics collectors, Micrometer.

  • DataDog - Custom Java instrumentation

  • DataDog - Centralized configuration management

  • DataDog - API Integration

  • Kibana - dashboard development and maintenance

  • AWS CloudWatch metric analysis and dashboard management.

  • New Relic performance monitoring, metric analysis and dashboard


    Production Support

  • Required

    • Experience with RDBMS and NoSQL technologies such as MySQL,

      MongoDB and Elasticsearch

    • Experience troubleshooting production workloads using technologies such

      as log aggregation systems and APM tooling

    • Ability to quickly identify problems and determine solutions for cloud-

      based Java application infrastructure based on alerting, human

      escalations and behavior changes in performance metrics.

    • Analytical background, in the areas of user experience, data integrity and


    • Experience with proactive troubleshooting, event correlation and


    • Ability to read and troubleshoot Java software code, exceptions and stack


    • Knowledge of networking and how to resolve networking-related incidents.

    • Experience supporting data streaming technologies

    • Develop and maintain support run-books and related documentation

    • Incident trend reporting and analytics

    • Postmortem analysis and follow-up

  • Desired

    • Incident management / ITIL process experience.

    • Leading incident calls and driving them to resolution.

    • Mentoring junior engineers on processes and methodologies for

      troubleshooting and diagnostic data collection

    Platform and technology experience


  • DataDog APM, Infrastructure, Synthetics

  • Java Application technologies, JVM and JMX, Java Flight Recorder,

    Thread Dump, Heap Dump

  • AWS EC2

  • Docker, Containers, Kubernetes, Elastic Kubernetes Service

  • Kafka, RabbitMQ messaging

  • Linux - RHEL, CentOS and Ubuntu

  • MongoDB, MySQL

  • PagerDuty


  • Flink


  • Cloudflare

  • Terraform

  • Harness

  • Jenkins

  • Python

  • Ansible

  • Bash