Senior Site Reliability Engineer
The primary responsibility of this position is to perform operations engineering tasks such as:
Provide 24x7 on call support.
Manage and respond to alert notifications via PagerDuty, Slack and
Provide application / infrastructure support, troubleshooting, and
remediation for all severity issues.
Manage incident lifecycle, communication and post-mortem analysis.
Define and measure SLOs, SLIs, SLAs, and error budgets.
Reduce toil through automation and process improvement.
Develop and maintain SLO and APM performance-driven dashboards.
Identify and implement resiliency patterns in software applications.
Identify and monitor performance metrics based on multiple sources of
data to increase overall application reliability.
The role focuses on complex issues: identifying and diagnosing them, and
recommending engineering solutions.
Collaborate closely with Product, Development, Quality and Ops teams to ensure
that designed solutions meet non-functional requirements such as availability, performance, cost, security, and maintainability, and deliver speed to market and quality for Engineering departments.
Investigate issues, recommend and test fixes, and coordinate issue resolution within technology and with external vendors.
Evangelize Site Reliability Engineering best practices to improve system reliability across the organization.
Required / Desired Skill Set
Application Performance Monitoring, alerting and dashboards
Experience with Datadog
Creation, modification and management of performance dashboards
Alert management, response and configuration
Synthetic transaction creation and maintenance and
Event analytics and correlation
Metric reporting, aggregation and analytics.
Experience with Elasticsearch, Logstash, and Kibana (ELK).
• Execute queries to view event data for troubleshooting,
performance analytics and knowledge gathering
Cloud computing, cloud architectures, networking, data storage
Understand CI/CD pipelines for software delivery and how to identify
problems from source code.
Linux and Unix operating systems, CLIs, Docker, Kubernetes
Security engineering and response
Postmortem analysis and documentation
Datadog - Custom metrics collectors, Micrometer.
Datadog - Custom Java instrumentation
Datadog - Centralized configuration management
Datadog - API integration
Kibana - dashboard development and maintenance
AWS CloudWatch metric analysis and dashboard management.
New Relic performance monitoring, metric analysis and dashboard
Experience with RDBMS and NoSQL technologies such as MySQL,
MongoDB, and Elasticsearch
Experience troubleshooting production workloads using technologies such
as log aggregation systems and APM tooling
Ability to quickly identify problems and determine solutions for cloud-based
Java application infrastructure based on alerting, human
escalations and behavior changes in performance metrics.
Analytical background, in the areas of user experience, data integrity and
Experience with proactive troubleshooting, event correlation and
Ability to read and troubleshoot Java software code, exceptions and stack traces
Networking and how to resolve networking-related incidents.
Experience supporting data streaming technologies
Develop and maintain support run-books and related documentation
Incident trend reporting and analytics
Postmortem analysis and follow-up
Incident management / ITIL process experience.
Leading incident calls and driving to resolution.
Mentoring junior engineers on processes and methodologies for
troubleshooting and diagnostic data collection
Platform and technology experience
Datadog APM, Infrastructure, Synthetics
Java Application technologies, JVM and JMX, Java Flight Recorder,
Thread Dump, Heap Dump
Docker, containers, Kubernetes, Elastic Kubernetes Service
Kafka, RabbitMQ messaging
Linux - RHEL, CentOS and Ubuntu