Principal Site Reliability Engineer
Roles and Responsibilities:
- 10+ years of experience in an application production support environment, including an SRE role, with the ability to solve complex problems.
- Responsible for reliability of our end-to-end data infrastructure.
- Identify root causes of problem tickets and implement permanent fixes so that similar issues do not recur.
- Collaborate with product development teams, such as Engineering and Sustaining, to design systems for better resilience and maintainability, so that we pre-empt issues before we encounter them.
- Tackle issues across the entire stack - hardware, software, application and network.
- Analyze and troubleshoot application issues in a timely fashion, and help the incident support team as necessary to restore services and prevent recurrence.
- Assist in maintenance and upgrades of existing software applications.
- Train and mentor other engineers when required, write documentation, and build the knowledge base.
- Work with various teams in planning, prioritizing and executing assigned tasks/projects within deadlines.
- Assist in risk assessment and mitigation activities.
- Be ready to work nights and weekends on an on-demand basis.
- Ability to multi-task and manage multiple projects/tasks effectively within deadlines.
- Automate existing manual tasks by building tools and services aimed at self-service and auto-healing, so that we achieve order-of-magnitude gains in efficiency and effectiveness.
- Act as the accountable technical owner for SRE readiness of new modules, covering monitoring, adequate technical onboarding training, preparedness to handle incidents, and continuous optimization of existing modules.
- Adept on the Linux platform.
- Experience with Docker, Kubernetes, and AWS services such as S3, SQS, Lambda, EC2, and EKS, and expert understanding of best practices.
- Experience in debugging backend (such as Java) and frontend (such as Node.js) applications: memory utilization and analysis of thread dumps and heap dumps.
- Exposure to networking, load balancers, and message queues, and strong database fundamentals (preferably PostgreSQL, plus knowledge of Oracle and NoSQL databases such as MongoDB).
- Experience working in programming languages such as Python, shell, Perl, and Java.
- Experience in handling issues across the entire stack - hardware, software, application and network.
- Deep knowledge of and experience in incident and problem management.
- Knowledge of JBoss, Tomcat, Spring Boot, etc.
- Experience with monitoring tools such as Splunk, New Relic, Sensu, and Foglight.
- Knowledge of incident, problem, and change management.
- Knowledge of APM tools such as New Relic.
- Knowledge of IBM MQ, Kafka, Redis, and Elasticsearch.
- Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
- Ability to debug and optimize code and automate routine tasks to eliminate toil.
- Raise current monitoring by two notches and ensure the operations team can detect all critical issues before customers do.
- Have a systematic problem-solving approach, coupled with strong communication and analytical skills and a sense of ownership, initiative, grit, and drive.
- Knowledge of design patterns (microservices, AWS architecture patterns, enterprise application design patterns).
- Exposure to CI/CD, GitLab, Jira, and ServiceNow.
- Exposure to UI technologies.
- Certifications such as ITIL or AWS are good to have.