Site Reliability Engineer, Principal
We’re Progress – we offer leading solutions for building and deploying tomorrow’s applications quickly and easily. We are bold, forward-thinking innovators who build products that work and care about our customers. We invent and reinvent every day, work together as one, value and respect each other, and cheer our wins. Join our Core C/ Chef Product Group as a Site Reliability Engineer, Principal in Bangalore.
In our Core C/ Chef Product Group, we develop the world's best products for managing applications and infrastructure at scale, and we deploy them to solve real problems in all kinds of industries. We get to work with the latest in cloud and container technologies. We have the opportunity not just to follow but to shape best practices. Our platform is used to enable billions of people around the world to chat, fly, present, bank, game, shop, and learn. Chances are the applications and devices you use every day have infrastructure built, deployed, secured, and run with our code.
We are seeking a highly motivated, results-oriented individual with strong Site Reliability Engineering skills and experience in cloud technologies to join our platform engineering team. As a Site Reliability Engineer, you will play a lead role in designing, implementing, and supporting the platform for Chef Cloud services. You will also have a key influence on our future processes and platform design.
What you’ll do:
- Build, operate, and maintain a platform for Chef Cloud services. This will include technologies such as AWS services (ECS, EKS, S3, and more), Kubernetes, service mesh (Linkerd or Envoy), Postgres/RDS, graph databases, API gateways, authentication services, third-party integrations, and more.
- Collaborate on achieving the best design/architecture for our systems and infrastructure.
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Implement modern systems observability solutions including monitoring, alerting, metrics, logging, and APM & distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Be on-call for services that the SRE team owns.
- Practice sustainable incident response and post-incident analysis by acting as an incident manager. You’ll follow our existing incident management process and recommend improvements to that process.
- Mentor the team.
Who you are (Our ideal candidate will have some or all of these qualifications):
- You have a Bachelor's degree in Computer Science or related field and 7+ years of relevant experience (or equivalent combination of education and experience).
- You have a solid understanding of and experience with configuration management and compliance automation.
- You have an expert-level understanding of and at least 2 years of working experience with containerization using Docker and Kubernetes in a production environment.
- You’re comfortable deploying and operating services using AWS technologies and have an expert understanding of the various offerings available.
- You’ve built and supported systems using cloud-native (CNCF) technologies at scale.
- You have working knowledge of Terraform or similar tools.
- You are interested in designing, analyzing, and troubleshooting large-scale distributed systems.
- You understand what it means to operate infrastructure as code, and have experience developing services and automation to do so. Chef knowledge would be a plus.
- You have a great ability to debug and optimize code and automate routine tasks to eliminate toil.
- You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, initiative, grit, and drive.
- You have designed and implemented applications and systems that scale, are resilient to failure, and are observable.
What we offer in return is the opportunity to join a talented team of bright people and to also enjoy:
- 30 days of earned leave plus an extra day off for your birthday, and various other types of leave such as marriage, casual, maternity, and paternity leave
- Premium group medical insurance for the employee and 5 dependents, personal accident insurance coverage, and life insurance coverage
- A modern office with a well-equipped gym onsite, and free access to yoga and Zumba classes led by professional trainers
- Professional development reimbursement
- Interest subsidy on vehicle or personal loans
Together, We Make Progress
Progress is an inclusive workplace where opportunities to succeed are available to everyone. As a multicultural company serving a global community, we encourage a wide range of points of view and celebrate our diverse backgrounds. Our unique combination of perspectives inspires innovation, connects us to our customers and positively affects our communities. It is only by working together and learning from each other that we make Progress. Join us!