Sr. Site Reliability Engineer
At Shutterfly, we’re all about people: bringing them together, making them feel welcome, and connecting them to experiences. We make our customers’ memories last a lifetime by capturing, preserving, and sharing them through photography and personalized products. Through our family of brands, trend-setting products, cutting-edge technology, and best-in-class customer service, we help our customers, and each other, share life’s joy.
Shutterfly is looking for a Senior Site Reliability Engineer to help us build a brand new SRE team from the ground up. Be ready to solve for what is next at Shutterfly.
What You'll Do Here:
- Develop and review designs, create platforms and frameworks, capacity plan, and chaos
- Improve frameworks and services with root cause analysis, blameless postmortems, and follow-through to make sure the same incident never happens again
- Work cross functionally to understand the full stack and recommend areas for
- Maintain and improve monitoring services, metrics, and reporting for quick issue detection and actionable
- Participate in a shared on-call rotation for high severity
- After incidents, drive discovery and implement automated self-healing solutions
The Skills You'll Bring:
- 10+ years of relevant work experience in a production environment
- 4+ years of experience as an SRE
- 2+ years of experience working in the cloud, preferably in AWS
- Experience with large scale distributed systems
- Experience with a metrics monitoring platform (e.g., SignalFx, Datadog)
It's Helpful But Not Required To Have:
- Experience with infrastructure configuration and automation processes and tools: Terraform, Puppet, Ansible
- Experience with security in the cloud
- Experience with log aggregation solutions such as Splunk
If this aligns with your career goals, skills, and experience, we want to work with you!