Senior Site Reliability Engineer II
Description
At Shutterfly, we make life’s experiences unforgettable. We believe there is extraordinary power in self-expression. That’s why our family of brands helps customers create products and capture moments that reflect who they uniquely are.
Shutterfly is looking for a Senior Site Reliability Engineer to join our team. Shutterfly is undergoing a comprehensive consumer website re-platforming effort, with the Site Reliability Engineering (SRE) team playing a pivotal role in building shared infrastructure and ensuring future efficiency and supportability. The Senior Site Reliability Engineer II role is responsible for ensuring the reliability, availability, and performance of Shutterfly’s consumer systems. This position requires deep technical expertise in performance troubleshooting, system optimization, and automation to help maintain resilient, scalable, and cost-efficient platforms. As a senior member of the SRE team, you will collaborate closely with development and operations teams, contribute to automation and observability solutions, and serve as a subject matter expert during incidents.
What You'll Do Here:
- Perform advanced performance analysis and troubleshooting across distributed systems to ensure optimal availability, scalability, and cost efficiency.
- Implement and maintain monitoring, alerting, and observability solutions to provide proactive visibility into application and infrastructure health.
- Partner with development teams to influence service design and architecture so that new features meet high standards for reliability and scalability.
- Participate in incident response, including root cause analysis and long-term reliability improvements.
- Contribute to capacity planning, cost optimization, and performance tuning of large-scale systems.
- Build and maintain automation and tooling that reduces manual effort, accelerates delivery, and minimizes human error.
- Explore and apply AI/ML technologies (e.g., anomaly detection, predictive scaling, automated alerting) to enhance SRE practices.
- Share expertise with peers by documenting best practices, solutions, and troubleshooting methodologies.
- Collaborate across infrastructure, development, and business teams to align on standards and reliability goals.
- Provide technical depth and decisive action during critical incidents.
The Skills You'll Bring:
- 5–7+ years of experience in software engineering, SRE, or DevOps roles supporting large-scale, highly available systems.
- Strong skills in performance troubleshooting, root cause analysis, and distributed system optimization.
- Proficiency in at least one programming language (Python, Go, Java, or similar), with the ability to write production-quality code.
- Hands-on experience with observability platforms (e.g., Splunk, Datadog, SignalFx, Prometheus, OpenTelemetry).
- Strong knowledge of AWS services, cloud deployment models, and cost optimization strategies.
- Experience with Infrastructure as Code (Terraform, CloudFormation) and configuration management (Ansible, Chef, Puppet).
- Solid understanding of distributed systems concepts (scalability, high availability, fault tolerance).
- Experience in incident management and driving operational improvements.
- Exposure to AI/ML or AIOps tools for anomaly detection, predictive analytics, or automated incident response (preferred but not required).
- Effective communication skills, with the ability to work across engineering and business teams.
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Supporting a diverse and inclusive workforce is important to Shutterfly not only because it directly reflects our value of Embracing our Differences, but also because it’s the right thing to do for our business and for our people. We welcome all applicants and evaluate them based on their qualifications, without regard to age, race, creed, color, national origin, ancestry, marital status, affectional or sexual orientation, gender identity or expression, disability, nationality, sex, or any other characteristic covered by law. Learn more about our commitment to Diversity, Equity, and Inclusion on our Career Site.
This position will accept applications on an ongoing basis until filled.
The compensation package for this role is based on multiple factors, such as job level, responsibilities, location, and candidate experience. The base pay ranges included below are specific to the locations listed, and may not be applicable to other locations.
California: [$106,000-151,000]
Connecticut and New York: [$106,000-138,250]
Colorado, Illinois, Minnesota and Washington: [$106,000-128,000]
Nevada: [$99,750-138,250]
Maryland and New Jersey: [Min Range Zone 2-138,250]
Hawaii: [$99,750-112,750]
This position may be eligible for a bonus incentive, health benefits, a 401K program, and other employee perks. More details about our company benefits can be found at https://shutterflyinc.com/benefits/.
This opportunity can be remote, but candidates must reside in a state in which Shutterfly is registered to do business. This includes all US states except the District of Columbia, North Dakota, Mississippi, Rhode Island, Vermont, and Wyoming.
#SFLYTechnology