Cloud Site Reliability Engineer

Software Engineering · Yarmouth, Maine, United States


Description

In this role, you will respond to production incidents and support large-scale, containerized cloud applications. You will partner closely with engineering, product support, and infrastructure teams to strengthen incident response, reduce operational toil through automation, and mitigate reliability risks before they impact customers.
Your work will directly support SLA commitments and enable our ERP solutions to scale confidently while delivering a dependable, high-quality customer experience.
Responsibilities
  • Implement and enhance observability solutions for AWS EKS-based ERP systems.
  • Monitor system health, performance, and availability, ensuring that alerts are actionable and that monitoring aligns with SLOs and SLAs.
  • Implement incident management tooling and best practices, and partner with engineering and cloud operations teams during incident response to drive timely resolution and clear communication.
  • Conduct root cause analysis and lead cross-functional incident retrospectives with engineering, cloud operations, and product support, helping identify preventative improvements that reduce recurring issues and reliability risks and improve overall system resiliency.
  • Develop automation and operational tooling to reduce manual effort and operational toil.
  • Participate in a structured on-call rotation during primary business hours.
Qualifications
  • 2–3+ years of experience in Site Reliability Engineering, Cloud Operations, DevOps, or Software Engineering in cloud-based environments.
  • Hands-on experience with monitoring and observability tools such as Datadog and CloudWatch.
  • Experience participating in incident response processes and using tools like PagerDuty and JSM Operations.
  • Working knowledge of Kubernetes (EKS experience preferred), with experience supporting containerized applications.
  • Experience with Infrastructure as Code tooling such as Terraform.
  • Proficiency in at least one scripting or programming language such as Python, Bash, or C# (.NET Core).
  • Experience with CI/CD practices and modern DevOps methodologies.
  • Demonstrated curiosity, strong problem-solving skills, and a desire to deepen expertise in reliability engineering.
  • Clear communicator who thrives in collaborative, cross-functional environments.
  • Bachelor’s degree in Computer Science or a related field.