Site Reliability Engineer (SRE)

Software Engineering
Remote, United States; Boston, Massachusetts


Description

WHO ARE YOU? You welcome the opportunity to solve problems, proactively find remedies for technical issues, and do your best work in a fast-paced, deadline-driven production environment.
 
WHO ARE WE?  We are passionate, innovative, lifelong learners, and creative thinkers working to develop culturally authentic language learning products for K-12 schools and universities.    
 
WHAT IS THE ROLE ABOUT? As our Site Reliability Engineer (SRE), you will work within the DevOps team to improve the observability of our applications and cloud infrastructure by increasing the coverage and scope of our monitoring, creating on-call rotations, and introducing new tools to our site reliability tool stack. You will also troubleshoot outages and other incidents that affect the availability and reliability of our web-based service offerings, document those incidents, and participate in outage recovery.
 
WHY IS THIS EXCITING? Vista Higher Learning (VHL) is a growth organization. Our Engineering team creates, hosts, and maintains the systems used to deliver the online learning platform and associated digital assets.
 
IN THIS ROLE YOU WILL:          
  • Continually improve and enhance operational visibility into our services and infrastructure so that the team is aware of potentially serious issues before they result in an outage.  
  • Quickly respond to outages and site reliability incidents when they occur, while working closely with the team to troubleshoot and recover from such incidents.   
  • Proactively investigate and document site reliability incidents, reporting in detail on the factors that contributed to each incident and on the recovery process, and providing a thorough root cause analysis.
  • Research, suggest and implement new tools and services that will improve overall operational visibility and monitoring/alerting capabilities.   
  • Effectively take direction from your supervisor and from senior members of the team, and deliver solutions based on a clearly defined set of requirements.
  • Regularly engage with our security team to help remediate security-related vulnerabilities and incidents by investigating data such as logs and suggesting patches and fixes to improve our overall security posture.   
  • Effectively and efficiently manage on-call rotations to ensure 24/7 SRE coverage.  
  
YOU MUST HAVE (MINIMUM REQUIRED SKILLS & EXPERIENCE)       
  • HS diploma/GED minimum
  • 3+ years of experience working as a Site Reliability Engineer (SRE) in a high traffic production environment
  • 3+ years of experience working in a *nix environment and proficiency with command line tools
  • 3+ years of experience using cloud-based monitoring and analytics platforms such as Datadog
  • Well-developed technical knowledge and experience working hands-on with Amazon Web Services (AWS)
  • Familiarity with Docker and container orchestration tools
  • Well-developed experience troubleshooting complex networking issues
  • Hands-on experience using programming languages such as Ruby or Python
  • Well-developed and hands-on knowledge of relational databases and services such as AWS Relational Database Service (RDS)
  • Well-developed experience with Git and GitHub
  • Experience reading and analyzing logging and monitoring data related to the health of cloud services
  • Experience troubleshooting systems and networking issues with strong hands-on Linux experience 
  • Prior work experience in a DevOps or other Engineering role that involved documenting incidents and preparing Root Cause Analysis (RCA) or After Action Reports (AAR) 
  • Strong written, verbal and interpersonal communication skills, including the ability to articulate complex technical information to non-technical stakeholders
  • Willingness and availability to work on an on-call basis outside of normal business hours as needed

IDEAL IF YOU HAVE (PREFERRED SKILLS & EXPERIENCE)
  • Project Management experience
  • Experience with Kubernetes and/or Elastic Container Service (ECS)
  • Industry familiarity and experience in edtech

 

LOCATION: Remote/Hybrid-Remote (Boston)
This position can be fully remote within the continental United States or local to Boston, MA. If you are remote, travel to the Boston, MA office a few times per year will be required. If you are local to Boston, you may be required to work a hybrid schedule of 2 days in our Boston office and 3 days remote. Candidates must reside in the United States to be considered. Relocation is not available for this position.



Our benefits package includes life/health/dental/vision insurance, 401(k), educational assistance, commuter pass subsidies, generous employee referral bonuses, PTO and paid holidays.

Vista Higher Learning is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, age, color, religion, sexual orientation, gender identity, national origin, physical or mental disability, and/or protected veteran status or other characteristics protected by applicable law.

ACCESSIBILITY NOTICE: If you need a reasonable accommodation for any part of the employment process due to a physical or mental disability, please send an email to: [email protected]

Links to OFCCP EEO POSTER & SUPPLEMENT: https://www.dol.gov/ofccp/regs/compliance/posters/pdf/eeopost.pdf https://www.dol.gov/ofccp/regs/compliance/posters/pdf/OFCCP_EEO_Supplement_Final_JRF_QA_508c.pdf