Site Reliability Engineering Manager - ObjectRocket
ObjectRocket helps customers solve database challenges so they can focus on building their applications. We use Kubernetes to build and manage a database-as-a-service platform that supports open-source database technologies. Our team guides clients to the right data solution and acts as an extension of their team in managing their database infrastructure.
We are looking for a Site Reliability Engineering (SRE) Manager who exemplifies the attributes of a leader, mentor, decision maker, and engineer. Our SRE team is tasked with adding nines to our current database-as-a-service platform by applying software engineering practices and discipline to consistently improve our processes and systems. The SRE team at ObjectRocket is a group of engineers who solve complex operational problems at scale. We balance the tactical and the strategic, weighing what needs to be fixed now against designing sustainable software, to ensure the growth of our people, platform, and systems. We work with other teams to provide guidance throughout the lifecycle of building, shipping, and operating the ObjectRocket platform. You’ll be working on AWS, Azure, GCP, and OpenStack to deliver solutions to our customers, mentoring other talented engineers in Kubernetes best practices, and solving problems in the realm of App Transformation, Big Data, and IoT.
- Manage and report team success against Key Performance Indicators (KPIs) via regular operations and business reviews
- Serve as the project manager for SRE initiatives and drive them to completion while clearly communicating status to the rest of the organization
- Roll up your sleeves and contribute technically to the team, when applicable, to improve our products
- Provide leadership and prioritization to the SRE team and help make the key trade-offs required to keep the team working most effectively
- Foster and grow a healthy and happy team of engineers: hire effectively, provide team members with compelling career paths, and help maintain our culture
- Take a critical leadership role in our incident management process, ensuring timely resolution and that proper actions are taken in response
- Act as a key stakeholder in our post-incident retrospectives, contributing to customer incident reports and ensuring that remediation steps are completed on target and on time
- Build relationships with product and support to influence the priorities of other teams as well as gather feedback on SRE team priorities
- Ensure on-call and time-zone coverage of engineering resources
- Manage customer escalations, interacting directly with our largest customers as needed
- Participate in the weekly leadership on-call rotation and travel occasionally
- You understand the value of shipping quickly and of software craftsmanship, and have the judgment to know when to apply each.
- Experience leading, motivating, and mentoring fast-moving, highly skilled infrastructure engineering teams; adept at navigating and improving both social and technical systems.
- Previous experience running production-grade software at scale and an appreciation for the complex and emergent behaviors inherent to distributed systems.
- You are comfortable identifying technical and process-related shortcomings, can lay out a vision to fix them, and aren’t afraid to institute change through experimentation.
- BS degree in Computer Science or related technical field, or equivalent practical experience
- Experience with all the common technologies in a modern cloud-based SaaS offering: Jenkins, Git, Atlassian Jira and Confluence, Linux, DNS, email, containers, log analysis, monitoring
- Direct experience with many of the following:
- Container orchestration systems such as Kubernetes, Mesos, Swarm, or similar
- Service discovery systems such as etcd, ZooKeeper, Consul, or equivalent
- Networking systems like Envoy, Istio, flannel, or equivalent
- QC and monitoring systems like Twistlock, SonarQube, Prometheus, cAdvisor, Datadog
- Linux system administration
- AWS, Azure, or GCP
- RDBMS, NoSQL, and Big Data solutions
- Familiarity with Git, Jenkins, and similar CI/CD tools
- Working knowledge of languages like Go, Python, Ruby, or Scala
- Understanding of platform-level concerns, such as configuration management, network request routing, and blue/green or canary deployments
Why Work Here?
- People like it here. We do fun stuff like Taco Thursday, random potlucks, chili cook-offs, and a Thanksgiving meal. We also celebrate transparency by hosting weekly 22-minute meetings where everyone gets an update on what’s going on in their team. And we hold “Festivus,” a monthly wrap-up of how the business is doing, along with monthly goals.
- Your work will make a difference. If you are the type of person who wants to know that what you are doing matters, you’ll like our data-driven approach.
- We make money in a way you can be proud of. We provide Database-as-a-Service solutions for all types of customers, ranging from plucky startups to seasoned companies. We want them to be successful because it helps us grow and deliver consistently positive support. Customer service is our number one priority.
- If you are genuinely interested in helping build a next-generation product that people love to use, that treats them with respect and helps them achieve their goals, and you want to help us continue to build a sustainable business, we look forward to hearing from you.