Director, Site Reliability Engineering

Technical Operations Open, Any


Position at Smarsh

If you are passionate about working with vast amounts of data, obsess over performance and code quality, and have a desire to grow professionally while having a significant impact on a business, then Smarsh is the right company for you. Our Professional Archive product is constantly growing, and with that growth, we are rapidly expanding our teams to increase our rate of innovation and ensure an excellent customer experience. This product ingests data from our customers 24/7, and working here, you will deal with over 20 PB of raw data, database tables and Solr indexes with tens of billions of records, and an ingestion pipeline that can process thousands of communications per second.

The right engineering leader for Smarsh is someone with a strong technical background, having worked on a variety of products and projects as an engineer. We value leaders with a track record of success, but who are also familiar with failure, having learned valuable lessons from both. Our engineering leaders still consider themselves engineers and have a deep-rooted passion for engineering excellence, which they foster within their teams. We look for leaders who have a desire to review code, discuss design, tooling, and architecture with engineers, and view themselves as both an engineering enabler and mentor for their teams.

If you have ever dreamed of making an enormous impact through your contributions or aspired to take on additional responsibilities and grow professionally, then Smarsh would love to have you join us. We are large enough to have the benefits of a big company but small enough that every engineer matters to us.

To learn more about us, visit www.smarsh.com

Essential Functions 

Manage the day-to-day operations of our SaaS and on-prem platform, using a data-driven approach to ensure that health and performance remain within our SLAs and SLOs while scaling our data center, public cloud, and private cloud infrastructure.

Manage our incident response and escalation processes, including stakeholder communication, and demonstrate continuous improvement by incorporating feedback into those processes. Demonstrate full accountability and ownership for platform disruptions, and manage incidents through to complete resolution.

Adopt SRE/DevOps best practices and apply them to critical initiatives and transformation activities across teams. Act as a strategic thought leader in this space for the broader engineering organization, and coach the team to develop their skill sets through knowledge sharing and documentation, acting as a role model for behaviors and attitude.

The Director of Site Reliability Engineering will lead a large, globally distributed, diverse team to provide around-the-clock coverage for the platform.

Present to both internal and external audiences on a wide range of topics, including the roadmap, platform health (key metrics), and all aspects of incidents.

Lead the forecasting, planning, and execution of work to improve the scalability, availability, and reliability of the platform, taking security and observability into consideration, and be willing to get hands-on at times to lead by doing.

Bring an unrelenting drive to increase observability and alerting throughout the platform while guiding the engineering teams toward a more automated, autoscaling, and self-healing architecture.

Lead and manage a growing team of Site Reliability and DevOps engineers, fostering a strong team culture, continuously monitoring morale, and consistently delivering quality outcomes.

Champion the move toward Agile/Scrum for the SRE and DevOps organization, creating greater visibility into the team's day-to-day activities.

Experience

Our engineers use a mix of Macs and PCs, our code lives almost entirely in Git, and our architecture makes significant use of queueing and caching through a mix of technologies (SQL, Memcached, Redis, ActiveMQ). We are transitioning to a streaming architecture (Kafka), so experience with data streams and events is beneficial. Our code base is split roughly evenly between Java and C#, with a mix of legacy and modern frameworks and deployments. The platform is evolving away from a monolith, and we have fully embraced the move to containers (Docker) and orchestration (Kubernetes, EKS) as we refactor. We have a mix of CI/CD systems (CircleCI, Concourse, Jenkins, Travis CI) while we continue down a path toward tooling convergence. The server environment is split between Windows and Linux systems, with a mix of configuration and infrastructure management tools (Terraform, Ansible, ZooKeeper).

We are looking for people whose work experience and engineering interests align well with our environment and encourage people to apply even if they aren’t strong in all the technologies we use daily.

Work Authorization/Security Clearance (if applicable)

We are not providing sponsorship at this time.

About Smarsh

Smarsh is the leader in communications compliance, archiving, and analytics. We empower our customers to manage the risk and unleash the intelligence in their digital communications. Our growing community of over 6,500 customers in regulated industries counts on Smarsh every day to help them spot compliance, legal, or reputational risks before those risks become regulatory fines or headlines.

We provide compliance across the broadest set of communications channels, with insights into what's being captured. Smarsh customers manage over 500 million daily conversations across 80+ unique communication channels. Customers include the top 10 U.S., top 8 European, top 5 Canadian, and top 3 Asian banks. The Smarsh advantage is that customers stay ahead of compliance requirements and uncover patterns and relationships hidden within their data. At Smarsh, we've been helping our customers manage new forms of communication since 2001. We work closely with regulators, including the SEC, FINRA, IIROC, the PRA, and the FCA, and with our customers to ensure that they understand the capabilities of today's technology and that our platform meets their most stringent requirements.

Relentless innovation has fueled our journey to consistent leadership recognition from analysts like Gartner and Forrester, and our sustained, aggressive growth has landed Smarsh in the annual Inc. 5000 list of fastest-growing American companies since 2008. 


#LI-Remote