Senior Site Reliability Engineer
Job ID 25-281
Come Join Our Passionate Team! At Barracuda, we make the world a safer place. We believe every business deserves access to cloud-enabled, enterprise-grade security solutions that are easy to buy, deploy, and use. We protect email, networks, data, and applications with innovative solutions that grow and adapt with our customers’ journey. More than 200,000 organizations worldwide trust Barracuda to protect them, including from risks they may not even know they face, so they can focus on taking their business to the next level.
We know a diverse workforce adds to our collective value and strength as an organization. Barracuda Networks is proud to be an employer that follows all applicable national, state, and local laws about nondiscrimination and equal opportunity regardless of race, gender, religion, sex, sexual orientation, national origin, or disability.
Envision yourself at Barracuda
We are seeking a passionate, experienced Senior Site Reliability Engineer to join our Email Protection organization. We hire strong, collaborative leaders to inspire and enable teams to succeed in delivering quality software.
The right candidate will have extensive experience in Site Reliability Engineering, ensuring the highest levels of uptime for the Email Gateway Defense (EGD) service. You will work on automation and platform design, manage the day-to-day operations of the EGD system in production, and fix application and infrastructure issues, deploying those fixes to eliminate downtime.
Our products are built using modern technologies and languages and deployed to AWS (Amazon Web Services) via a mature CI/CD pipeline. Performance, monitoring, and observability are first-class citizens in our ecosystem. Some products are on their journey ‘to the public cloud’, and running these applications without downtime is key.
What you bring to the role:
- Experience with developing, building, securing, and operating sophisticated and highly automated Cloud infrastructure in AWS is a must
- Prior success in automating and maintaining an efficient, large-scale, real-world production environment
- Extensive experience with orchestrating cloud infrastructure automation using tools like Terraform, CloudFormation, and Crossplane
- Development experience with continuous integration and continuous delivery (CI/CD) and automation tools such as GitHub, GitHub Actions, Jenkins, Packer, Ansible, Puppet, etc.
- Working knowledge of deployment patterns and strategies, including blue/green, canary, rolling deployment, draining, etc.
- Comprehensive experience with containers and container orchestration tools (Docker, Kubernetes) in a Cloud Environment (AWS EKS)
- The ability to design, author, and release code in languages like Python, Go, or Ruby
- Advanced Operating System skills with knowledge of Linux internals
- Extensive experience working with observability and reliability tools like New Relic, Elastic APM (Application Performance Monitoring), CloudWatch, Prometheus, and Grafana
- Experience with data pipeline engineering and tools like Databricks, Apache Spark, Kafka, and DataStage
- Strong debugging skills with a systematic problem-solving approach to identify complex problems
- Ability to communicate effectively both verbally and in writing
- Self-awareness and a true teamwork spirit
- Bachelor's degree in a technology field or equivalent work experience
- Minimum of 5 years of experience in a Site Reliability Engineer (SRE) or similar role
What you will be working on:
- Write clean, high-performance, and well-tested infrastructure code with a focus on reusability (Puppet/Ansible/Terraform/CloudFormation/Crossplane/Packer)
- Recommend and implement infrastructure best practices in alignment with standard SRE principles and provide guidance on system performance and throughput expectations
- Troubleshoot issues across the entire stack: hardware, software, application, and network
- Establish, maintain, and adhere to Barracuda technical standards, policies, and procedures
- Build and enhance our observability and reliability systems
- Participate in an on-call rotation
- Collaborate with internal groups to design, develop, and deploy manageable, scalable, and robust services
- Perform root cause analysis (RCA) and partner with engineering and operations teams across the organization to roll out fixes
- Provide technical guidance and mentorship to other engineers on reliability and scalability best practices, tools, and methodologies
What you will get from us:
A team where you can voice your opinion and make an impact, and where you and your experience are valued. Internal mobility: there are opportunities for cross-training and the ability to attain your next career step within Barracuda. Equity, in the form of non-qualifying options.
#LI-hybrid