Tech Lead Site Reliability Engineer
Being an effective SRE is as much about how you think as it is about your technical skills. The SRE role requires a mix of development and operations skills. Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.
First and foremost, an SRE is a software developer who builds things. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. On the SRE team, you are expected to manage the complex challenges of scale that are unique to Nice InContact, using your expertise in coding, systems, operating system complexity, and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives.
Generalists thrive in the SRE role. SREs ensure that the Nice InContact services, both our internally critical and our externally visible systems, have the reliability and uptime appropriate to users' needs. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.
An SRE answers the questions: How does something work? How can I make it run better? How do I know it’s working? How do I measure the performance? Then, once those questions are answered: How do I work across the organization's departments to do this?
- Bachelor's degree in a related field, or equivalent relevant work experience, which should include 1-3 years of provisioning automation in cloud environments within AWS and Azure. Demonstrable software development experience is also required.
- Experience managing services in the cloud (AWS/Azure) and their connections to enterprise infrastructure.
- Prior experience with Microsoft and Linux troubleshooting, coding/scripting, and higher-level languages (C#, Java, etc.)
- Can demonstrate the SOLID principles while writing code, follow code flow, and use version control (Git/GitHub). Familiar with building CI/CD pipelines with Jenkins.
- Understanding of common scripting tools such as PowerShell, Bash, and the AWS CLI
- Experience managing a full application stack with high availability requirements is preferred
- Experience managing both Microsoft and Linux servers and services
- Experience leveraging monitoring and alerting tools such as Grafana and Prometheus, InSpec testing for auditing, and Chef scripts for reliable builds.
- Understanding of containerization and orchestration with Docker/Kubernetes
- Strong written and verbal communication skills
- Writing is our primary means of communication, from pull requests and team chat to knowledge sharing and communicating changes. Excellent writing skills are crucial to success.
- What is toil? An understanding of toil and its characteristics, including a drive to measure and eliminate it.
- What is an error budget?
- What are SLIs and SLOs?
- Continued improvement of tech skills is a requirement. You should be learning a new tech skill each quarter and seeking industry certifications to establish your level of knowledge.
- Required: A self-motivated individual with a track record of beginning and continuing tasks without external prodding or extra rewards.
- Set and maintain attainable goals with your manager
Within the duties are three main areas of focus: Reliability, Monitoring/Alerting, and Service
- Collaborate and contribute with other enterprise teams
- Communicate availability to the team and manager
- Monitoring: the site reliability and health of our systems. Learn to classify problem areas as critical, major, or minor.
- Alerting: the critical problems and errors of systems and their processes.
- Metrics: gathering data for troubleshooting of all kinds. Exposing application metrics to managing/monitoring systems.
- Building new features and services is a big part of this role. We are continually developing and implementing new ways to support our teams, understand our customers’ needs, and become experts in site reliability.
Monitoring and Alerting
- Developing and deploying new tools to support our systems and services in an automated fashion
- Hardening our systems where applicable
- Supporting the deployment of new product services
- Use software development approaches to operations. You should have a breadth of experience in software development and operations, and be actively practicing site reliability principles.
NICE is committed to providing an environment based on equal opportunity for all qualified applicants and employees. It is the policy of NICE to afford equal employment opportunities to qualified individuals, regardless of age, race, color, creed, religion, citizenship, ancestry, national origin, sex, gender, pregnancy, mental or physical disability, marital status, veteran status, service in the Armed Forces, sexual or affectional orientation, atypical hereditary cellular or blood traits, genetic information, status as a victim of domestic or sexual violence, and/or any other status protected by any applicable federal, state and/or local statute or regulation.