Reliability Engineer (DevOps)
As a Reliability Engineer, you will design, build and operate features and services that make Xi, Nutanix's cloud services, secure, reliable, completely elastic, scalable and self-healing, and you will deliver reliable, high-performance services and features. Nutanix requires engineers with exceptional expertise and boundless creativity.
- Work in concert with engineering teams to evolve services for better scalability, reliability and development velocity
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health, with a focus on improving reliability
- Practice sustainable incident response and blameless postmortems
- Define and develop software for tasks associated with developing, designing and debugging applications
- Develop tools to improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments
- Participate in a 24x7 on call rotation
- Highly skilled in one or more of these domains: Infrastructure as Code tools (Docker, Terraform, Puppet, Helm), monitoring tools (Prometheus, Datadog, New Relic), container orchestration tools (Kubernetes, Docker), database technologies (Cassandra, Postgres), CI/CD tools (Jenkins, Spinnaker)
- Proficient in GCP and AWS cloud platforms
- Experience automating tasks with scripting languages such as Python, Go and/or Shell
- Deep understanding of service metrics and alarms, gained through developing dashboards, service KPIs and alarming systems
- Knowledge of Apache Kafka, Druid
- Strong understanding of Linux operating systems
- Familiar with the setup and architecture of queuing, caching and service mesh systems
- Experience in applying SRE principles and best practices
- 3-8 years of relevant work experience
- Experience working in an operational environment with mission critical tier-one services with associated on-call support
- Experience designing monitoring, logging and reliability processes for systems at scale
How do I know if this role is for me?
- Do you like thinking about large scale problems that have a lot of moving parts?
- Do you like thinking about how to make large systems more reliable?
- Are you okay with working on software that will likely never be overtly seen by an external user?
- Do you enjoy the process of diagnosing and fixing a problem?
- Do you like looking through metrics and logs as if it were a treasure hunt?
- Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel) and system-level functionality (e.g. ssh, proc, cron, swap)?
- Are you comfortable with the idea of being “on-call”, where you are likely to be in a high-stakes scenario in which something needs to be fixed?
- Are you able to stay calm under pressure?
- Do you approach problems in a logical, process-oriented way?
- Are you comfortable attempting a problem that has never been solved before?
- Are you someone who thinks about how you can make things better?
Nutanix is an equal opportunity employer.
The Equal Employment Opportunity Policy is to provide fair and equal employment opportunity for all associates and job applicants regardless of race, color, religion, national origin, gender, sexual orientation, age, marital status, or disability. Nutanix hires and promotes individuals solely on the basis of their qualifications for the job to be filled.
Nutanix believes that associates should be provided with a working environment that enables each associate to be productive and to work to the best of his or her ability. We do not condone or tolerate an atmosphere of intimidation or harassment based on race, color, religion, national origin, gender, sexual orientation, age, marital status or disability.
We expect and require the cooperation of all associates in maintaining a discrimination and harassment-free atmosphere.