Senior DevOps Engineer
Site Reliability Engineer
RMS is looking for engineers who are fascinated by failure and passionate about finding and solving problems in distributed systems. Do you like to explore every way a complex system can fail, and to engineer reliability features into a Bigger than Big Data Analytics platform at the heart of a trillion-dollar industry transformation? If so, this role is designed for you.
We are building a new platform that:
- is a highly scalable, cloud-based SaaS offering that helps our clients understand and manage risk
- is based on Linux, Java, and open source technologies, and leverages the latest advances in database tools, vector processing, hardware-based acceleration techniques, and geographic visualization tools
- uses a unique Big Data approach that scales to massive sizes over time, along with large-scale distributed data processing technology
You will contribute by:
- Leveraging your knowledge of High Availability, Scalability, Reliability, and Efficiency for distributed systems to influence and improve our SaaS services, working with the architecture, engineering, and infrastructure teams.
- Implementing observability strategies and tools that detect failures proactively and deliver visibility into service metrics.
- Working with our security experts to automate security into the platform and services.
Required Experience and Skills
- Expert programming experience with Java
- Expertise in Linux software development
- Experience building observability into services via instrumentation, logging, and tracing.
- Good understanding of microservices concepts/architecture
- Experience developing cloud services and cloud platforms
- Experience with agile development and working with agile engineering teams
- Excellent communication skills, with a proven ability to convey complex ideas to others clearly and concisely
- BS/MS in Computer Science, Computer Engineering, Math, or equivalent professional experience
- Experience with Scala and Python
- Experience with HDFS, Spark, and relational databases such as Postgres
- Experience with open source monitoring and logging technologies such as Prometheus and ELK
- Experience implementing containers in a microservices environment
- Experience educating engineers about what to log, measure, and alert on, with an emphasis on surfacing trends to be used for SLA/SLO analysis
- Experience building mature analysis tooling and processes that help users understand the state of a distributed system at a given point in time.
- Experience in creating and delivering performance monitoring and insights for bespoke platforms via dashboards, scorecards and ad hoc analysis.