Site Reliability Engineer
Description
What do we do?
- We select, evaluate and personalize advertisements: one million times per second, with a latency below 50 ms.
- Ensuring that all our services are available 24/7.
- Automation of routine operations (e.g. deploying new machines / scaling services).
- Detailed monitoring of all production systems.
- Detecting (and fixing!) problems before they cause big trouble.
- Minimizing the duration and consequences of failures.
- Carrying out non-standard operations (e.g. data migration, implementation of a new solution) quickly and efficiently, but in a controlled way, without affecting the operation of the system, the work of developers, etc.
- Searching for the best possible solutions and technologies to achieve the above-mentioned goals.
Who are you?
- You perfectly understand how computers work (at every level: from hardware, through system, to software).
- You can observe, monitor and analyze the performance of production systems (and draw valuable conclusions from these observations).
- You know various tools that let you dig into the guts of a system (e.g. top, strace and VisualVM).
- You know (and are not merely familiar with) a lot of technologies, and you are not afraid to learn new ones.
- You can quickly understand how the components we use work (e.g. how exactly does Aerospike arrange data on disk?).
- You feel comfortable with distributed systems (not only within a LAN, but also spanning the world).
- You know perfectly well what the theoretical performance of the equipment is ...
- ... and you can answer the question why (yet!) we do not observe it in practice.
- You are interested in both the lowest layers of the technology stack (hardware, network, OS ...) and the business role of individual components.
- You are fluent in the code of the systems you take care of. You can easily find something there, check it and modify it if necessary.
- Not only are you able to efficiently extinguish a fire, but you can also identify its fundamental cause and propose a way to prevent similar events in the future.
- You can automate your work with Python and Bash. You are not afraid to look at any code.
- You comprehensively cover the topics you are working on.
- You understand that from proof-of-concept to production-grade software there is sometimes some work to be done.
- You solve real problems instead of endlessly analyzing completely irrelevant issues.
What do we use?
- Java, Python...
- Aerospike, Memcached...
- HAProxy
- Jenkins
- Grafana, Graphite, ClickHouse
- PostgreSQL
- Kafka
- Kibana, Elasticsearch...
- Jetty
- Docker
- Google Cloud…
What do we offer?
- A very attractive salary
- Interesting project and technologies (https://github.com/RTBHOUSE).
- A challenging and constantly growing scale (currently: 4 DCs / several dozen equipment cabinets).
- Access to the latest technologies and the opportunity to actually use them in a large-scale, highly dynamic project.
- Work in a team of enthusiasts who are willing to share their knowledge and experience.
- Collaboration with people who understand technology (no PMs, analysts or managers).
- Extremely flexible working conditions - we do not have core hours, we do not have holiday limits, you can work fully remotely.
- The hardware and software you need