Senior Site Reliability Engineer

Tech Development
Hybrid Remote, London, United Kingdom

Description

Position at Moneycorp

Welcome to Moneycorp

We’re delighted you’re interested in being a part of Moneycorp. In the last decade, Moneycorp has transformed from a largely domestic, consumer-focused provider of foreign exchange to an end-to-end global payments ecosystem. With two banking licences and operations across the entire value chain of the international payments and foreign exchange sectors, we enable businesses, institutions, and individuals to thrive beyond borders.

We help our clients realise their growth ambitions by providing them with worldwide reach, relentless regulatory excellence, and tailored, relevant solutions that resiliently optimise their financial operations. We’re fervent about pursuing our goals, making substantial contributions to the payments industry, and consistently offering unwavering support to our clients at every stage of their journey.

Moneycorp is a place where energy, commitment to our shared success and collaboration are core to our DNA. We’re restless in our drive to surpass the expectations of our clients and unlock opportunities to support them at every stage of their journey. The foundation of our success is our people, and nurturing a culture of belonging for all of our colleagues is central to our journey as a global business.

Find out more about Moneycorp’s offering, global footprint and capabilities here:

About Us | moneycorp

Your Next Challenge

Our Technology Journey:
We are at a defining moment in our technology transformation. Having built strong foundations in traditional infrastructure and networking, we are now accelerating towards a cloud‑native, highly automated, and reliability‑first engineering culture. This shift represents more than technical evolution: it is a strategic redesign of how we architect platforms, deliver services, and scale with the expectations of a modern financial institution.

As we modernise legacy systems, embed DevOps and observability best practice, strengthen operational resilience, and move decisively into cloud‑native patterns, we are laying the groundwork for our next decade of growth across payments and FX.

Why this matters for you

Joining our team at this stage means shaping the destination, not simply following the roadmap. You will:

Lead the evolution from legacy IaaS to performant, resilient, cloud‑native services.
Define and embed the reliability, observability, and automation practices that guide how engineering teams design and operate platforms.
Act as a coach, engineer, strategist, and technical leader across multiple teams and functions.
Work deeply across technology, product, sales, legal, and operations to translate reliability into measurable business value.

This transformation is rich with complex systems challenges, and equally rich with opportunities for innovation, influence, and engineering excellence.

Key Responsibilities:

Reliability Engineering & Observability

Define and maintain SLOs/SLIs and error budgets for critical services.
Design, build and enhance observability tooling and pipelines (metrics, logs, traces) used by engineering teams.
Create and maintain dashboards for golden signals.
Develop and update incident runbooks; lead post-incident reviews.
Approve SLO/SLI targets and error budgets for Tier-1 services.

Proactive Monitoring, Capacity & Performance

Implement anomaly detection and predictive monitoring.
Perform capacity forecasting for cloud and IaaS workloads.
Tune systems for throughput and latency improvements.
Use telemetry to identify and resolve performance bottlenecks.

Automation, DR & Resilience Testing

Automate backup, restore, and failover processes.
Validate RTO/RPO targets through regular DR tests.
Design and execute chaos engineering experiments.
Enhance self-healing and rollback/roll-forward automation.

Operational Excellence & Incident Leadership

Act as incident commander during SEV-1/SEV-2 events; authorise rollback/run decisions.
Drive root cause analysis and implement permanent fixes.
Eliminate toil via scripting and automation.
Standardise reliability patterns across teams.

Risk, Compliance & Service Mapping

Map dependencies for important business services.
Conduct severe-but-plausible scenario tests.
Support legal in drafting precise, customer-oriented SLAs informed by real operational behaviour.
Sign off on resilience test results and impact tolerance evidence.
Document and evidence compliance with impact tolerances.
Support third-party resilience assessments.

Refactoring & Modernisation

Identify problematic platform components impacting scale/resilience.
Prioritise reliability-driven refactors and approve migration patterns.
Engineer replacements using modern patterns (e.g., containerisation, service mesh).
Lead migration strategies with measurable reliability outcomes.

Skills, Qualifications and Experience Required:

Knowledge and Experience:

Site Reliability Engineering: 7+ years in SRE, platform, or systems roles with production ownership of high-availability, low-latency platforms.
Cloud Platforms (Azure): Deep experience with Azure services including IaaS, ASE, AKS/ARO, VNets, App Gateway, Azure SQL/SQL Managed Instance/On-Prem SQL, Service Bus, Event Hubs, Kafka and Key Vault.
Secure-by-Design: Strong background in architecture governance, design reviews, and change management. Demonstrated expertise in security-by-design, Zero Trust principles, and compliance with regulatory frameworks.
Infrastructure as Code (IaC): Proven use of Terraform for modular infrastructure design, policy enforcement, and environment provisioning.
CI/CD Pipelines: Experience with Azure DevOps and GitHub Actions for automated build, test, and deployment workflows. Hands-on experience with infrastructure as code (Terraform/Bicep), CI/CD pipelines, and automation.
Observability & Monitoring: Hands-on with Prometheus, Grafana, OpenTelemetry, and log aggregation tools; building dashboards and alerting policies. Knowledge of observability and reliability engineering (SLOs, error budgets, monitoring, AIOps). Experience with FinOps practices, cost optimisation, and cloud commercials (EA, reservations, savings plans).
Incident Management: Leading SEV-1/SEV-2 incidents, conducting post-mortems, and driving root-cause elimination.
Disaster Recovery & Resilience Testing: Designing and validating RTO/RPO targets, executing chaos engineering experiments, and automating recovery.
IaaS & OS Engineering: Strong background in Windows Server (2019/2022/2025) and Linux (RHEL/Ubuntu) across Azure IaaS.
Payments & FX Platforms: Familiarity with payments orchestration, FX workflows, and platform refactoring to improve scale and resilience.
Operational Resilience: Understanding of UK regulatory expectations (FCA/PRA) including impact tolerances, service mapping, and scenario testing. Track record in incident management, DR/BCP testing, and resilience planning.

Skills:

Site Reliability Engineering

Demonstrates deep expertise in Site Reliability Engineering practices, including SLO/SLI definition, error budgeting, and incident response leadership.
Leads major incident response as technical commander, driving root-cause analysis and continuous improvement in MTTR and alert quality.
Refactors legacy platform components and replaces them with modern, scalable, and resilient architectures, particularly within payments and FX systems.

Cloud Engineering (Azure), IaaS (Windows & Linux)

Applies deep knowledge of Azure services (AKS, VNets, App Gateway, Service Bus, Event Hubs, managed databases).
Designs resilient cloud architectures using multi-region, zonal services, and geo-replication strategies.
Engineers and supports Windows Server and Linux platforms across IaaS environments, including patching, hardening, identity integration, and backup strategies.
Manages OS lifecycle, configuration management, and virtualisation across Azure IaaS.

Observability & Monitoring

Expert‑level proficiency in Prometheus, including metric design, optimisation, alerting strategies, and ecosystem tooling.
Designs and operates observability frameworks using metrics, logs, traces, and synthetic monitoring.
Builds dashboards for golden signals and configures alerting policies to reduce noise and improve response times.
Implements proactive monitoring and anomaly detection strategies to identify reliability risks.
Tunes systems for performance and scalability, optimising throughput, latency, and tail latency.
Performs capacity forecasting and baseline performance analysis across cloud and IaaS workloads.

Resilience, Security and Compliance

Engineers and validates disaster recovery and failover processes, ensuring RTO/RPO targets are met.
Plans and executes chaos engineering experiments to uncover hidden failure modes.
Automates recovery and self-healing capabilities using scripting and platform-native features.
Integrates and manages identity and access controls across AD, AAD, and OS-level authentication mechanisms.
Applies security best practices in OS hardening, certificate management, endpoint protection, and firewall configuration.
Maps dependencies and conducts operational resilience testing aligned to UK regulatory expectations (FCA/PRA).
Documents and evidences compliance with operational resilience standards and severe-but-plausible scenario testing.

Infrastructure as Code & Automation

Uses Infrastructure as Code tools such as Terraform to provision and enforce consistent infrastructure patterns.
Builds and maintains CI/CD pipelines using Azure DevOps and GitHub Actions.
Applies strong scripting skills in PowerShell, Bash, and Python to automate platform operations and reliability tasks.
Implements policy-as-code and configuration drift detection for compliance and governance.

Education:

Bachelor’s degree in Computer Science, Engineering, or a related technical discipline, or equivalent hands-on experience in platform engineering and reliability roles.

Certifications (Preferred, Not Mandatory):
- Microsoft Azure: AZ-104 (Administrator), AZ-400 (DevOps), AZ-700 (Networking)
- Kubernetes: Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD)
- HashiCorp: Terraform Associate

Professional Development:

Commitment to continuous learning in cloud-native technologies, automation, and operational resilience frameworks (e.g., NIST CSF, ISO 27001, FCA/PRA guidelines).

Languages:

English: Fluent in written and spoken communication.
Additional languages are a plus for vendor and global stakeholder engagement.

Personal Attributes:

Strategic thinker with a product mindset for platform operations.
Calm under pressure, especially during incidents and high-impact events.
Security-conscious and risk-aware without compromising delivery speed.
Customer-focused, prioritising developer experience and business outcomes.
Data-driven, using metrics and analytics for decision-making.
Collaborative leader, fostering inclusion, learning, and continuous improvement.
Excellent communicator, able to influence and align diverse stakeholders.

Interested?

If the role sounds like you, we invite you to upload a copy of your CV by clicking the Apply Now button.

Fostering a culture of belonging and inclusivity

We're committed to creating a workplace where every individual feels valued, respected, and included. As an Equal Opportunity Employer, we actively cultivate an inclusive culture where diversity thrives, and we empower our colleagues to drive meaningful change within our organisation through initiatives like our DE&I focus groups and value champion network.

Like many of our peers, we recognise that fostering inclusivity is an ongoing journey, and we remain steadfast in our commitment to progress. By measuring our efforts through regular assessments and listening to the feedback of our employees, we strive to ensure that our initiatives are impactful and responsive to the evolving needs of our workforce.

Together, we want to build a workplace where everyone can bring their authentic selves to work, as we believe this is the foundation of innovation, creativity, and collective success.

Connect with us

For company news, announcements and market insights, visit our News Hub.

You can also find Moneycorp on Facebook, Twitter UK, Twitter Americas, Instagram, and LinkedIn, where you can discover how we are leading the way in global payments and currency risk management.
