Senior Site Reliability Engineer - Big Data

Data and Analytics | New York City, New York | San Francisco, California

Description

SUMMARY

The Data Reliability Engineering team for Disney Streaming Services (DSS), a segment of Disney Media & Entertainment Distribution (DMED), is responsible for maintaining and improving the reliability of DSS’ big data platform, which processes hundreds of terabytes of data and billions of events daily. We are looking for a Senior Data Reliability Engineer to help us in our ongoing mission to deliver outstanding service to our users and make DSS more data-driven. You will also work closely with our partner teams on our incident management process, including post-mortems, root cause analysis, and prevention of incident recurrence. You will make an outsized impact in an organization that values data as its top priority. Are you passionate about reliability engineering, automation, and software delivery excellence? If so, we believe this is the role for you.

WHAT YOU'LL DO

  • You will work closely with the team lead and engineering teams to improve, maintain, performance-tune, and plan capacity for DSS data platforms and infrastructure.
  • You will build solutions to continually improve our software release and change management process using industry best practices to help DSS maintain legal compliance.
  • You will work closely with other members of the Data Reliability Engineering team to set project deliverables, review design documents, perform code reviews, and mentor junior members of the team.
  • You will build out observability and intelligent monitoring of data pipelines and infrastructure to achieve early and automated anomaly detection and alerting.
  • You will plan service capacity and testing automation, design business continuity and disaster recovery plans and processes, and work with the engineering team on implementation.

WHAT TO BRING

  • 5+ years of experience working in Linux environments, and proficiency with cloud environments (AWS)
  • 5+ years of experience coding in one or more of the following programming languages: Python, Java, or Scala
  • 5+ years of hands-on experience in Reliability Engineering for high-performance, scalable, distributed data systems, with a focus on automation
  • Detailed problem-solving approach, coupled with a strong sense of ownership and drive
  • A strong bias to action and a passion for delivering high-quality data solutions
  • Deep understanding of CI/CD principles and familiarity with version control systems (Git)
