Principal Streaming Data Engineer

Technology and R&D New York, New York

Position at Medidata Solutions

Medidata: Conquering Diseases Together

Medidata is leading the digital transformation of life sciences, creating hope for millions of patients. Medidata helps generate the evidence and insights that pharmaceutical, biotech, medical device and diagnostics companies, and academic researchers need to accelerate value, minimize risk, and optimize outcomes. More than one million registered users across 1,400 customers and partners access the world's most-used platform for clinical development, commercial, and real-world data. Medidata, a Dassault Systèmes company, is headquartered in New York City and has offices around the world to meet the needs of its customers. Discover more at

Your Mission:

Medidata is the leader in developing technologies that allow our customers to get new medicines to patients faster. Building on our long history of delivering world-class clinical applications to the life sciences industry, Medidata’s Data Fabric organization is staffed with a passionate team of technology and scientific experts who tackle the industry’s most difficult technical challenges in order to push the boundaries of possibility for our clients and, most importantly, for patients. Together we can deliver meaningful, advanced digital transformation to the industry in order to achieve our vision.

The Principal Streaming Data Engineer role is an individual contributor position, and as such, does not have direct reports. Nonetheless, the role involves technical guidance over a team of streaming data engineers as well as broader thought leadership to advance stream data processing across the R&D organization.

As a Principal Streaming Data Engineer, you will:

  • Design and guide the implementation of an ecosystem of event-driven, distributed, stateful data pipelines that integrate and analyze diverse sources of clinical, sensor and real-world patient data to support biomarker discovery;

  • Champion and drive the development of incremental online machine learning algorithms to solve out-of-core analytical problems;

  • Proactively support and enhance the use, governance and evolution of a growing, cross-functional, multi-format type registry of schemas for streaming data algorithms and APIs;

  • Identify, interpret and communicate meaningful insights from streaming sources to make them accessible to patients, clinical research professionals and other stakeholders;

  • Provide thought leadership around event-driven architecture and streaming data solutions and act as a force-multiplier to help other data engineers who are less experienced in the practice of stateful stream processing;

  • Document, compare, critique, recommend and advocate alternative data architecture and data modeling solutions; and

  • Evangelize event-driven architecture and online learning solutions across the R&D organization.

Your Competencies:

  • Extensive experience designing and building event-driven data pipelines using Kafka and Flink in a Kubernetes environment at scale

  • Deep experience working and deploying in the AWS Cloud, especially its file and data-oriented services

  • Extensive experience designing enterprise and/or industry data models and data-oriented APIs, preferably using GraphQL as well as semantic, linked-data technologies like RDF, OWL, and SHACL

  • Experience using schema registries to serialize and deserialize event streams in a backward-compatible way, using, for instance, Avro and Protobuf

  • Experience measuring and ensuring data quality

  • Considerable experience using a variety of relational, columnar, key-value, document and graph database technologies, including, for instance, Postgres, MySQL, Cassandra, Dynamo, Mongo, Athena, Elasticsearch, Redis, Neptune, Stardog, Anzograph or other triple stores

  • Expert-level SQL coding knowledge, including writing and optimizing complex queries

  • Demonstrable coding skill in at least two languages other than SQL, such as Scala, Python, Java, Go, TypeScript, or Rust, in the context of data-oriented problems

  • Experience analyzing time series data with streaming or batch algorithms

  • Experience applying machine learning algorithms to large data sets, particularly online learning techniques

  • Experience applying agile software development methodology to enterprise data engineering, with tools like Git, JIRA, Travis and others

Your Education and Experience:

  • Master's degree in science, engineering, math, data or a relevant field

  • Minimum of eight (8) years working in software development and at least four (4) years of experience working in event-driven data engineering or data science with relevant big data streaming technologies

  • Must be located in the continental US

Medidata is making a real difference in the lives of patients everywhere by accelerating critical drug and medical device development, enabling life-saving drugs and medical devices to get to market faster. Our products sit at the convergence of the Technology and Life Sciences industries, one of the most exciting areas for global innovation. Nine of the top 10 best-selling drugs in 2017 were developed on the Medidata platform.

Medidata Solutions has powered more than 17,000 clinical trials, giving us the largest collection of clinical trial data in the world. With this asset, we pioneer innovative, advanced applications and intelligent data analytics, bringing an unmatched level of quality and efficiency to clinical trials and enabling treatments to reach waiting patients sooner.

Medidata Solutions, Inc. is an Equal Opportunity Employer. Medidata Solutions provides equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability status, protected veteran status, or any other characteristic protected by the law. Medidata Solutions complies with applicable state and local laws governing non-discrimination in employment in every location in which the company has facilities.