Senior Data Engineer

Technology
Remote, London, United Kingdom


Description


Role Overview
This role sits at the intersection of Systems Engineering, ML Infrastructure, and Applied Research. You will not be building dashboards; you will be building the high-performance engines that transform raw video into production-ready model fuel for the BRAHMA AI platform. You are the bridge between raw storage and the GPU. Your pipelines won't just move data; they will process it: extracting facial landmarks, generating embeddings, and computing complex features that directly power our models. You'll also work closely with the applied research team, implementing feature extraction on an ad-hoc basis as new requests come in.
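
For concreteness, here is a minimal sketch of the kind of per-clip extraction step these pipelines wrap. The landmark_model and embedding_model callables are hypothetical stand-ins for research models, not a reference to our actual stack:

```python
# Minimal sketch of a per-clip feature-extraction step.
# `landmark_model` / `embedding_model` are hypothetical callables.
import cv2
import numpy as np

def extract_features(video_path: str, landmark_model, embedding_model) -> dict:
    """Decode a clip and compute per-frame features for training."""
    cap = cv2.VideoCapture(video_path)
    landmarks, embeddings = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        landmarks.append(landmark_model(rgb))    # e.g. facial landmarks
        embeddings.append(embedding_model(rgb))  # e.g. a face embedding
    cap.release()
    return {
        "path": video_path,
        "landmarks": np.stack(landmarks) if landmarks else np.empty((0,)),
        "embeddings": np.stack(embeddings) if embeddings else np.empty((0,)),
    }
```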

 

What You'll Do
  • Feature engineering at scale: Partner closely with research scientists to operationalize feature extraction logic. You'll take experimental code (e.g., for facial landmark detection, audio spectral analysis, or embedding generation) and refactor it into highly parallelized, production-grade pipelines.
  • Architect unstructured data pipelines: Design and build high-throughput systems using Python and Ray to ingest, decode/transcode, and validate massive video/audio datasets (a minimal Ray sketch follows this list).
  • Orchestrate ML workflows: Replace fragile scripts with robust, observable DAGs using Dagster (or Airflow), ensuring reproducibility, fault tolerance, and clear lineage for long-running training jobs (see the Dagster sketch below this list).
  • Optimize compute & cost: Manage the trade-off between speed and budget. You will decide when to use Spot Instances, how to pack data for GPU saturation, and how to minimize serialization overhead.
  • Build the "dataset-as-code": Implement tooling for automated filtering, deduplication, and quality scoring (see the deduplication sketch below this list). You ensure that as our models evolve, our datasets evolve with them in versioned, reproducible states.
  • Infrastructure ownership: Deploy and manage your own workloads on Kubernetes (EKS), ensuring scalability and resource isolation without relying entirely on DevOps.
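
To make the Ray bullet concrete, here is a minimal fan-out sketch using Ray core tasks. The clip keys and the stub task body are illustrative, not our actual pipeline:

```python
# Minimal Ray sketch: fan per-clip processing out across a cluster.
import ray

ray.init()  # local mode here; in production this attaches to the cluster

@ray.remote(num_cpus=1)
def process_clip(path: str) -> dict:
    # In the real pipeline this would call a feature-extraction step like
    # the one sketched above; a stub keeps the example self-contained.
    return {"path": path, "n_frames": 0}

clip_paths = [f"clip_{i:04d}.mp4" for i in range(1000)]  # hypothetical keys
futures = [process_clip.remote(p) for p in clip_paths]
records = ray.get(futures)  # blocks until all tasks finish
print(len(records))
```

Each task handles one clip independently, so throughput scales with worker count rather than with any single machine.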
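And a minimal Dagster sketch of the orchestration bullet: two assets with explicit lineage, where the asset names and bodies are hypothetical:

```python
# Minimal Dagster sketch: script stages become observable, re-runnable
# assets with explicit lineage.
from dagster import asset, Definitions

@asset
def raw_clip_manifest() -> list[str]:
    # In production: list newly landed clips from object storage.
    return ["clip_0001.mp4", "clip_0002.mp4"]  # hypothetical keys

@asset
def clip_features(raw_clip_manifest: list[str]) -> list[dict]:
    # Depends on the manifest via its parameter name, so Dagster tracks
    # lineage and can retry or backfill this stage independently.
    return [{"path": p} for p in raw_clip_manifest]

defs = Definitions(assets=[raw_clip_manifest, clip_features])
```

Compared with a cron-driven script, failures surface per asset and a backfill is a targeted re-materialization rather than a rerun of everything.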
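For the dataset-as-code bullet, a minimal deduplication-plus-quality-gate sketch; the quality_score field is a hypothetical per-sample score:

```python
# Minimal sketch: content-hash deduplication plus a quality gate, the kind
# of filter that keeps dataset versions reproducible.
import hashlib

def dedup_and_filter(records: list[dict], min_score: float = 0.5) -> list[dict]:
    seen: set[str] = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec["bytes"]).hexdigest()  # content hash
        if digest in seen:
            continue  # exact duplicate: drop
        seen.add(digest)
        if rec.get("quality_score", 0.0) >= min_score:  # hypothetical score
            kept.append(rec)
    return kept
```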

 

What You'll Need
  • Python native: Expert-level Python. You understand asyncio, multiprocessing, and memory management, and you know how to write efficient code that isn't needlessly I/O-bound.
  • Unstructured data experience: You have built multi-modal pipelines (e.g., video/audio/images). You understand codecs, file formats (Parquet/WebDataset), and the challenges of small-file overhead on S3 (see the shard-packing sketch after this list).
  • Modern compute frameworks: Practical experience with distributed computing tools like Ray, Dask, or Spark.
  • Containerization & cloud: Comfortable wrapping code in Docker and deploying to Kubernetes. Understanding of cloud storage classes, lifecycle rules, and throughput limits.
  • Systems mindset: You understand that "working code" isn't enough; it must be cost-effective and observable. You care about latency, CPU/GPU utilization, and error handling.
  • Proactive ownership: You don't just flag issues; you bring solutions. You identify bottlenecks early and quickly spin up MVPs to demonstrate the fix without waiting for a ticket.
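
On small-file overhead: thousands of tiny S3 objects make per-request latency dominate, so pipelines pack samples into large sequential shards. Here is a standard-library sketch of the WebDataset-style tar layout (keys and payloads are illustrative):

```python
# Pack many small samples into one large, sequentially readable tar shard
# (the WebDataset layout: members named <key>.<extension>).
import io
import tarfile

def write_shard(shard_path: str, samples: list[tuple[str, bytes]]) -> None:
    with tarfile.open(shard_path, "w") as tar:
        for key, payload in samples:
            info = tarfile.TarInfo(name=f"{key}.jpg")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# One large shard replaces thousands of tiny S3 GETs with a single
# sequential read that can saturate network throughput.
write_shard("shard-000000.tar", [("sample_000", b"\xff\xd8fake-jpeg")])
```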

 

Bonus Points
  • Deep familiarity with FFmpeg, OpenCV, or video compression standards (a minimal FFmpeg example follows this list).
  • Exposure to vector databases or embedding generation pipelines.
  • You don't need to be a researcher, but you must know how to run inference for feature extraction within a data pipeline context.
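
As an example of the FFmpeg familiarity we mean, here is a minimal sketch that shells out to ffmpeg to sample frames from a clip for downstream inference; paths and rates are illustrative:

```python
# Minimal sketch: decode a clip into frames a feature-extraction model
# can consume, by shelling out to FFmpeg.
import subprocess

def extract_frames(video_path: str, out_pattern: str, fps: int = 1) -> None:
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-loglevel", "error",
            "-i", video_path,     # input clip
            "-vf", f"fps={fps}",  # sample N frames per second
            out_pattern,          # e.g. "frames/out_%06d.jpg"
        ],
        check=True,
    )

extract_frames("clip_0001.mp4", "frames/out_%06d.jpg")
```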

 

Logistics
Location: Remote / Hybrid / London, UK
Context: Fast-paced, high-autonomy environment. We value shipping velocity and architectural simplicity.

 

About BRAHMA AI:
BRAHMA AI is the next generation of enterprise media technology formed through the integration of Prime Focus Technologies and Metaphysic. By combining CLEAR®, CLEAR® AI, ATMAN, and VAANI into one ecosystem, BRAHMA AI enables enterprises to manage, create, and distribute content with intelligence, security, and efficiency.


Proven, scalable, and enterprise-tested, BRAHMA AI is helping global organizations accelerate growth, efficiency, and creative impact in the AI-powered era.