The Opportunity

We are looking for a Principal ML Engineer to lead the technical architecture and engineering strategy for integrating advanced AI into high-impact Healthcare Information Systems (HIS).

This role focuses on building reliable, scalable, production-ready ML systems rather than on experimental modelling. You will design the MLOps ecosystem and cloud infrastructure needed to move models from development into production in critical real-world environments, and you will play a key role in connecting raw data with robust AI services that operate securely at scale.

Key Responsibilities

1. MLOps & System Architecture

Lead the design and implementation of end-to-end ML lifecycle management, including automated CI/CD pipelines, model and data versioning (e.g. MLflow, DVC), and reproducible experimentation.
Architect high-performance model serving layers for both LLMs and classical ML models, ensuring low latency, high availability, and security within a healthcare cloud environment.
Build infrastructure supporting agent-based reasoning systems, ensuring workflows are traceable, auditable, and integrated into existing HIS platforms.

2. Data Engineering & Infrastructure

Design robust ETL/ELT data pipelines that transform healthcare data in formats such as FHIR, HL7, and DICOM into high-quality features for real-time and batch inference.
Manage and optimise cloud infrastructure (AWS, Azure, or GCP) using Infrastructure as Code tools such as Terraform or Pulumi.
Implement monitoring and observability frameworks that detect data drift, model degradation, and system bottlenecks before they affect outcomes.

3. Technical Leadership & Governance

Act as lead architect for the ML platform, ensuring compliance with HIPAA requirements and the HITRUST CSF, and applying security-by-design principles.
Establish engineering best practices around code quality, containerisation (Docker/Kubernetes), and documentation.
Mentor engineers and promote an engineering-driven approach to machine learning, focusing on maintainable and scalable solutions.

Candidate Profile

Education & Experience

Master’s or PhD in Computer Science, Software Engineering, or a related technical field.
10+ years of software engineering experience, including 6+ years deploying and maintaining large-scale ML systems in production environments.

Core Technical Skills

Expert knowledge of cloud platforms (AWS, GCP, or Azure) and orchestration tools such as Kubernetes, Kubeflow, or Airflow.
Strong programming expertise in Python, plus at least one of Java, Go, or a similar language.
Solid backend engineering and system design experience.
Data engineering expertise with tools such as Spark, Snowflake, or Databricks, including experience designing scalable feature stores.
Hands-on experience deploying generative AI (LLMs) and agentic frameworks (e.g. LangChain, LangGraph) within containerised microservice environments.

Preferred Experience

GPU optimisation, model quantisation, or specialised serving frameworks such as vLLM or Hugging Face Text Generation Inference (TGI).
Security and compliance experience in regulated industries such as healthcare, finance, or defence.
Strong distributed systems design skills, including high-concurrency services and large-scale data processing.