The Opportunity
We are looking for a Principal ML Engineer to lead the technical architecture and engineering strategy for integrating advanced AI into high-impact Healthcare Information Systems (HIS).
This role focuses on building reliable, scalable, and production-ready ML systems rather than experimental modelling. You will design the MLOps ecosystem and cloud infrastructure needed to move models from development into critical real-world environments. You will play a key role in connecting raw data with robust AI services that operate securely at scale.
Key Responsibilities
1. MLOps & System Architecture
- Lead the design and implementation of end-to-end ML lifecycle management, including automated CI/CD pipelines, model versioning (MLflow/DVC), and reproducible experimentation.
- Architect high-performance model serving layers for both LLMs and classical ML models, ensuring low latency, high availability, and security within a healthcare cloud environment.
- Build infrastructure supporting agent-based reasoning systems, ensuring workflows are traceable, auditable, and integrated into existing HIS platforms.
2. Data Engineering & Infrastructure
- Design robust ETL/ELT data pipelines that transform healthcare data standards such as FHIR, HL7, and DICOM into high-quality features for real-time and batch inference.
- Manage and optimise cloud infrastructure (AWS, Azure, or GCP) using Infrastructure as Code tools such as Terraform or Pulumi.
- Implement monitoring and observability frameworks to detect data drift, model degradation, and system bottlenecks before they affect outcomes.
3. Technical Leadership & Governance
- Act as lead architect for the ML platform, ensuring compliance with HIPAA and HITRUST requirements and adherence to security-by-design principles.
- Establish engineering best practices around code quality, containerisation (Docker/Kubernetes), and documentation.
- Mentor engineers and promote an engineering-driven approach to machine learning, focusing on maintainable and scalable solutions.
Candidate Profile
Education & Experience
- Master’s or PhD in Computer Science, Software Engineering, or a related technical field.
- 10+ years of software engineering experience, including 6+ years deploying and maintaining large-scale ML systems in production environments.
Core Technical Skills
- Expert knowledge of cloud platforms (AWS, GCP, or Azure) and orchestration tools such as Kubernetes, Kubeflow, or Airflow.
- Strong programming expertise in Python, plus Java, Go, or a similar language.
- Solid backend engineering and system design experience.
- Data engineering expertise with tools such as Spark, Snowflake, or Databricks, including scalable feature store design.
- Hands-on experience deploying Generative AI (LLMs) and agentic frameworks (LangChain/LangGraph) within containerised microservices environments.
Preferred Experience
- GPU optimisation, model quantisation, or specialised serving frameworks such as vLLM or TGI.
- Security and compliance experience within regulated industries like healthcare, finance, or defence.
- Strong distributed systems design skills, including high concurrency and large-scale data processing.