Role Summary
The AI Engineer will build the MLOps and AIOps foundations for Firmus AI Factory, our AI platform, to make it trustworthy, repeatable, and scalable. This is a pioneering role in which you will establish the end-to-end MLOps workflows, turning model development into a disciplined release process with clear governance, automated evaluation gates, and reliable promotion to production. You will also enable our Model Arena initiative by operationalizing the evaluation pipelines and standards so that model choices for RAG and agentic applications are data-driven, reproducible, and production-safe.
You are also the reliability owner for all AI features of Firmus AI Factory: training jobs, inference services, and RAG systems. You will define quality gates, model promotion workflows, production monitoring, and incident response procedures. Your job is to make AI features as trustworthy as core infrastructure: fast, reliable, and observable. You will work across the entire team, partnering with engineers on CI/CD gates, with data scientists on quality metrics, and with ops on L2/L3 incident response.
Key Responsibilities
• Design and own end-to-end MLOps workflows: training → evaluation → registry → deployment → monitoring → retraining/retirement in dev/staging/production environments, with clear standards and ownership boundaries.
• Own the model registry and promotion lifecycle (MLflow): stage/alias strategy, approvals, environment separation, access control, and rollback readiness.
• Establish reproducibility and lineage across the model lifecycle: versioned code/config, artifact traceability, dataset/version references, and repeatable evaluation runs.
• Design and implement automated model quality gates for production, covering criteria such as accuracy, latency, cost, and safety.
• Define SLOs/SLIs for all AI features: training job success rate, p99 inference latency, RAG retrieval accuracy, availability, and cost metrics.
• Build production monitoring dashboards: track model performance, data drift, operational health; integrate with alerting (PagerDuty, Slack, etc.).
• Create on-call runbooks and triage procedures for AI service incidents; lead postmortem-driven improvements.
• Instrument AI services for debugging: request traces, per-model GPU metrics, retrieval performance, and communication bottlenecks.
• Integrate evaluation frameworks (benchmarking, RAGAS, LLM-as-judge) into CI/CD pipelines.
Skills & Experience
Key Competencies
Success Metrics
Location & Reporting
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.
Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs and engineers passionate about sustainable computing infrastructure.
Firmus builds and operates AI infrastructure across Asia-Pacific, utilising its proprietary AI Factory platform to deliver transformative cost-effective GPU clusters and AI cloud services for developers, enterprise, education and government users.