AI Engineer, AI & Applications

Firmus Technologies
Launceston, TAS | Sydney, NSW
Full Time

Posted 3 months ago

Role Summary

The AI Engineer will build the MLOps and AIOps foundations for Firmus AI Factory, our AI platform, making it trustworthy, repeatable, and scalable. This is a pioneering role in which you will establish end-to-end MLOps workflows, turning model development into a disciplined release process with clear governance, automated evaluation gates, and reliable promotion to production. You will also enable our Model Arena initiative by operationalizing its evaluation pipelines and standards, so that model choices for RAG and agentic applications are data-driven, reproducible, and production-safe.

You will be the reliability owner for all AI features on Firmus AI Factory: training jobs, inference services, and RAG systems. You will define quality gates, model promotion workflows, production monitoring, and incident response procedures. Your job is to make AI features as trustworthy as core infrastructure: fast, reliable, and observable. You will work across the entire team, partnering with engineers on CI/CD gates, with data scientists on quality metrics, and with ops on L2/L3 incident response.


Key Responsibilities

•    Design and own end-to-end MLOps workflows: training → evaluation → registry → deployment → monitoring → retraining/retirement in dev/staging/production environments, with clear standards and ownership boundaries. 
•    Own the model registry and promotion lifecycle (MLflow): stage/alias strategy, approvals, environment separation, access control, and rollback readiness.
•    Establish reproducibility and lineage across the model lifecycle: versioned code/config, artifact traceability, dataset/version references, and repeatable evaluation runs. 
•    Design and implement automated model quality gates for production, covering criteria such as accuracy, latency, cost, and safety.
•    Define SLOs/SLIs for all AI features: training job success rate, inference latency p99, RAG retrieval accuracy, availability, cost metrics. 
•    Build production monitoring dashboards: track model performance, data drift, operational health; integrate with alerting (PagerDuty, Slack, etc.). 
•    Create on-call runbooks and triage procedures for AI service incidents; lead postmortem-driven improvements. 
•    Instrument AI services for debugging: request traces, GPU metrics per-model, retrieval performance, communication bottlenecks. 
•    Integrate evaluation frameworks (benchmarking, RAGAS, LLM-as-judge) into CI/CD pipelines.


Skills & Experience

  • 5–8 years in MLOps / ML platform / production engineering roles with hands-on ownership of production ML delivery pipelines.
  • Deep understanding of ML lifecycle: model versioning, promotion strategies, evaluation automation, governance, deployment strategies, monitoring, drift detection.
  • Hands-on experience with MLflow Model Registry workflows (stages/aliases, approvals, traceability) and integrating registry actions into release pipelines.
  • Experience operationalizing model evaluation systems (metrics standards, orchestration, logging, reproducibility).
  • Strong observability and production fundamentals: metrics/logs/traces, alert design, incident response, and reliability mindset.
  • Familiarity with CI/CD pipelines, model packaging, and deployment automation; comfortable collaborating across ML engineering, platform/SRE, and application teams to turn requirements into robust workflows.
  • Understanding of distributed systems, resource management, and failure modes in training and inference environments.


Key Competencies

  • Production Ownership: comfortable owning services in production; proactive about monitoring, alerting, and preventing issues.
  • Reliability Engineering: can define SLOs and error budgets and foster a blameless postmortem culture.
  • Cross-Functional Leadership: works with ML engineers, data scientists, and platform teams; unblocks reliably.
  • Incident Response: triage skills, root cause analysis, systemic thinking (not just fighting fires).
  • Programmatic Automation: reduces toil and makes the right path the easy path, balancing rigor with speed.
  • Communication: explains complex ML/systems issues clearly to both technical and non-technical stakeholders.


Success Metrics

  • Reproducible, auditable model release workflow becomes the default across teams (clear lineage and consistent promotion standards).
  • Automated evaluation gates prevent the majority of quality/performance regressions from reaching production.
  • Model registry and deployment practices support safe rollouts and fast rollbacks with minimal disruption.
  • Reliable AI services (SLO-driven): training/inference/RAG services consistently meet reliability targets and error budgets.
  • Faster detection and recovery: incident MTTD/MTTR improves over time and repeat incident classes become less frequent.
  • Higher signal-to-noise alerting: fewer redundant alerts per true incident through correlation/deduplication improvements.
  • Operational automation maturity increases: more incident classes handled with consistent triage and safe automation.


Location & Reporting

  • Singapore or Australia (Launceston, TAS or Sydney, NSW)
  • Reporting to Head of AI & Applications


Employment Basis

Full-time


Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. 

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure. 

About Firmus Technologies

Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs and engineers passionate about sustainable computing infrastructure.

Firmus builds and operates AI infrastructure across Asia-Pacific, utilising its proprietary AI Factory platform to deliver transformative, cost-effective GPU clusters and AI cloud services for developers, enterprise, education, and government users.
