AI Engineer, AI & Applications

This role has expired and is no longer accepting applications.
Sustainable Metal Cloud (SMC)
Launceston, TAS | Sydney, NSW
Full Time / Permanent

Posted 1 month ago

Role Summary

The AI Engineer will set up and build the MLOps and AIOps foundations for Firmus AI Factory, our AI platform, to make it trustworthy, repeatable, and scalable. This is a pioneering role where you will establish the end-to-end MLOps workflows—turning model development into a disciplined release process with clear governance, automated evaluation gates, and reliable promotion to production. You will also enable our Model Arena initiative by operationalizing the evaluation pipelines and standards so model choices for RAG and agentic applications are data-driven, reproducible, and production-safe. You are also the reliability owner for all Firmus AI Factory AI features: training jobs, inference services, and RAG systems. You'll define quality gates, model promotion workflows, production monitoring, and incident response procedures. Your job is to make AI features as trustworthy as core infrastructure—fast, reliable, and observable. You'll work across the entire team: partnering with engineers on CI/CD gates, with data scientists on quality metrics, and with ops on L2/L3 incident response.

Key Responsibilities

  • Design and own end-to-end MLOps workflows: training → evaluation → registry → deployment → monitoring → retraining/retirement in dev/staging/production environments, with clear standards and ownership boundaries.
  • Own the model registry and promotion lifecycle (MLflow): stage/alias strategy, approvals, environment separation, access control, and rollback readiness.
  • Establish reproducibility and lineage across the model lifecycle: versioned code/config, artifact traceability, dataset/version references, and repeatable evaluation runs.
  • Design and implement automated model quality gates for production, covering quality metrics such as accuracy and latency, as well as cost and safety.
  • Define SLOs/SLIs for all AI features: training job success rate, inference latency p99, RAG retrieval accuracy, availability, cost metrics.
  • Build production monitoring dashboards: track model performance, data drift, operational health; integrate with alerting (PagerDuty, Slack, etc.).
  • Create on-call runbooks and triage procedures for AI service incidents; lead postmortem-driven improvements.
  • Instrument AI services for debugging: request traces, per-model GPU metrics, retrieval performance, communication bottlenecks.
  • Integrate evaluation frameworks (benchmarking, RAGAS, LLM-as-judge) into CI/CD pipelines.

Skills & Experience

  • 5–8 years in MLOps / ML platform / production engineering roles with hands-on ownership of production ML delivery pipelines.
  • Deep understanding of ML lifecycle: model versioning, promotion strategies, evaluation automation, governance, deployment strategies, monitoring, drift detection.
  • Hands-on experience with MLflow Model Registry workflows (stages/aliases, approvals, traceability) and integrating registry actions into release pipelines.
  • Experience operationalizing model evaluation systems (metrics standards, orchestration, logging, reproducibility).
  • Strong observability and production fundamentals: metrics/logs/traces, alert design, incident response, and reliability mindset.
  • Familiarity with CI/CD pipelines, model packaging, and deployment automation; comfortable collaborating across ML engineers, platform/SRE, and application teams to turn requirements into robust workflows.
  • Understanding of distributed systems, resource management, and failure modes in training and inference environments.

Key Competencies

  • Production Ownership: comfortable owning services in production; proactive about monitoring, alerting, and preventing issues.
  • Reliability Engineering: can define SLOs and error budgets, and foster a blameless postmortem culture.
  • Cross-Functional Leadership: works with ML engineers, data scientists, and platform teams; unblocks reliably.
  • Incident Response: triage skills, root cause analysis, systemic thinking (not just fighting fires).
  • Programmatic Automation: reduces toil and makes the right path the default, balancing rigor with speed.
  • Communication: explains complex ML/systems issues clearly to both technical and non-technical stakeholders.

Success Metrics

  • Reproducible, auditable model release workflow becomes the default across teams (clear lineage and consistent promotion standards).
  • Automated evaluation gates prevent the majority of quality/performance regressions from reaching production.
  • Model registry and deployment practices support safe rollouts and fast rollbacks with minimal disruption.
  • Reliable AI services (SLO-driven): training/inference/RAG services consistently meet reliability targets and error budgets.
  • Faster detection and recovery: incident MTTD/MTTR improves over time and repeat incident classes become less frequent.
  • Higher signal-to-noise alerting: fewer redundant alerts per true incident through correlation/deduplication improvements.
  • Operational automation maturity increases: more incident classes handled with consistent triage and safe automation.

Location & Reporting

  • Singapore or Australia (Launceston, TAS or Sydney, NSW)
  • Reporting to Head of AI & Applications

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering.

Apply now to be part of shaping the future of sustainable AI infrastructure.