AI Engineer, AI & Applications at Sustainable Metal Cloud (SMC) (Expired)

AI Jobs Australia

AI Engineer, AI & Applications

This role has expired and is no longer accepting applications.

Sustainable Metal Cloud (SMC)

Launceston, TAS | Sydney, NSW

Full Time / Permanent

Posted 3 months ago

Role Summary

The AI Engineer will set up and build the MLOps and AIOps foundations for Firmus AI Factory, our AI platform, to make it trustworthy, repeatable, and scalable. This is a pioneering role where you will establish the end-to-end MLOps workflows, turning model development into a disciplined release process with clear governance, automated evaluation gates, and reliable promotion to production. You will also enable our Model Arena initiative by operationalizing the evaluation pipelines and standards so model choices for RAG and agentic applications are data-driven, reproducible, and production-safe.

You are also the reliability owner for all Firmus AI Factory AI features: training jobs, inference services, and RAG systems. You'll define quality gates, model promotion workflows, production monitoring, and incident response procedures. Your job is to make AI features as trustworthy as core infrastructure: fast, reliable, and observable. You'll work across the entire team: partnering with engineers on CI/CD gates, with data scientists on quality metrics, and with ops on L2/L3 incident response.

Key Responsibilities

Design and own end-to-end MLOps workflows: training → evaluation → registry → deployment → monitoring → retraining/retirement in dev/staging/production environments, with clear standards and ownership boundaries.
Own the model registry and promotion lifecycle (MLflow): stage/alias strategy, approvals, environment separation, access control, and rollback readiness.
Establish reproducibility and lineage across the model lifecycle: versioned code/config, artifact traceability, dataset/version references, and repeatable evaluation runs.
Design and implement automated model quality gates for production, covering quality (for example, accuracy and latency), cost, and safety.
Define SLOs/SLIs for all AI features: training job success rate, inference latency p99, RAG retrieval accuracy, availability, cost metrics.
Build production monitoring dashboards: track model performance, data drift, operational health; integrate with alerting (PagerDuty, Slack, etc.).
Create on-call runbooks and triage procedures for AI service incidents; lead postmortem-driven improvements.
Instrument AI services for debugging: request traces, GPU metrics per-model, retrieval performance, communication bottlenecks.
Integrate evaluation frameworks (benchmarking, RAGAS, LLM-as-judge) into CI/CD pipelines.

Skills & Experience

5–8 years in MLOps / ML platform / production engineering roles with hands-on ownership of production ML delivery pipelines.
Deep understanding of ML lifecycle: model versioning, promotion strategies, evaluation automation, governance, deployment strategies, monitoring, drift detection.
Hands-on experience with MLflow Model Registry workflows (stages/aliases, approvals, traceability) and integrating registry actions into release pipelines.
Experience operationalizing model evaluation systems (metrics standards, orchestration, logging, reproducibility).
Strong observability and production fundamentals: metrics/logs/traces, alert design, incident response, and reliability mindset.
Familiarity with CI/CD pipelines, model packaging, and deployment automation; comfortable collaborating with ML engineers, platform/SRE, and application teams to turn requirements into robust workflows.
Understanding of distributed systems, resource management, and failure modes in training/inference environments.

Key Competencies

Production Ownership: comfortable owning services in production; proactive about monitoring, alerting, and preventing issues.
Reliability Engineering: can define SLOs and error budgets, and foster a blameless postmortem culture.
Cross-Functional Leadership: works with ML engineers, data scientists, and platform teams; unblocks reliably.
Incident Response: triage skills, root cause analysis, systemic thinking (not just fighting fires).
Programmatic Automation: reduces toil and makes the right path the easy path, balancing rigor with speed.
Communication: explains complex ML/systems issues clearly to both technical and non-technical stakeholders.

Success Metrics

Reproducible, auditable model release workflow becomes the default across teams (clear lineage and consistent promotion standards).
Automated evaluation gates prevent the majority of quality/performance regressions from reaching production.
Model registry and deployment practices support safe rollouts and fast rollbacks with minimal disruption.
Reliable AI services (SLO-driven): training/inference/RAG services consistently meet reliability targets and error budgets.
Faster detection and recovery: incident MTTD/MTTR improves over time, and repeat incident classes become less frequent.
Higher signal-to-noise alerting: fewer redundant alerts per true incident through correlation/deduplication improvements.
Operational automation maturity increases: more incident classes handled with consistent triage and safe automation.

Location & Reporting

Singapore or Australia (Launceston, TAS or Sydney, NSW)
Reporting to Head of AI & Applications

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering.

Apply now to be part of shaping the future of sustainable AI infrastructure.