Senior Site Reliability Engineer

1 hour ago

Omilia

AU · Full-time · A$160,000 – A$190,000

About this role

We are looking for a Senior Site Reliability Engineer with cloud platform experience. You will be part of a team responsible for operating and maintaining production clusters and developing our observability solutions, collaborating with team members to develop automation strategies and ensure overall platform reliability.

You will be the first responder for incidents, contributing to problem management and root cause analysis. You will design and evolve observability solutions using tools like Prometheus, Grafana, and ELK, participate in on-call rotations, and continuously improve alert quality and response processes.

Collaborate with engineering teams to embed reliability and performance into the software delivery lifecycle. Develop troubleshooting documentation and optimized runbooks for production support. Champion a culture of reliability, performance, and continuous improvement across teams.

Your goal will be to become an integral part of the team, making every challenge of the platform your own. Grow professionally through courses and training while working on cutting-edge technology products making a global impact. Enjoy Apple gear and collaborate with proficient, fun colleagues.

Requirements

Bachelor's Degree or MS in Engineering or equivalent.
Experience operating at least one container orchestration cluster (e.g., Kubernetes or Docker Swarm).
Experience developing or maintaining software for production services at scale.
Experience with ELK.
Experience with AWS.
Experience with Grafana/Prometheus stack.
Strong scripting skills (Bash, Python or Go).
Excellent communication skills.

Responsibilities

Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
Act as first responder for incidents, contributing to problem management and root cause analysis.
Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
Develop troubleshooting documentation for production support resources and collaborate on optimized runbooks.
Participate in on-call rotations and continuously improve alert quality and response processes.
Champion a culture of reliability, performance, and continuous improvement across teams.

Benefits

Fixed compensation.
Long-term employment with working days vacation.
Professional development (courses, training, etc.).
Being part of successful cutting-edge technology products that are making a global impact in the service industry.
Proficient and fun-to-work-with colleagues.
Apple gear.